This tutorial explains how to produce summary statistics (e.g. means, counts) using the core DataSHIELD
functionality and additional functions from the dsHelper package.
Two important functions allow you to view the dimensions and column names of a data frame:
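Assuming the two functions meant here are `ds.dim` and `ds.colnames` from dsBaseClient, and that a data frame named `iris` has already been assigned on the server (as in the examples that follow), a minimal sketch would be:

```r
# Assumes an active DataSHIELD login and a server-side data frame called "iris"
library(dsBaseClient)

ds.dim(x = "iris")       # number of rows and columns
ds.colnames(x = "iris")  # column names
```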
We can view the mean and variance of a variable using:
ds.mean(x = "iris$Sepal.Length")
ds.var(x = "iris$Sepal.Length")
Summaries of categorical variables can be retrieved using the ds.table function:
ds.table("iris$Species")
The function ds.summary is analogous to base R's summary and will return more concise summary
statistics based on the variable type:
ds.summary("iris$Sepal.Length")
ds.summary("iris$Species")
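For comparison, the base R behaviour that ds.summary is described as analogous to can be seen in a plain local R session. This sketch uses the `iris` dataset that ships with R and makes no DataSHIELD calls; it only illustrates how the output depends on the variable type.

```r
# Local base R only (not a DataSHIELD call); iris is built into R
num_summary <- summary(iris$Sepal.Length)  # numeric: min, quartiles, median, mean, max
fac_summary <- summary(iris$Species)       # factor: number of rows per level

print(num_summary)
print(fac_summary)
```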
Whilst core DataSHIELD functions return all the information you need, sometimes many lines of
code can be required, and the output can be quite messy. To help with this, the dsHelper package
allows you to do some common operations in a more streamlined way, and return neater results.
For example, using dsHelper you can summarise several variables within a data frame in a single
call:
dh.getStats(
df = "iris",
vars = c("Sepal.Length", "Sepal.Width", "Species"))
This returns a list of two tibbles, one for continuous variables and one for categorical
variables. For categorical variables it reports counts (n), percentages and missingness within
each category; for continuous variables it reports the mean, standard deviation, quantiles and
missingness.
An important point to note in DataSHIELD is that these results can be assigned to an object
within your local R session. This is possible because the results do not disclose individual-level
data. By saving them to local R objects we can reuse them (e.g. to make tables and graphs for
publications):
my_stats <- dh.getStats(
df = "iris",
vars = c("Sepal.Length", "Sepal.Width", "Species"))
my_stats
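As one possible way to reuse the saved results, the sketch below writes each element of `my_stats` to its own CSV file. It assumes `my_stats` exists from the call above and that the returned list is named; the output file names are made up for illustration.

```r
# Assumes my_stats was created as shown above and is a named list of tibbles
names(my_stats)  # check which elements were returned

for (nm in names(my_stats)) {
  # Hypothetical file names, e.g. "<element name>_stats.csv"
  write.csv(my_stats[[nm]], file = paste0(nm, "_stats.csv"), row.names = FALSE)
}
```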