This tutorial explains how to produce summary statistics (e.g., means, counts) using core DataSHIELD functionality and additional functions from the dsHelper package.
library(dsBaseClient)
# install.packages("remotes")
# remotes::install_github("timcadman/ds-helper")
library(dsHelper)
# install.packages("tibble")
library(tibble)
# install.packages("readr")
library(readr)
Two important functions allow you to view the dimensions and column names of a data frame.
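A minimal sketch of this step, assuming the two functions meant here are `ds.dim` and `ds.colnames` from dsBaseClient, that you are already logged in to the servers, and that the data frame is assigned server-side as `cnsim` (the name used later in this tutorial):

```r
library(dsBaseClient)

# Assumes an active DataSHIELD login and a server-side data frame named "cnsim".

# Dimensions (rows, columns) of the data frame on each server
ds.dim("cnsim")

# Column names of the data frame on each server
ds.colnames("cnsim")
```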
We can view the mean and variance of a variable using:
ds.mean("depression$age")
## $Mean.by.Study
## EstimatedMean Nmissing Nvalid Ntotal
## server1 58.97533 0 570 570
## server2 58.95321 0 570 570
##
## $Nstudies
## [1] 2
##
## $ValidityMessage
## ValidityMessage
## server1 "VALID ANALYSIS"
## server2 "VALID ANALYSIS"
ds.var("depression$age")
## $Variance.by.Study
## EstimatedVar Nmissing Nvalid Ntotal
## server1 1.0607515 0 570 570
## server2 0.9333989 0 570 570
##
## $Nstudies
## [1] 2
##
## $ValidityMessage
## ValidityMessage
## server1 "VALID ANALYSIS"
## server2 "VALID ANALYSIS"
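Because each study reports its own `Nvalid`, the per-study estimates above can also be pooled by hand. The following is a sketch in base R using the values printed above. The variable names are illustrative, and the formulas are the standard ones for combining group means and variances, not a description of DataSHIELD's internals.

```r
# Per-study estimates copied from the ds.mean() and ds.var() output above
n     <- c(server1 = 570, server2 = 570)
means <- c(server1 = 58.97533, server2 = 58.95321)
vars  <- c(server1 = 1.0607515, server2 = 0.9333989)

# Pooled mean: study means weighted by their number of valid observations
pooled_mean <- sum(n * means) / sum(n)

# Pooled variance: within-study plus between-study sums of squares
ss_within  <- sum((n - 1) * vars)
ss_between <- sum(n * (means - pooled_mean)^2)
pooled_var <- (ss_within + ss_between) / (sum(n) - 1)

round(c(mean = pooled_mean, variance = pooled_var), 4)
```

In practice this is rarely needed: `ds.mean` and `ds.var` accept a `type` argument (for example `type = "combine"`) that returns the pooled estimate directly.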
Summaries of categorical variables can be retrieved using the ds.table function:
ds.table("cnsim$GENDER")
##
## Data in all studies were valid
##
## Study 1 : No errors reported from this study
## Study 2 : No errors reported from this study
## $output.list
## $output.list$TABLE_rvar.by.study_row.props
## study
## cnsim$GENDER server1 server2
## 0 0.4079193 0.5920807
## 1 0.4160839 0.5839161
## NA NaN NaN
##
## $output.list$TABLE_rvar.by.study_col.props
## study
## cnsim$GENDER server1 server2
## 0 0.5048544 0.5132772
## 1 0.4951456 0.4867228
## NA 0.0000000 0.0000000
##
## $output.list$TABLE_rvar.by.study_counts
## study
## cnsim$GENDER server1 server2
## 0 1092 1585
## 1 1071 1503
## NA 0 0
##
## $output.list$TABLES.COMBINED_all.sources_proportions
## cnsim$GENDER
## 0 1 NA
## 0.51 0.49 0.00
##
## $output.list$TABLES.COMBINED_all.sources_counts
## cnsim$GENDER
## 0 1 NA
## 2677 2574 0
##
##
## $validity.message
## [1] "Data in all studies were valid"
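To see how the proportion tables relate to the counts, the sketch below recomputes them in base R from the counts printed above. The object names are illustrative, and the empty `NA` category is left out for brevity.

```r
# Counts copied from the TABLE_rvar.by.study_counts output above
counts <- matrix(
  c(1092, 1071, 1585, 1503),
  nrow = 2,
  dimnames = list(GENDER = c("0", "1"), study = c("server1", "server2"))
)

# Column proportions: share of each category within a study
col_props <- prop.table(counts, margin = 2)

# Row proportions: share of each study within a category
row_props <- prop.table(counts, margin = 1)

# Combined proportions across both studies
combined_props <- rowSums(counts) / sum(counts)

round(col_props, 4)
round(row_props, 4)
round(combined_props, 2)
```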
The function ds.summary is analogous to base R's summary() and returns concise summary statistics based on the variable type:
ds.summary("depression$age")
## $server1
## $server1$class
## [1] "numeric"
##
## $server1$length
## [1] 570
##
## $server1$`quantiles & mean`
## 5% 10% 25% 50% 75% 90% 95% Mean
## 57.41522 57.67204 58.15366 59.03113 59.60176 60.36994 60.68646 58.97533
##
##
## $server2
## $server2$class
## [1] "numeric"
##
## $server2$length
## [1] 570
##
## $server2$`quantiles & mean`
## 5% 10% 25% 50% 75% 90% 95% Mean
## 57.38526 57.75520 58.28263 58.94060 59.57365 60.08917 60.62116 58.95321
ds.summary("cnsim$GENDER")
## $server1
## $server1$class
## [1] "factor"
##
## $server1$length
## [1] 2163
##
## $server1$categories
## [1] "0" "1"
##
## $server1$`count of '0'`
## [1] 1092
##
## $server1$`count of '1'`
## [1] 1071
##
##
## $server2
## $server2$class
## [1] "factor"
##
## $server2$length
## [1] 3088
##
## $server2$categories
## [1] "0" "1"
##
## $server2$`count of '0'`
## [1] 1585
##
## $server2$`count of '1'`
## [1] 1503
DataSHIELD will not return summary statistics that are potentially disclosive. For example, under the default settings, counts of categorical variables are not returned if any cell count is less than 3:
ds.summary("cnsim$DIS_AMI")
## $server1
## [1] "INVALID object!"
##
## $server2
## [1] "INVALID object!"
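The idea behind this check can be sketched in base R. The helper `has_small_cells` below is hypothetical and is not part of DataSHIELD. The toy counts are invented, and treating empty cells as safe is an assumption of this sketch, not a statement about the server-side rule.

```r
# Hypothetical helper, not part of DataSHIELD: flags a table as potentially
# disclosive when any non-empty cell has fewer than `min_cell` observations.
# Treating empty cells as safe is an assumption of this sketch.
has_small_cells <- function(counts, min_cell = 3) {
  any(counts > 0 & counts < min_cell)
}

# Toy counts, invented for illustration
safe_table  <- c(`0` = 120, `1` = 45)
risky_table <- c(`0` = 163, `1` = 2)

has_small_cells(safe_table)   # FALSE
has_small_cells(risky_table)  # TRUE
```

On a live connection, `ds.listDisclosureSettings()` should report the thresholds each server actually applies.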
Whilst core DataSHIELD functions return all the information you need, they can require many lines of code and the output can be messy. To help with this, the dsHelper package lets you perform common operations in a more streamlined way and returns neater results.
For example, using dsHelper you can summarise multiple variables within a data frame:
cnsim_stats <- dh.getStats("cnsim")
cnsim_stats
## $categorical
## # A tibble: 48 × 10
## variable cohort category value cohort_n valid_n missing_n perc_valid
## <chr> <chr> <fct> <int> <int> <int> <int> <dbl>
## 1 DIS_CVA combined 0 5248 5251 5251 0 99.9
## 2 DIS_CVA combined 1 3 5251 5251 0 0.06
## 3 DIS_CVA combined <NA> 0 5251 NA NA NA
## 4 DIS_DIAB combined 0 5174 5251 5251 0 98.5
## 5 DIS_DIAB combined 1 77 5251 5251 0 1.47
## 6 DIS_DIAB combined <NA> 0 5251 NA NA NA
## 7 GENDER combined 0 2677 5251 5251 0 51.0
## 8 GENDER combined 1 2574 5251 5251 0 49.0
## 9 GENDER combined <NA> 0 5251 NA NA NA
## 10 MEDI_LPD combined 0 5144 5251 5251 0 98.0
## # ℹ 38 more rows
## # ℹ 2 more variables: perc_missing <dbl>, perc_total <dbl>
##
## $continuous
## # A tibble: 18 × 15
## variable cohort mean std.dev perc_5 perc_10 perc_25 perc_50 perc_75
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 id serve… 6.12 1.38 4.09 4.53 5.24 6.07 6.87
## 2 id serve… 6.1 1.37 4.1 4.55 5.21 6.03 6.85
## 3 LAB_GLUC… serve… 1.57 0.41 0.88 1.05 1.3 1.58 1.84
## 4 LAB_GLUC… serve… 1.56 0.42 0.85 1.03 1.29 1.56 1.84
## 5 LAB_HDL serve… 2.1 1.58 -0.36 0.17 1.04 2.11 3.12
## 6 LAB_HDL serve… 2.05 1.58 -0.55 0.01 1 2.05 3.06
## 7 LAB_TRIG serve… 5.87 1.11 4.09 4.49 5.11 5.86 6.59
## 8 LAB_TRIG serve… 5.85 1.07 4.14 4.51 5.15 5.82 6.51
## 9 LAB_TSC serve… 27.4 5.02 19.5 21.1 24.1 27.3 30.7
## 10 LAB_TSC serve… 27.5 4.9 19.4 21.3 24.2 27.4 30.7
## 11 PM_BMI_C… serve… 1082 625. 109. 217. 542. 1082 1622.
## 12 PM_BMI_C… serve… NA NA NA NA NA NA NA
## 13 LAB_GLUC… combi… 1.56 1.58 0.86 1.04 1.3 1.57 1.84
## 14 LAB_HDL combi… 2.07 1.09 -0.47 0.08 1.02 2.08 3.09
## 15 LAB_TRIG combi… 5.86 4.95 4.12 4.5 5.13 5.83 6.54
## 16 LAB_TSC combi… 27.4 625. 19.5 21.2 24.2 27.4 30.7
## 17 PM_BMI_C… combi… 1082 1.38 109. 217. 542. 1082 1622.
## 18 id combi… 6.11 0.42 4.1 4.54 5.22 6.05 6.86
## # ℹ 6 more variables: perc_90 <dbl>, perc_95 <dbl>, valid_n <dbl>,
## # cohort_n <dbl>, missing_n <dbl>, missing_perc <dbl>
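Because the result is an ordinary R list held in your local session, its elements can be subset like any other data frame. The sketch below uses a cut-down stand-in, `stats_example`, built by hand from a few rows of the output above. The object name and the reduced set of columns are illustrative.

```r
# A cut-down stand-in for the dh.getStats() result, with a few rows
# copied by hand from the output above (the real object holds tibbles
# with more rows and columns)
stats_example <- list(
  categorical = data.frame(
    variable = c("GENDER", "GENDER"),
    cohort   = c("combined", "combined"),
    category = c("0", "1"),
    value    = c(2677L, 2574L)
  ),
  continuous = data.frame(
    variable = c("LAB_HDL", "LAB_HDL", "LAB_HDL"),
    cohort   = c("server1", "server2", "combined"),
    mean     = c(2.10, 2.05, 2.07)
  )
)

# Pick one element of the list, then filter rows as with any data frame
continuous   <- stats_example$continuous
combined_hdl <- continuous[continuous$cohort == "combined", ]
combined_hdl

# Counts per category for a single variable
categorical <- stats_example$categorical
categorical[categorical$variable == "GENDER", c("category", "value")]
```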
You can then use dh.createTableOne to present these statistics as a publication-ready table:
var_labels <- tibble(
variable = c(
"LAB_TSC", "LAB_TRIG", "LAB_HDL", "LAB_GLUC_ADJUSTED",
"PM_BMI_CONTINUOUS", "DIS_CVA", "MEDI_LPD", "DIS_DIAB",
"GENDER", "PM_BMI_CATEGORICAL"
),
var_label = c(
"Total cholesterol", "Triglycerides", "HDL cholesterol", "Adjusted glucose",
"BMI (continuous)", "Stroke", "Lipid medication", "Diabetes",
"Gender", "BMI (categorical)"
)
)
cat_labels <- tibble(
variable = c(
"DIS_CVA", "DIS_CVA",
"DIS_DIAB", "DIS_DIAB",
"GENDER", "GENDER",
"MEDI_LPD", "MEDI_LPD",
"PM_BMI_CATEGORICAL", "PM_BMI_CATEGORICAL", "PM_BMI_CATEGORICAL"
),
category = c(
"0", "1",
"0", "1",
"0", "1",
"0", "1",
"1", "2", "3"
),
cat_label = c(
"No stroke", "Stroke",
"No diabetes", "Diabetes",
"Male", "Female",
"No lipid medication", "Lipid medication",
"Normal weight", "Overweight", "Obese"
)
)
dh.createTableOne(
stats = cnsim_stats,
vars = var_labels$variable,
var_labs = var_labels,
cat_labs = cat_labels,
type = "both",
cont_format = "mean_sd",
inc_missing = TRUE,
perc_denom = "total"
)
## # A tibble: 26 × 5
## variable category server1 server2 combined
## <chr> <chr> <chr> <chr> <chr>
## 1 Total cholesterol Mean ± SD 27.4 ± 5.02 27.5 ± 4.9 27.4 ± 625
## 2 Total cholesterol <NA> 97 (4.48) 150 (4.86) 3088 (58.8)
## 3 Triglycerides Mean ± SD 5.87 ± 1.11 5.85 ± 1.07 5.86 ± 4.95
## 4 Triglycerides <NA> 356 (16.5) 549 (17.8) 247 (4.7)
## 5 HDL cholesterol Mean ± SD 2.1 ± 1.58 2.05 ± 1.58 2.07 ± 1.09
## 6 HDL cholesterol <NA> 362 (16.7) 562 (18.2) 905 (17.2)
## 7 Adjusted glucose Mean ± SD 1.57 ± 0.41 1.56 ± 0.42 1.56 ± 1.58
## 8 Adjusted glucose <NA> 360 (16.6) 555 (18) 924 (17.6)
## 9 BMI (continuous) Mean ± SD 1080 ± 625 NA ± NA 1080 ± 1.38
## 10 BMI (continuous) <NA> 0 (0) 3088 (100) 846 (16.1)
## # ℹ 16 more rows
The object returned by dh.getStats is a list of two tibbles, one holding categorical and one holding continuous information.
For categorical variables, information is returned on counts, percentages, and missingness within each category.
For continuous variables, information is returned on mean, standard deviation, quantiles, and missingness.
Returning summary statistics for repeated measures data is more complicated as it involves grouping participants. A wrapper function in the dsHelper package manages this:
rm_stats <- dh.getRmStats(
"depression",
outcome = "hads_total",
id = "ID",
age = "time")
rm_stats
## # A tibble: 3 × 11
## cohort min_age max_age n_obs n_obs_na n_participants n_meas_5 n_meas_med
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 serve… 1.85 43.4 570 0 278 1 1
## 2 serve… 1.67 42.7 570 0 283 1 2
## 3 combi… 1.76 43.0 1140 0 561 1 1.50
## # ℹ 3 more variables: n_meas_95 <dbl>, n_participants_total <int>,
## # n_participants_na <dbl>
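To illustrate what grouping by participant involves, the sketch below computes similar quantities locally on a toy long-format data frame. The data are invented, and the reading of the column names (for example `n_meas_med` as the median number of measurements per participant) is an assumption. This is not how dh.getRmStats is implemented, since the real function works on server-side data without exposing individual rows.

```r
# Toy long-format data, invented for illustration: one row per measurement
toy <- data.frame(
  ID         = c(1, 1, 1, 2, 3, 3),
  time       = c(2.0, 10.5, 20.1, 5.3, 7.8, 30.2),
  hads_total = c(4, 6, 5, 9, 3, 7)
)

# Number of non-missing measurements contributed by each participant
meas_per_id <- tapply(!is.na(toy$hads_total), toy$ID, sum)

summary_stats <- c(
  min_age        = min(toy$time),
  max_age        = max(toy$time),
  n_obs          = sum(!is.na(toy$hads_total)),
  n_participants = length(meas_per_id),
  n_meas_med     = median(meas_per_id)
)
summary_stats
```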
Note that the results of these DataSHIELD functions can be assigned to objects in your local R session, because they are summary statistics and do not disclose individual-level data. Saving them as local R objects lets you reuse them (e.g., to make tables and graphs for publications).
table_one <- dh.createTableOne(
stats = cnsim_stats,
vars = var_labels$variable,
var_labs = var_labels,
cat_labs = cat_labels,
type = "both",
cont_format = "mean_sd",
inc_missing = TRUE,
perc_denom = "total"
)
You can then export this, e.g. as a .csv file:
write_csv(table_one, "table_one.csv")
datashield.logout(conns)