Prerequisites
You will need to install RStudio Desktop on your own machine.
Info
This DataSHIELD user tutorial was initially prepared for the DataSHIELD conference user workshop on the 11th October 2023, and is regularly updated.
You can download the R Markdown notebook for this tutorial. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk, or by placing your cursor inside it and pressing Ctrl+Shift+Enter. Alternatively, you can copy and paste the code blocks below directly into R/RStudio.
First, check whether we have the right R packages installed to run DataSHIELD. Using the very helpful devtools package, we'll run the session_info() function:
devtools::session_info()
We are missing some of the necessary packages: DSI, DSOpal and dsBaseClient. Install them:
install.packages('DSI')
install.packages('DSOpal')
install.packages('dsBaseClient', repos='http://cran.obiba.org', type='source')
Remember to load them into this R session using the library() function:
library(DSI)
library(DSOpal)
library(dsBaseClient)
Check that they have now been added:
devtools::session_info()
The login script has to be customized to fit the data you are trying to connect to.
The "builder <-" and "builder$append" functions are standard features.
For this demonstration we are connecting to simulated data- but if it was data of real people, it would be very important for us not to be able to see individual patients' information.
Let's log in to the Opal demo server as a study administrator to see what data are available (username = administrator, password = password).
Select the CNSIM synthetic dataset, which comprises three data tables: CNSIM1, CNSIM2 and CNSIM3. If you click on CNSIM3 you can navigate through all the variables available in that table, including metadata; e.g. if you click on LAB_TSC and then the Summary tab, the metadata are displayed.
We will use the simulated dataset CNSIM, in a data.frame with 4128 observations of 11 harmonized variables. The CNSIM dataset contains synthetic data based on a model derived from the participants of the 1958 Birth Cohort, as part of the obesity methodological development project. This dataset does contain some NA values.
For the ease of this workshop, we'll imagine that each study is hosted by a different partner of DataSHIELD: the first by UMCG Groningen (where EUCAN-Connect is based), the second by Liverpool University (where the DataSHIELD core team is based) and the third by Barcelona (where ISGlobal is based). The code below creates a local R object with the login details for each study:
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org", user = "dsuser", password = "P@ssw0rd", table = "CNSIM.CNSIM1", driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)')
builder$append(server = "study2", url = "https://opal-demo.obiba.org", user = "dsuser", password = "P@ssw0rd", table = "CNSIM.CNSIM2", driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)')
builder$append(server = "study3", url = "https://opal-demo.obiba.org", user = "dsuser", password = "P@ssw0rd", table = "CNSIM.CNSIM3", driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)')
logindata <- builder$build()
Note the ssl_verifyhost=0 and ssl_verifypeer=0 options in the login script: they disable SSL certificate verification, which may otherwise produce error messages. In production you would use fully verified HTTPS connections, but there is no need here (this is a self-contained demo environment, with no risks).
Now we need to connect, referring to the login information in the data frame we have just created:
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
The 'assign' argument can be set to either TRUE or FALSE. If set to TRUE, all the available variables within the table are assigned to a serverside data frame and are available to access. If you only need a small subset of the variables, it can be preferable to set it to FALSE and later use the function datashield.assign to assign only the variables you need. The output includes useful progress bars showing the connection to each study in turn.
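As a hedged sketch of that selective approach (the table name and variable selection below are illustrative; see the DSI documentation for datashield.assign), it might look like:
# Illustrative only: assign just two variables from study1's table,
# assuming we had logged in with assign = FALSE
DSI::datashield.assign(connections$study1, symbol = "D",
                       value = "CNSIM.CNSIM1",
                       variables = list("LAB_HDL", "PM_BMI_CONTINUOUS"))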
We can see the serverside object that has been created by running:
ds.ls()
Here you see one dataframe in each study called 'D' (this name was set using the 'symbol' argument in datashield.login above).
Here you can check the metadata of what you are studying, to get ideas about which commands you may want to run on which variables.
A list of all the functions available in dsBaseClient can be displayed, many of which are for exploratory analysis.
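One quick way to see that list in your own session (a base R sketch, assuming dsBaseClient has been attached with library() as above):
# List the clientside functions exported by dsBaseClient
ls("package:dsBaseClient")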
Let's try out a few of these:
ds.dim(x="D", datasources= connections)
ds.colnames(x="D", datasources= connections)
LAB_HDL is a measure of HDL cholesterol (a.k.a. the "good" cholesterol). Let's run some summary statistic commands:
ds.class(x='D$LAB_HDL', datasources = connections)
ds.length(x='D$LAB_HDL', datasources = connections)
ds.mean(x='D$LAB_HDL', datasources = connections)
What does this other function for obtaining the mean do? Let's use the DataSHIELD function help documentation:
?ds.quantileMean
Now let's put into action some of what we've learned about the function's arguments:
ds.quantileMean(x='D$LAB_HDL', datasources = connections)
ds.quantileMean(x='D$LAB_HDL',type = "split", datasources = connections)
Now let's calculate the variance of HDL cholesterol:
?ds.var
ds.var(x = "D$LAB_HDL", type = "split", datasources = connections)
Can we store the results calculated from a DataSHIELD analysis in a local R session?
Yes: the output of an aggregate function is always an R object, so it can be stored.
a<-ds.var(x = "D$LAB_HDL", type = "split", datasources = connections)[[1]]
a
b<-ds.var(x = "D$LAB_HDL", type = "split", datasources = connections)[[1]][[1,1]]
b
We use square brackets because we are accessing elements of a list, which is the type of R object that DataSHIELD aggregate functions return.
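To see the whole structure of such a returned list in native R, a quick sketch (the object name out is just illustrative):
# Store the full output of an aggregate function and inspect its structure
out <- ds.var(x = "D$LAB_HDL", type = "split", datasources = connections)
str(out)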
Assign-type functions are ones where a calculation is done on the data stored at the server; the result of that calculation is "assigned" to a serverside variable and saved there, but is NOT transmitted back to the user.
The reason for this is that some calculations could be highly disclosive: if such results were transmitted to the analyst, then with very little effort (applying the inverse function) the analyst could work out exactly what the raw data are, and the data's privacy would be broken!
To demonstrate:
ds.ls(datasources = connections)
ds.log(x='D$LAB_HDL', newobj='LAB_HDL_log', datasources = connections)
ds.ls(datasources = connections)
ds.mean(x="LAB_HDL_log",datasources= connections)
ds.mean(x="D$LAB_HDL",datasources= connections)
The second "ds.mean" shows that the mean of the logged values are definitely smaller, by about the right amount. The DataSHIELD log function has worked.
There is another DataSHIELD assign function that can be used for data transformations: a square root function. Let's test it in the same way:
ds.sqrt(x='D$LAB_HDL', newobj='LAB_HDL_sqrt', datasources = connections)
ds.ls(datasources = connections)
ds.mean(x="LAB_HDL_sqrt",datasources= connections)
ds.mean(x="D$LAB_HDL",datasources= connections)
These new objects are not attached to a dataframe.
Use the help function to find out about the ds.dataFrame function, which can be used to combine objects.
Now join "LAB_HDL_sqrt" and "LAB_HDL_log" to the dataframe "D".
ds.dataFrame(c("D", "LAB_HDL_sqrt", "LAB_HDL_log"), newobj = "D")
ds.colnames("D")
Using some of the functions above, explore the distribution of the variable "PM_BMI_CATEGORICAL" in dataframe "D". One possible approach is sketched below.
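A possible solution sketch, reusing functions already introduced:
# Check the class first, then tabulate the categories
ds.class(x = "D$PM_BMI_CATEGORICAL", datasources = connections)
ds.table(rvar = "D$PM_BMI_CATEGORICAL", datasources = connections)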
In DataSHIELD there is one function that allows sub-setting of data, ds.dataFrameSubset .
You may wish to use it to:
Subset a column of data by its "Class"
Subset a dataframe to remove any "NA"s
Subset a numeric column of a dataframe using a Boolean inequality
# first find the column name you wish to refer to
ds.colnames(x="D")
# then check which levels you need to apply a Boolean operator to:
ds.levels(x="D$GENDER")
?ds.dataFrameSubset
Splitting into GENDER groups, assigned to different server-side objects.
ds.dataFrameSubset(df.name = "D", V1.name = "D$GENDER", V2.name = "1", Boolean.operator = "==", newobj = "CNSIM.subset.Males", datasources= connections)
ds.dataFrameSubset(df.name = "D", V1.name = "D$GENDER", V2.name = "0", Boolean.operator = "==", newobj = "CNSIM.subset.Females",datasources= connections)
Now there are two serverside objects which have split GENDER by class, to which we have assigned the names "CNSIM.subset.Males" and "CNSIM.subset.Females".
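We can quickly check the new objects; a brief sketch using a function introduced above:
# Check the dimensions of the two new serverside subsets
ds.dim(x = "CNSIM.subset.Males", datasources = connections)
ds.dim(x = "CNSIM.subset.Females", datasources = connections)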
To subset a dataframe so as to remove any NAs, use ds.completeCases:
ds.completeCases(x1="D",newobj="D_without_NA", datasources=connections)
Say we want a subset of patients whose BMI values are ≥ 25, called subset.BMI.25.plus:
ds.dataFrameSubset(df.name = "D",
V1.name = "D$PM_BMI_CONTINUOUS",
V2.name = "25",
Boolean.operator = ">=",
newobj = "subset.BMI.25.plus",
datasources = connections)
Checking we have successfully created such an object, using quantiles and histograms:
ds.quantileMean(x="subset.BMI.25.plus$PM_BMI_CONTINUOUS", type = "split", datasources= connections)
ds.histogram(x="subset.BMI.25.plus$PM_BMI_CONTINUOUS", datasources = connections)
If we want to create a subset based on multiple conditions we can use the ds.Boole function before subsetting. For example, let's say that we want to create a subset of individuals where BMI values are ≥ 25 and adjusted glucose is lower than 6.
ds.Boole(
V1 = "D$PM_BMI_CONTINUOUS",
V2 = "25",
Boolean.operator = ">=",
numeric.output = TRUE,
newobj = "BMI.25.plus",
datasources = connections)
ds.Boole(
V1 = "D$LAB_GLUC_ADJUSTED",
V2 = "6",
Boolean.operator = "<",
numeric.output = TRUE,
newobj = "GLUC.6.less",
datasources = connections)
We can then use the ds.make function to make a new categorical variable which combines these groups:
?ds.make
ds.make(toAssign = "BMI.25.plus+GLUC.6.less",
newobj = "BMI.25.plus_GLUC.6.less",
datasources = connections)
# If BMI >= 25 and glucose < 6, then BMI.25.plus_GLUC.6.less=2
# If BMI >= 25 and glucose >= 6, then BMI.25.plus_GLUC.6.less=1
# If BMI < 25 and glucose < 6, then BMI.25.plus_GLUC.6.less=1
# If BMI < 25 and glucose >= 6, then BMI.25.plus_GLUC.6.less=0
ds.table(rvar= "BMI.25.plus_GLUC.6.less",
datasources = connections)
ds.dataFrame(x=c("D", "BMI.25.plus_GLUC.6.less"), newobj = "D2")
ds.colnames("D2")
ds.dataFrameSubset(df.name = "D2",
V1.name = "D2$BMI.25.plus_GLUC.6.less",
V2.name = "2",
Boolean.operator = "==",
newobj = "subset2",
datasources = connections)
ds.dim("subset2")
Visualising the data we are studying is extremely important for getting a sense of it. While visualisation may seem disclosive at first glance, only graphs that are definitively non-disclosive have been implemented within the DataSHIELD project.
Firstly, histograms give a good sense of how one variable is distributed. No individual points are disclosed, because values are "binned" into groups of similar magnitude, disguising what each value actually is. Privacy is further protected by removing bins with low counts (below a specified threshold), so if you have a symmetric distribution you may find that values at the extreme ends are not shown.
Let's create a histogram of the variable we've been investigating for much of this study: HDL Cholesterol ("LAB_HDL").
?ds.histogram
ds.histogram(x='D$LAB_HDL', datasources = connections)
Use ds.histogram to explore the distribution of "D$PM_BMI_CONTINUOUS". A possible solution is sketched below.
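A minimal sketch, mirroring the call above:
# Histogram of continuous BMI across the studies
ds.histogram(x = "D$PM_BMI_CONTINUOUS", datasources = connections)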
When you generate a scatter plot, the data points displayed are not the actual values. The function gives you a choice of how to anonymise: either the values are perturbed with additional random noise, or each point is replaced by the average of its k nearest neighbours. (For more details on how anonymisation methods are used to generate privacy-preserving visualisations, see https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-020-00257-4.)
ds.scatterPlot(x="D$LAB_HDL", y="D$PM_BMI_CONTINUOUS", datasources = connections)
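A hedged sketch of selecting the anonymisation explicitly; the argument names method and noise are taken from the ds.scatterPlot help page, so check ?ds.scatterPlot in your version:
# Probabilistic anonymisation: perturb the plotted values with random noise
ds.scatterPlot(x = "D$LAB_HDL", y = "D$PM_BMI_CONTINUOUS",
               method = "probabilistic", noise = 0.25,
               datasources = connections)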
Other DataSHIELD graphical functions allow the creation of box plots, heatmap plots and contour plots. Investigate them using their help functions:
?ds.heatmapPlot
?ds.contourPlot
?ds.boxPlot
We want to examine the relationship between BMI and HDL cholesterol:
ds.cor(x='D$PM_BMI_CONTINUOUS', y='D$LAB_HDL')
Regress HDL cholesterol on BMI using the Individual Participant Data (IPD) approach:
The method for this (ds.glm) is a "pooled analysis", equivalent to placing the individual-level data from all sources in one warehouse.
It is important to note that the link function is by default the canonical link function for each family: binomial uses the logistic link, poisson the log link, and gaussian the identity link.
ds.glm(formula = "D$LAB_HDL~D$PM_BMI_CONTINUOUS", family="gaussian", datasources = connections)
Regress HDL cholesterol on BMI using the Study-Level Meta-Analysis (SLMA) approach:
ds.glmSLMA(formula = "D$LAB_HDL~D$PM_BMI_CONTINUOUS", family="gaussian", newobj = "workshop.obj", datasources = connections)
For the SLMA approach we can assign the predicted values at each study:
ds.glmPredict(glmname = "workshop.obj", newobj = "workshop.prediction.obj", datasources = connections)
ds.length("workshop.prediction.obj$fit", datasources=connections)
ds.length("D$LAB_HDL", datasources=connections)
The lengths differ because predicted values are only generated for complete cases. To compare like with like, build a complete-case version of the variables:
ds.cbind(c('D$LAB_HDL', 'D$PM_BMI_CONTINUOUS'), newobj='vars')
ds.completeCases('vars', newobj='vars.complete')
ds.dim('vars.complete')
Let's plot the best linear fit on a scatter plot:
df1 <- ds.scatterPlot('D$PM_BMI_CONTINUOUS', "D$LAB_HDL", datasources = connections, return.coords = TRUE)
df2 <- ds.scatterPlot('vars.complete$PM_BMI_CONTINUOUS', "workshop.prediction.obj$fit", datasources = connections, return.coords = TRUE)
# then in native R
par(mfrow=c(2,2))
plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='Body Mass Index', ylab='HDL Cholesterol', main='Study 1')
lines(as.data.frame(df2[[1]][[1]])$x,as.data.frame(df2[[1]][[1]])$y, col='red')
plot(as.data.frame(df1[[1]][[2]])$x,as.data.frame(df1[[1]][[2]])$y, xlab='Body Mass Index', ylab='HDL Cholesterol', main='Study 2')
lines(as.data.frame(df2[[1]][[2]])$x,as.data.frame(df2[[1]][[2]])$y, col='red')
plot(as.data.frame(df1[[1]][[3]])$x,as.data.frame(df1[[1]][[3]])$y, xlab='Body Mass Index', ylab='HDL Cholesterol', main='Study 3')
lines(as.data.frame(df2[[1]][[3]])$x,as.data.frame(df2[[1]][[3]])$y, col='red')
For the SLMA approach we can also create the predicted values and the residuals at each study using the ds.make function:
glmslma <- ds.glmSLMA(formula = "vars.complete$LAB_HDL~vars.complete$PM_BMI_CONTINUOUS", family="gaussian", newobj = "workshop.obj", datasources = connections)
ds.make(toAssign=paste0("(",glmslma$SLMA.pooled.ests.matrix[1,1],")+(", glmslma$SLMA.pooled.ests.matrix[2,1],"*vars.complete$PM_BMI_CONTINUOUS)"),
newobj = "predicted.values")
ds.make(toAssign = "vars.complete$LAB_HDL - predicted.values",
newobj = "residuals")
# and you can use those to run regression plot diagnostics
ds.scatterPlot('predicted.values', "residuals", datasources = connections)
ds.histogram("residuals", datasources = connections)
We want to examine the relationship between BMI and diabetes.
Examine the distribution of the variable "DIS_DIAB" in all cohorts using 'ds.table':
ds.table("D$DIS_DIAB")
Check the class of "DIS_DIAB":
ds.class("D$DIS_DIAB")
Examine the association between BMI and diabetes:
glmSLMA_mod2<-ds.glmSLMA(formula="D$DIS_DIAB~D$PM_BMI_CONTINUOUS", family='binomial')
Save the effect estimates and standard errors as new objects:
estimates <- c(glmSLMA_mod2$betamatrix.valid[2,])
se <- c(glmSLMA_mod2$sematrix.valid[2,])
Meta-analyse the results using the rma function from the metafor package to obtain study weights:
# rma (and forest, used below) come from the metafor package
library(metafor)
res <- rma(estimates, sei=se)
We can produce a simple forest plot from the output:
forest(res, atransf=exp)
We can also add more information to forest plots:
study_names <- c("study 1", "study 2", "study 3")
weights <- c(paste0(formatC(weights(res), format="f", digits=1, width=4), "%"))
forest(res, atransf=exp,
xlab="Crude Odds Ratio", refline=log(1), xlim=c(-0.25,0.5), at=log(c(0.95, 1, 1.1, 1.2, 1.3)),
slab=cbind(paste0(study_names, " (", paste0(weights, ")"))), mlab="RE model")
text(0.5, 4.5, pos=2, "Odds Ratio [95% CI]")
text(-0.25, 4.5, pos=4, "Study (weight)")
It is also possible to model multiple explanatory variables and include interactions:
glm_mod1<-ds.glm(formula="D$DIS_DIAB~D$PM_BMI_CONTINUOUS+D$LAB_HDL*D$GENDER", family='binomial', datasources = connections)
The "*" between LAB_HDL and GENDER means fit all possible main effects and interactions between the two covariates.
Compare with the results of a study-level meta-analysis:
glmSLMA_mod2<-ds.glmSLMA(formula="D$DIS_DIAB~D$PM_BMI_CONTINUOUS+D$LAB_HDL*D$GENDER", family='binomial')
Now compare outputs:
glm_mod1$coefficients
glmSLMA_mod2$SLMA.pooled.ests.matrix
The results are similar; the remaining differences are accounted for by the different techniques employed.
You can save your workspace:
DSI::datashield.workspace_save(conns = connections, ws = "workspace2023")
Don't forget to log out, using:
DSI::datashield.logout(connections)
You can restore your workspace the next time you want to continue with your analysis:
connections <- datashield.login(logins = logindata, assign = TRUE, symbol = "D")
ds.ls()
datashield.logout(connections)
connections <- datashield.login(logins = logindata, restore = "workspace2023")
ds.ls()
You can also delete unwanted workspaces using datashield.workspace_rm, as sketched below.
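A minimal sketch, assuming the workspace name saved earlier in this tutorial:
# Remove the previously saved workspace from the server
DSI::datashield.workspace_rm(conns = connections, ws = "workspace2023")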
In RStudio Server: DON'T forget to use the orange "quit the current R session" button (top right of the browser window) before closing the tab; otherwise you will get an error message the next time you try to log in.
As you may have noticed, some operations that are straightforward in native R can be more complicated in DataSHIELD. To help with this, the dsHelper package lets you do some common operations in fewer lines of code. dsHelper is an entirely clientside package: it contains only clientside functions, which call DataSHIELD functions that operate serverside.
Here are a couple of examples:
dh.dropCols(
df = "D",
vars = c("PM_BMI_CONTINUOUS", "GENDER"),
type = "keep",
new_obj = "df_subset")
ds.colnames("df_subset")
dh.renameVars(
df = "D",
current_names = c("PM_BMI_CONTINUOUS", "GENDER"),
new_names = c("bmi", "sex"),
new_obj = "df_rename")
ds.colnames("df_rename")
DataSHIELD has a range of functions to retrieve statistics, but it is limited in that (i) you need to use different functions for different statistics, and (ii) you can only get statistics for one variable at a time. dh.getStats returns many useful statistics in a tibble, and lets you retrieve statistics for multiple variables at a time.
neat_stats <- dh.getStats(
df = "D",
vars = c("LAB_TSC", "LAB_TRIG", "LAB_HDL", "LAB_GLUC_ADJUSTED",
"PM_BMI_CONTINUOUS", "DIS_CVA", "MEDI_LPD", "DIS_DIAB", "DIS_AMI",
"GENDER", "PM_BMI_CATEGORICAL"))
neat_stats
Here you see this has returned a list of two tibbles, separated into categorical and continuous information. For the categorical variables, information is returned on counts, percentages and missingness within each category; for continuous variables, information is returned on the mean, standard deviation, quantiles and missingness.
There are many more dsHelper functions designed to make common operations easier in datashield, check out the vignettes at: https://lifecycle-project.github.io/ds-helper/
The dsMediationClient package includes functions to perform causal mediation analysis. The current version of the package (v0.0.4) allows study-level analysis with four different approaches: the simulation-based (a wrapper to functions of the mediation package from native R), the weighting-based and the imputation-based (wrappers to functions of the medflex package from native R), and the regression-based (a wrapper to functions of the regmedint package from native R).
Here is an example of the weighting-based approach:
Let's first connect to another set of data, specifying that we will access the 'mediation' profile, which includes the dsMediation package:
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org", user = "dsuser", password = "P@ssw0rd", table = "MEDIATION.UPBdata1", driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)', profile='mediation')
builder$append(server = "study2", url = "https://opal-demo.obiba.org", user = "dsuser", password = "P@ssw0rd", table = "MEDIATION.UPBdata2", driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)', profile='mediation')
builder$append(server = "study3", url = "https://opal-demo.obiba.org", user = "dsuser", password = "P@ssw0rd", table = "MEDIATION.UPBdata3", driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)', profile='mediation')
logindata <- builder$build()
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
We also need to install the dsMediationClient package, if it is not already installed:
devtools::install_github('datashield/dsMediationClient')
library(dsMediationClient)
To apply the weighting-based approach, we first run a study-level regression in which the mediator variable (negaff) is regressed on the exposure (attbin) and the covariates:
ds.glmSLMA(formula = 'negaff ~ attbin + gender + educ + age',
family = 'gaussian', dataName = 'D', newobj = 'med.fit')
We then expand the data and fit a natural effect model on the expanded data:
ds.neWeight(object = 'med.fit', newobj = 'expData')
ds.colnames('expData')
ds.neModel(formula = 'UPB ~ attbin0 + attbin1 + gender + educ + age',
family = 'binomial', se = 'robust', expData = 'expData',
newobj = 'med.out', datasources = connections)
We can finally extract the relevant causal parameter estimates:
ds.neEffdecomp(model = 'med.out', datasources = connections)
The ds.mice function creates multiple imputations (replacement values) for multivariate missing data. The function, which is a wrapper to the mice (multiple imputation by chained equations) function from native R, was released with dsBaseClient version 6.3.
Here is an example of how the function can be used:
We first apply a single initial imputation with the default methods:
initialImp <- ds.mice(data='D', m=1, method=NULL, post=NULL, predictorMatrix = NULL,
datasources=connections, newobj_df = 'impSet', seed = 'fixed')
We can then take the output 'method', 'predictorMatrix' and 'post' objects, and modify them as needed.
method <- initialImp$study1$method
method
method["LAB_TRIG"] <- "norm"
method["PM_BMI_CATEGORICAL"] <- "polr"
method
predictorMatrix <- initialImp$study1$predictorMatrix
predictorMatrix
ds.table("D$DIS_CVA")
predictorMatrix[,"LAB_GLUC_ADJUSTED"] <- 0
predictorMatrix
post <- initialImp$study1$post
post
post["PM_BMI_CONTINUOUS"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(15,35))"
post
We can then use the modified objects as inputs in a second call of the function, where we perform multiple imputations:
newImp <- ds.mice(data='D', m=3, maxit=10, method=method, post=post,
predictorMatrix=predictorMatrix, datasources=connections,
newobj_df='imp_new', newobj_mids='mids_new', seed='fixed')
ds.ls()
Let's compare some statistics of the original and the imputed data
ds.numNA('D$PM_BMI_CONTINUOUS', datasources=connections)
ds.numNA('imp_new.1$PM_BMI_CONTINUOUS', datasources=connections)
ds.numNA('imp_new.2$PM_BMI_CONTINUOUS', datasources=connections)
ds.numNA('imp_new.3$PM_BMI_CONTINUOUS', datasources=connections)
ds.mean('D$LAB_HDL', datasources=connections)
ds.mean('imp_new.1$LAB_HDL', datasources=connections)
ds.mean('imp_new.2$LAB_HDL', datasources=connections)
ds.mean('imp_new.3$LAB_HDL', datasources=connections)
There is a website for a book that serves as official documentation for OmicSHIELD. It can be found here:
https://isglobal-brge.github.io/OmicSHIELD/
On it you will find introductory references for learning about DataSHIELD and "resources", explanations of the types of analysis that can be performed using OmicSHIELD, and workflows (with reproducible code) for its main functionalities. Use cases in which OmicSHIELD is applied to real datasets illustrate the capabilities of the software for omic analyses (GWAS, transcriptomics and EWAS).