The key point is that DataSHIELD provides the bridge between the analyst and the data. The analyst is in one location and the data are held at other, separate locations. At the analyst's location (often called the "client side"), the analyst writes commands in the DataSHIELD language, which is hosted in an R session local to the analyst. These DataSHIELD commands are passed to the servers at the data locations. At the data locations (often called the "server side"), the DataSHIELD commands received from the analyst are run on the data in a separate, isolated R session local to the data.

The DataSHIELD commands contain built-in safeguards designed to stop individual pieces of data being returned to the analyst; instead they return useful summary information. For example, a mean can only be calculated on more than 5 values, to reduce the chances of an individual data item being revealed. Only commands that have been built into DataSHIELD and that have these safeguards can be run. Arbitrary R commands ("return row 6 of the data set") simply cannot be run in DataSHIELD.

When the DataSHIELD command has been run at each location, the results (or an error message if something went wrong) are returned to the analyst. More complex commands may involve several back-and-forth communications between client and server, and additionally the aggregation of results across the data sets.
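To make this flow concrete, here is a minimal client-side sketch of a DataSHIELD analysis using the DSI, DSOpal and dsBaseClient packages. It assumes the data are hosted behind Opal servers, and the server URLs, credentials and table names are hypothetical; it illustrates the pattern rather than being a recipe for any particular consortium.

```r
# A minimal client-side sketch (hypothetical server URLs, credentials and
# table names). The analyst never sees individual records; only the
# aggregate results of approved DataSHIELD functions are returned.
library(DSI)          # connection management
library(DSOpal)       # driver for Opal data servers
library(dsBaseClient) # client-side DataSHIELD functions

builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal.study1.example.org",
               user = "analyst", password = "********",
               table = "project.cohort", driver = "OpalDriver")
builder$append(server = "study2", url = "https://opal.study2.example.org",
               user = "analyst", password = "********",
               table = "project.cohort", driver = "OpalDriver")

# Start an isolated server-side R session at each study and assign the
# data to the symbol D in that session
connections <- DSI::datashield.login(logins = builder$build(),
                                     assign = TRUE, symbol = "D")

# Run an approved aggregate function; each server returns only summary
# statistics, which are combined on the client side
ds.mean("D$age", type = "combine", datasources = connections)

DSI::datashield.logout(connections)
```

Each server executes the command in its own isolated R session and returns only summary statistics; the client then combines these into an overall result.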
It is important to stress that DataSHIELD functions are designed to enhance privacy, but they do not guarantee it. A determined attacker can always exploit weaknesses. There is a balance to strike between allowing data to be used for good and the restrictions placed around its usage. We tend to hold the view that most researchers are not wilfully malicious (they have a reputation to uphold), and as such we aim for "reasonable" measures of restriction.
This is a good question. Many would consider DataSHIELD to be a matching pair of R packages, one that the analyst uses and the other that sits with the data. That is only part of the story, as DataSHIELD needs something to provide the R environment where the packages are installed, and to inject the data into that environment. Another component is therefore needed to play the part of data warehouse / R orchestrator, and there are several options available. We will discuss these later.
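As a rough sketch of how the pieces fit together, the snippet below assumes the commonly used stack: DSI, DSOpal and dsBaseClient on the client side, dsBase on the server side, and a data warehouse such as Opal or Armadillo hosting the server-side R environment. The repository URL is the DataSHIELD package repository as typically given in DataSHIELD install instructions.

```r
# A sketch of where each component lives.

# 1. Client side: installed where the analyst works
install.packages("DSI")     # connection / driver layer
install.packages("DSOpal")  # driver for an Opal back end
install.packages("dsBaseClient",
                 repos = c(getOption("repos"), "https://cran.datashield.org"))

# 2. Server side: installed by the data custodian in the R environment
#    that sits next to the data (usually via the data warehouse's
#    administration interface rather than by hand)
install.packages("dsBase",
                 repos = c(getOption("repos"), "https://cran.datashield.org"))

# 3. The data warehouse / R orchestrator (for example Opal or Armadillo)
#    hosts that server-side R environment, loads the data into it, and
#    brokers the analyst's requests. It is a separate server application,
#    not an R package, so it is not installed from R.
```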
The most common use case for DataSHIELD is where a group of researchers have a shared research goal and hold data that can address it, but they do not want to deposit the data centrally or email an analysis plan around, and they do not have existing Trusted Research Environments (TREs) that allow others to reach in and work with their data. This is not the only use case, but it is the most frequent.
This diagram, taken from this paper, summarises some of the steps in the process that were used in the LifeCycle project.
Additional parts of the process are:
Each of these steps represents a large amount of work. The initial activation cost is such that this approach is best suited to a group that intends to embark on a long-term series of publications, as subsequent analyses require less effort once the project is established.
Optionally the consortium might wish to build a catalogue and curate the metadata describing their datasets. This can be helpful (some would argue it is vital) for understanding what future research can be done and how data can be harmonised.
In this paper by Fortier and colleagues, these issues are also discussed with a particular focus on the harmonisation of the data.
Getting the project running requires a collaborative, interdisciplinary team of experts, orchestrated by a consortium manager:
- Developers / statisticians / data scientists are required if new analysis functionality needs to be built into DataSHIELD. Recall that each command or function that can be run in DataSHIELD has to be explicitly enabled and adapted to reduce the possibility of data leakage (see the sketch below); there is more detail on how new functionality is built later on. They also support the researchers in running their analyses in a federated setting.
- A consortium manager or coordinator is needed to assemble the team and make sure it works in a coherent way. If possible, they should also ensure the sustainability of the arrangements, because otherwise the high set-up costs are wasted if the consortium fades away after the initial funding expires.
- Consortium technical specialists are needed to guide the IT staff and technical experts who set up the infrastructure to hold each data set. A large part of this guide is designed to help them.
- Ethics and governance experts are needed to help explain this novel way of working to those who are more used to authorising data releases or access to Trusted Research Environments (TREs). Hopefully they can demonstrate that the safeguards designed to reduce the chances of individual data items leaving the hosting institution mean that a lighter-weight data access agreement is needed, rather than a more bureaucratic data transfer agreement.

Some of these roles that are relevant to the technical work warrant further explanation and are covered in more detail in the People and Organisation section.
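To give a flavour of what "explicitly enabled and adapted" means in practice, here is a sketch of a server-side aggregate function with a built-in disclosure check. The function name, threshold and return values are illustrative only; real dsBase functions follow a similar pattern but read their thresholds from the server's disclosure settings.

```r
# A hypothetical server-side aggregate function, sketched to show the shape
# of the disclosure safeguards that developers must build in.
safeMeanDS <- function(x, nfilter = 5) {
  x <- x[!is.na(x)]
  # Refuse to answer if there are too few observations, so that the
  # returned summary cannot reveal an individual value
  if (length(x) <= nfilter) {
    stop("FAILED: fewer observations than the disclosure threshold allows",
         call. = FALSE)
  }
  # Return only aggregate quantities, never the data themselves
  list(mean = mean(x), validN = length(x))
}
```

A function like this would then be registered with the data warehouse as an "aggregate" method, and a matching client-side function would call it and combine the per-study summaries.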
DataSHIELD is different from Trusted Research Environments (TREs), which allow users to log in to an isolated platform to work on each agreed data set. Typically these are run by an institution, and once a user has logged in they have full access to a particular data set using any analysis tool. This is an advantage for the user: they can use familiar statistics packages, and it is easier to work with data you can see without restriction (DataSHIELD is much more limited in what you can do, and you can't see the data). To protect the data, the user cannot remove results by email, FTP or cut and paste, but can take results out through an airlock where they are manually checked by a data custodian before being released. This manual checking process can be resource intensive (expensive) and subject to human error (data could be leaked). When an analysis uses different data sets held by different institutions, the user has to go through an approvals process for each TRE, run the analysis in that TRE and extract the results. They then have to combine the results using meta-analysis to get an overall answer. There is no opportunity to analyse the data as if it were all pooled in a single location, which might be preferable to a meta-analysis approach.
Another class of systems allows the execution of pre-agreed analyses on each data set; examples include Vantage6. These suffer from the problem that each party has to approve every analysis, which can take time and effort. If a mistake is later found in the code, or the results prompt a follow-up question, the whole process starts again. Synthetic data can help resolve potential issues before the pipeline is run, reducing the scope for errors and for having to go through the approvals process again.
Other privacy-enhancing technologies, such as multiparty homomorphic encryption (MHE) and differential privacy (DP), are not covered in detail here; see the forthcoming landscape review.
Source: https://docs.google.com/presentation/d/1mgf14W4Eo7OMIUvf21EowaI4OQ7o-G4b/edit#slide=id.p20
A more detailed, tool-by-tool comparison (for example Bitfount versus DataSHIELD versus Vantage6) could also be made, but is beyond the scope of this section.