The key purpose of DataSHIELD is to allow users to interrogate data for genuine research results while protecting the data from disclosure. We define “disclosure” to mean the unwanted leakage of data from a data warehouse, where the leakage could have damaging effects such as compromising an individual’s privacy or causing a loss of intellectual property.
Examples of disclosure include a user maliciously:
Authors of DataSHIELD packages should align with this aim by building in features to protect the data from disclosure. This will protect the reputation of DataSHIELD as a practical and reliable way for users to conduct research without data disclosure.
To benefit the wider user community and to reduce duplication of effort, the hope is that packages can be made available for use by others. We would like to index as many packages as possible in a central catalogue. For other researchers to use packages, data custodians must be satisfied that packages apply “reasonable” safeguards. We describe the safeguards as “reasonable” rather than “perfect” for two reasons:
To meet this “reasonable” level, packages should undergo a peer review by a member of the DataSHIELD community, for example a member of the statistical development working group. This process is not meant to be a large burden for either party, but an acknowledgement that some sanity checks should be made in addition to the implicit trust between parties.
The process could be initiated through the DataSHIELD Forum. Requests for an audit could be made in the Developer Support category, or in a new separate category. This allows a package developer and auditor to make contact.
We suggest that the package developer provides a brief written statement covering this checklist:
These could then be discussed briefly with the reviewer before the package is approved and listed as such on the DataSHIELD package catalogue.
dsSurvival provides survival analysis in DataSHIELD. It is based on the survival package in native R. The key functionalities are:
● Fitting Cox proportional hazards models
● Returning summaries of models for meta-analysis
● Model diagnostics
The main results returned by the package are summaries of the Cox models and the diagnostics of these models. The ds.Surv() function manipulates data into the format required for the Cox model to be fitted. The resulting object would present a disclosure risk if someone could access individual rows, but this is protected by standard dsBase functions.
Before a model is fitted, the package checks that the number of parameters is not too large compared to the number of data points. This avoids the model being oversaturated, which could lead to disclosure. The check uses the nfilter.glm parameter and offers protection when there are a small number of individuals in the data set. When the model has been fitted (using the native coxph function), only the summary() of the model is returned, as this does not contain residuals.
The cox.zph functionality is provided so that users can test whether their models satisfy the proportional hazards assumption. Again, this repackages the cox.zph function from the survival package. We remove the x, y and time elements before returning the result to the user, as these contain the raw data.
Lastly, the vcov function returns a variance-covariance matrix from a fitted Cox model. If the model has been fitted, then it has already passed the oversaturation test, so no further checks are required.
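The two safeguards described above can be illustrated with a minimal sketch in base R. This is not the dsSurvival source code: the helper names is_model_saturated and strip_disclosive are hypothetical, the default threshold of 0.33 for nfilter.glm is an assumption, and fake_zph is a made-up stand-in for a diagnostics object.

```r
# Hypothetical helper (not part of dsSurvival): flags a model as oversaturated
# when the number of parameters exceeds a fraction of the number of observations.
# The default of 0.33 for nfilter_glm is an assumption made for this sketch.
is_model_saturated <- function(n_params, n_obs, nfilter_glm = 0.33) {
  n_params > nfilter_glm * n_obs
}

# Hypothetical helper: drops list elements that could carry individual-level
# data before an object is returned to the client.
strip_disclosive <- function(obj, fields = c("x", "y", "time")) {
  obj <- unclass(obj)                  # treat the object as a plain list
  obj[setdiff(names(obj), fields)]     # keep everything except the listed fields
}

# A small data set with many parameters fails the check; a larger one passes.
small_study_blocked <- is_model_saturated(n_params = 5, n_obs = 12)
large_study_blocked <- is_model_saturated(n_params = 5, n_obs = 200)

# Made-up stand-in for a diagnostics result holding raw data in x, y and time.
fake_zph <- list(table = matrix(0, nrow = 1, ncol = 3),
                 x = 1:3, y = 4:6, time = 7:9)
safe_zph <- strip_disclosive(fake_zph)
```

In this sketch only the table element survives in safe_zph, which mirrors the idea of returning aggregate diagnostics while withholding the elements that hold raw data.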