There are many different ways of setting up an infrastructure that supports DataSHIELD, and this variety has come about for several reasons.
In this section we will look at some of the technical considerations and the options that are available for infrastructures supporting DataSHIELD.
Windows is not generally supported by any of the platforms that provide DataSHIELD functionality; Ubuntu or Red Hat Linux are preferred. In the past, the different software components needed to run DataSHIELD had to be installed individually onto a server or virtual machine. For example, a database is needed to hold the data, so this would have to be installed and then configured to work as required, and other dependencies such as Java might also need to be installed. This made set-up complex, because there was scope for variation in the software versions used and in how they were configured. These variations could either be left for each group to resolve as best they could, or addressed by providing very detailed documentation, with instructions like:
Using apt, you can install Java 8 via the following sequence of commands:
```
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
```
Maintaining this type of documentation takes a lot of effort, and it still leaves scope for errors when interpreting many pages of instructions.
An attempt to address this has been made through the use of containerisation technologies such as Docker and Kubernetes. These provide advantages, although not all institutions are willing to use them; this reluctance can be due to caution around adopting new ways of working or a lack of local expertise.
Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package. Furthermore, the interaction and configuration between all the containers needed to run DataSHIELD can be specified in a machine-readable file. As long as the host server is running a container platform such as Docker or Kubernetes, each group can reproduce a standard configuration.
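As an illustration, below is a minimal sketch of starting an Opal-style DataSHIELD stack with Docker. The image names, ports and environment variables (obiba/opal, obiba/rock, mongo, 8443, MONGO_HOST, ROCK_HOSTS) are assumptions based on a typical deployment and should be checked against the documentation of your chosen platform and version:

```bash
# Sketch only: container names, images, ports and environment variables are
# assumptions; consult your platform's documentation for the exact settings.
docker network create datashield                     # shared network for the stack

# Database container holding the server's internal data
docker run -d --name mongo --network datashield mongo:4.4

# R server container that executes the server-side DataSHIELD packages
docker run -d --name rock --network datashield obiba/rock:latest

# Web application that analysts connect to, exposed over HTTPS
docker run -d --name opal --network datashield \
  -p 8443:8443 \
  -e MONGO_HOST=mongo -e MONGO_PORT=27017 \
  -e ROCK_HOSTS=rock:8085 \
  obiba/opal:latest
```

Because the whole arrangement is captured in commands (or, more commonly, a Docker Compose or Kubernetes file), every participating group can start from exactly the same configuration.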
Not using containerisation makes it more challenging to keep a working set-up. As a specific example, it can be harder to upgrade the R components when a new version of R is released, because all the dependencies have to be reinstalled. With containerisation, the consortium technical lead can build a new, working container and distribute it to the participating groups so that they do not have to replicate the work.
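To make this concrete, a hedged sketch of such an upgrade is shown below, reusing the illustrative container and image names from the previous example; the exact commands depend on how the stack was deployed (plain Docker, Docker Compose, Kubernetes, etc.):

```bash
# Sketch: replace the R server container with a newly published image.
# Container and image names follow the earlier illustrative example.
docker pull obiba/rock:latest        # fetch the image prepared by the technical lead
docker stop rock && docker rm rock   # remove the old container
docker run -d --name rock --network datashield obiba/rock:latest
```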
Providing a ready-configured virtual machine is a more heavyweight approach (VM images can be large to move around) and does not help with upgrades: either the maintainer has to upgrade the software from the command line (leaving scope for errors) or the whole VM needs to be replaced, which requires data and configuration to be ported across.
There are several options for where the hardware that hosts the stack supporting DataSHIELD is situated: it can be hosted and managed locally by the institution, hosted locally but managed remotely by an external IT support team, or hosted completely remotely by a third party.
Note that within these broad definitions there is scope for variation. For example, the local institution might decide to use AWS infrastructure, which has benefits around cost and flexibility of capacity. This would have to be in line with the data's governance policy, as strictly speaking the data would be leaving the institution.
The pros and cons of these methods are summarised below:
| | Local | Local but managed remotely by an external IT support team | Completely remote, hosted by a third party |
|---|---|---|---|
| Control over own data | 100%, protected by physical location | 100%, protected by contract* required with the external party for data handling | |
| Involvement of own IT department and accompanying skills and time required | 100%. All problems have to be solved locally. | 50-50. Local IT department required for set-up, but day-to-day management is outsourced to a third party. A secure way for the third party to access the system (e.g. VPN) is needed. | |
| Capacity requirements in the IT infrastructure, e.g. for omics data | 100%. All capacity requirements have to be met locally. | All capacity requirements still have to be met locally. | |
| Data transfer involved? | No | No, but a contract* is required for data handling | |
| How payments work | Pay your IT department yourself | Easier to pay an external party with grant applications? | |
* One of the advantages of the DataSHIELD approach is that it reduces the need for data transfer/handling agreements, which often take a long time to prepare. Although in these scenarios an agreement is required, it is only needed between the institution providing the data and the institution providing the infrastructure. Once that agreement is in place, any new group wanting to use the data via DataSHIELD will not need one, so overall the need for these agreements is still significantly reduced.
In this section we discuss the decisions that need to be made around networking issues that consortia might encounter.
Previously we described the "client side" as being a computer local to the analyst, where they write code that gets sent to the "server side" which are the computers where the data are held. The "client side" computer could be the analyst's own personal computer with the appropriate DataSHIELD R packages installed, but many consortia choose to provide analysts with a log on to a "client side" computer that is available to all consortium analysts - frequently referred to as the "central analysis server".
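Whichever machine acts as the client side, it needs the DataSHIELD client R packages installed. Below is a hedged sketch of installing them from the command line; the repository URL and package set are assumptions and should be checked against the DataSHIELD documentation for the versions your consortium uses:

```bash
# Sketch: install the client-side DataSHIELD R packages on the analyst's machine.
# The repository URL and package names are assumptions; check the current
# DataSHIELD documentation for the packages your consortium requires.
Rscript -e 'install.packages(c("DSI", "DSOpal", "dsBaseClient"),
                             repos = c("https://cran.datashield.org",
                                       "https://cloud.r-project.org"))'
```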
There are two main reasons for providing a central analysis server. Firstly, the analyst doesn't have to set up their own computer to run the client-side DataSHIELD packages. The second reason relates to the fact that the "server side" computer hosting the data has to accept inbound connections from the "client side" computer requesting the analysis. If every user has their own personal "client side" computer, the data owner either has to manage a list of IP addresses to permit through the firewall (which is difficult to maintain, as someone's IP address might change if they decide they want to work in Starbucks) or open the firewall to inbound connections from the whole internet (and bad people will try to get in). Some institutions, particularly if they are a hospital or located within a hospital, do not like opening their firewalls to inbound connections from the internet as this presents a security risk. Having a single, well maintained and protected "client side" computer for all users that connects to the server holding their data is more attractive, because the firewall only needs to be opened to a single IP address and the security risk is much reduced.
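As an illustration, below is a minimal sketch of restricting access with ufw on an Ubuntu data server; the IP address is a placeholder for the central analysis server, and the port the platform listens on (8443 here) is an assumption:

```bash
# Sketch: only allow inbound connections from the central analysis server.
# 203.0.113.10 is a placeholder IP address and 8443 an assumed server port.
sudo ufw default deny incoming
sudo ufw allow from 203.0.113.10 to any port 8443 proto tcp
sudo ufw enable
```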
That said, it means that someone has to provide and maintain the central analysis server, which might not be possible in the long term. Some institutions may also be confident enough in their security measures to open up to the internet anyway, for example if they have a Unified Threat Management (UTM) system and an additional perimeter firewall.
The MOLGENIS team have provided a template for a central analysis server that is based on JupyterHub.
When you connect securely to a website, the connection is encrypted using the site's SSL certificate. The certificate also verifies the site's identity: without it, an intruder could place their computer between you and the website and pretend to be the website in order to steal your credentials. Therefore servers hosting DataSHIELD should also use an SSL certificate. This illustrates a scenario where a group looking to set up a server will need to liaise with their institution-level IT team, because using an SSL certificate requires a correctly registered web address. This means your server must be reachable at something like "datashield.myuni.ac.uk" (a "fully qualified domain name") rather than an IP address like "187.234.123.11", and registering that address will need the input of the institution-level IT team. Additionally there may be a cost associated with the certificate.
It is worth noting that Coral allows a certificate to be generated automatically and for free using the Let's Encrypt service. The server will still need to be registered at a proper web address rather than an IP address, which may need work from institutional IT teams.
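For platforms that do not automate this, below is a hedged sketch of obtaining a free Let's Encrypt certificate with certbot, using the example domain from above; the exact method (standalone, webroot, or via a reverse proxy) depends on how the server is set up:

```bash
# Sketch: obtain a free Let's Encrypt certificate for the server's registered
# web address. The server must already be reachable at that name for the
# validation step to succeed.
sudo apt-get install certbot
sudo certbot certonly --standalone -d datashield.myuni.ac.uk
```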
Traditionally a username and password have been used to allow a user to authenticate themselves for each system they access. More recent models aim to simplify this for the user and improve security for system owners. The downside of this is that these arrangements can be more challenging to understand and implement. In this section we explore some of the authentication options.
In large consortia, a user could potentially end up with a lot of different usernames and passwords for each data source they wish to work with. This can be difficult to manage. Therefore it could be desirable to make use of a central authentication service.
Two-factor authentication adds an additional layer of security by sending a code to a user's device that they must enter in addition to their password.
Personal Access Tokens offer another way for users to authenticate, and they bring several advantages.
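As a hedged illustration, the sketch below shows a token being supplied in place of a username and password when logging in from the client side; the server name, URL and token are placeholders, and the exact argument names should be checked against the DSI/DSOpal documentation for the versions in use:

```bash
# Sketch: client-side login with a Personal Access Token instead of a password.
# Server name, URL and token value are placeholders.
Rscript -e '
  library(DSI)
  library(DSOpal)
  builder <- DSI::newDSLoginBuilder()
  builder$append(server = "study1",
                 url    = "https://datashield.myuni.ac.uk",
                 token  = "PASTE-PERSONAL-ACCESS-TOKEN-HERE",
                 driver = "OpalDriver")
  connections <- DSI::datashield.login(logins = builder$build())
'
```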
Reverse proxy recommended?
Modsecurity?
To assist in a scenario where a server hosting data suffers a problem, it is recommended to back up the data. Exactly how to do this is platform specific.
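As one hedged illustration for a Docker-based deployment, the sketch below archives a named data volume to a dated file; the volume name is a placeholder, and the appropriate method (database dump, volume snapshot, etc.) depends on the platform in use:

```bash
# Sketch: back up a Docker volume holding the server's data to a dated archive.
# "opal_data" is a placeholder volume name; adapt to your own deployment.
docker run --rm \
  -v opal_data:/data:ro \
  -v "$(pwd)/backups:/backup" \
  alpine tar czf "/backup/opal_data-$(date +%F).tar.gz" -C /data .
```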
What data and what are you doing to it? See "resources"
At the moment only Coral offers monitoring. Opal is working on it.