This tutorial demonstrates how to perform data manipulation using the core functionality within the dsBaseClient package.
library(dsBaseClient)
To create new objects using DataSHIELD, you use the function ds.assign()
. Note using functions with dsBaseClient
, it is not possible to directly create a new column in an existing dataframe - you first need to create a separate object, and join it back into the data frame. For example, to convert the variable Age
in the titanic dataset from months to years, you can do:
ds.assign("titanic$Age*12", "age_months")
ds.summary("age_months")
## $server1
## $server1$class
## [1] "numeric"
##
## $server1$length
## [1] 444
##
## $server1$`quantiles & mean`
## 5% 10% 25% 50% 75% 90% 95% Mean
## 36.0000 168.0000 240.0000 336.0000 432.0000 552.0000 661.2000 343.9244
##
##
## $server2
## $server2$class
## [1] "numeric"
##
## $server2$length
## [1] 447
##
## $server2$`quantiles & mean`
## 5% 10% 25% 50% 75% 90% 95% Mean
## 60.0000 163.2000 252.0000 360.0000 468.0000 604.8000 684.0000 368.8545
If you want to rejoin this variable to the original dataset, you can use:
ds.cbind(x = c("titanic", "age_months"), newobj = "titanic")
## $is.object.created
## [1] "A data object <titanic> has been created in all specified data sources"
##
## $validity.check
## [1] "<titanic> appears valid in all sources"
ds.colnames("titanic")
## $server1
## [1] "entity_id" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked" "age_months"
##
## $server2
## [1] "Survived" "Pclass" "Name" "Sex" "Age"
## [6] "SibSp" "Parch" "Ticket" "Fare" "Cabin"
## [11] "Embarked" "age_months"
If you would prefer to create new columns within an existing dataset, you can use the ds.mutate()
function within the dsTidyverse
package.
To change the class of variables you can use equivalent functions to base R. For example, in the titanic dataset sex is currently a character variable, but it would be more convenient to change this to a factor. We can do:
ds.colnames("titanic")
## $server1
## [1] "entity_id" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked" "age_months"
##
## $server2
## [1] "Survived" "Pclass" "Name" "Sex" "Age"
## [6] "SibSp" "Parch" "Ticket" "Fare" "Cabin"
## [11] "Embarked" "age_months"
ds.asFactor("titanic$Sex", "sex_factor")
## $all.unique.levels
## [1] "female" "male"
##
## $return.message
## [1] "Data object <sex_factor> correctly created in all specified data sources"
ds.table("sex_factor")
##
## Data in all studies were valid
##
## Study 1 : No errors reported from this study
## Study 2 : No errors reported from this study
## $output.list
## $output.list$TABLE_rvar.by.study_row.props
## study
## sex_factor server1 server2
## female 0.5445860 0.4554140
## male 0.4731369 0.5268631
## NA NaN NaN
##
## $output.list$TABLE_rvar.by.study_col.props
## study
## sex_factor server1 server2
## female 0.3851351 0.3199105
## male 0.6148649 0.6800895
## NA 0.0000000 0.0000000
##
## $output.list$TABLE_rvar.by.study_counts
## study
## sex_factor server1 server2
## female 171 143
## male 273 304
## NA 0 0
##
## $output.list$TABLES.COMBINED_all.sources_proportions
## sex_factor
## female male NA
## 0.352 0.648 0.000
##
## $output.list$TABLES.COMBINED_all.sources_counts
## sex_factor
## female male NA
## 314 577 0
##
##
## $validity.message
## [1] "Data in all studies were valid"
See also ds.asInteger
, ds.asDataMatrix
, ds.asDataFrame
, ds.asCharacter
and ds.as_tibble
.
The function ds.merge
allows you to join datasets using a left join, right join, inner join or outer join. The type of join is controlled by the arguments all.x
and all.y
. To demonstrate we first create two copies of the survival dataset with different columns (here we use the package dsTidyverse
for ease), and then perform a left join, merging on id
:
ds.select("survival", list(id:survtime), newobj = "survival_1")
ds.select("survival", list(id, cens:bmi.26), newobj = "survival_2")
ds.merge(
x.name = "survival_1",
y.name = "survival_2",
by.x.names = "id",
by.y.names = "id",
all.x = T,
all.y = F,
newobj = "merged_df"
)
## $is.object.created
## [1] "A data object <merged_df> has been created in all specified data sources"
##
## $validity.check
## [1] "<merged_df> appears valid in all sources"
Within dsBaseClient
there is no single function to rename a variable. As a workaround you can create a copy of the variable with a new name and then join back in the new variable.
ds.assign("titanic$Fare", "ticket_fare")
ds.cbind(x = c("titanic", "ticket_fare"), newobj = "titanic")
## $is.object.created
## [1] "A data object <titanic> has been created in all specified data sources"
##
## $validity.check
## [1] "<titanic> appears valid in all sources"
See ds.rename()
for a more efficient solution if you have access to the dsTidyverse
package.
Removing rows based on a condition can be done using ds.dataFrameSubset
. Note that you can only specify one condition, so if you need to subset on multiple conditions you need to perform the operation multiple times with diferent conditions.
ds.dim("titanic")
## $`dimensions of titanic in server1`
## [1] 444 14
##
## $`dimensions of titanic in server2`
## [1] 447 13
##
## $`dimensions of titanic in combined studies`
## [1] 891 14
ds.dataFrameSubset(
df.name = "titanic",
V1.name = "titanic$Survived",
V2.name = "1",
Boolean.operator = "==",
newobj = "titanic_survivors"
)
## $is.object.created
## [1] "A data object <titanic_survivors> has been created in all specified data sources"
##
## $validity.check
## [1] "<titanic_survivors> appears valid in all sources"
ds.dim("titanic_survivors")
## $`dimensions of titanic_survivors in server1`
## [1] 173 14
##
## $`dimensions of titanic_survivors in server2`
## [1] 169 13
##
## $`dimensions of titanic_survivors in combined studies`
## [1] 342 14
See ds.filter()
for a more flexible option if you have access to the dsTidyverse
package.
To remove columns from a dataframe, you can use ds.dataFrameSubset
. Again, the process is a little convoluted. Because the function was primarily designed to subset rows (ie filter a dataset), you need to provide a filter condition. The easiest way to do this is to create a vector of an identical number for all participants, and then subset by this vector (keeping all participants), and using the additional argument to select columns. A further limitation is that it is only possible to select columns by their number (not their name), so you need to check the order of the columns in each dataframe and specify this. Here is the workflow:
ds.dim("titanic")
## $`dimensions of titanic in server1`
## [1] 444 14
##
## $`dimensions of titanic in server2`
## [1] 447 13
##
## $`dimensions of titanic in combined studies`
## [1] 891 14
ds.rep(
x1 = "1",
times = 444,
each = 1,
newobj = "ones",
source.x1 = "clientside",
source.times = "c",
source.length.out = "c",
source.each = "c",
datasources = conns["server1"])
## $is.object.created
## [1] "A data object <ones> has been created in all specified data sources"
##
## $validity.check
## [1] "<ones> appears valid in all sources"
ds.rep(
x1 = "1",
times = 447,
each = 1,
newobj = "ones",
source.x1 = "clientside",
source.times = "c",
source.length.out = "c",
source.each = "c",
datasources = conns["server2"])
## $is.object.created
## [1] "A data object <ones> has been created in all specified data sources"
##
## $validity.check
## [1] "<ones> appears valid in all sources"
ds.colnames("titanic")
## $server1
## [1] "entity_id" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked" "age_months" "ticket_fare"
##
## $server2
## [1] "Survived" "Pclass" "Name" "Sex" "Age"
## [6] "SibSp" "Parch" "Ticket" "Fare" "Cabin"
## [11] "Embarked" "age_months" "ticket_fare"
ds.dataFrameSubset(
df.name = "titanic",
V1.name = "ones",
V2.name = "ones",
Boolean.operator = "==",
keep.cols = c(2, 3, 4),
newobj = "titanic_subset",
datasources = conns["server1"]
)
## $is.object.created
## [1] "A data object <titanic_subset> has been created in all specified data sources"
##
## $validity.check
## [1] "<titanic_subset> appears valid in all sources"
ds.dataFrameSubset(
df.name = "titanic",
V1.name = "ones",
V2.name = "ones",
Boolean.operator = "==",
keep.cols = c(1, 2, 3),
newobj = "titanic_subset",
datasources = conns["server2"]
)
## $is.object.created
## [1] "A data object <titanic_subset> has been created in all specified data sources"
##
## $validity.check
## [1] "<titanic_subset> appears valid in all sources"
ds.colnames("titanic_subset")
## $server1
## [1] "Survived" "Pclass" "Name"
##
## $server2
## [1] "Survived" "Pclass" "Name"
See ds.select()
for a more efficient option if you have access to the dsTidyverse
package.
Creating a new variable based on a condition can be performed using ds.Boole
. This function returns 1 if a condition is met, and 0 if not. For example, we can create a categorical variable stating whether variable Fare
is >= or less than £50:
ds.Boole(
V1 = "titanic$Fare",
V2 = "50",
Boolean.operator = ">=",
newobj = "fare_bin"
)
## $is.object.created
## [1] "A data object <fare_bin> has been created in all specified data sources"
##
## $validity.check
## [1] "<fare_bin> appears valid in all sources"
ds.table("fare_bin")
##
## Data in all studies were valid
##
## Study 1 : No errors reported from this study
## Study 2 : No errors reported from this study
## $output.list
## $output.list$TABLE_rvar.by.study_row.props
## study
## fare_bin server1 server2
## 0 0.4972603 0.5027397
## 1 0.5031056 0.4968944
## NA NaN NaN
##
## $output.list$TABLE_rvar.by.study_col.props
## study
## fare_bin server1 server2
## 0 0.8175676 0.8210291
## 1 0.1824324 0.1789709
## NA 0.0000000 0.0000000
##
## $output.list$TABLE_rvar.by.study_counts
## study
## fare_bin server1 server2
## 0 363 367
## 1 81 80
## NA 0 0
##
## $output.list$TABLES.COMBINED_all.sources_proportions
## fare_bin
## 0 1 NA
## 0.819 0.181 0.000
##
## $output.list$TABLES.COMBINED_all.sources_counts
## fare_bin
## 0 1 NA
## 730 161 0
##
##
## $validity.message
## [1] "Data in all studies were valid"
See ds.if_else()
for a more flexible option if you have access to the dsTidyverse
package.
To transform a variable based on multiple conditions, you need to make multiple calls to ds.Boole()
. As an example, we can create a 3-level categorical variable indicating whether the fare paid was <30, >=30 and less than 70, or >= 70. To do this we first use ds.Boole
to evaluate whether the second and third condition are met, and then ds.assign()
to create the categorical variable.
ds.Boole(
V1 = "titanic$Fare",
V2 = "30",
Boolean.operator = ">=",
newobj = "fare_30"
)
## $is.object.created
## [1] "A data object <fare_30> has been created in all specified data sources"
##
## $validity.check
## [1] "<fare_30> appears valid in all sources"
ds.Boole(
V1 = "titanic$Fare",
V2 = "70",
Boolean.operator = ">=",
newobj = "fare_70"
)
## $is.object.created
## [1] "A data object <fare_70> has been created in all specified data sources"
##
## $validity.check
## [1] "<fare_70> appears valid in all sources"
ds.assign("fare_30+fare_70", newobj = "fare_cat")
ds.table("fare_cat")
##
## Data in all studies were valid
##
## Study 1 : No errors reported from this study
## Study 2 : No errors reported from this study
## $output.list
## $output.list$TABLE_rvar.by.study_row.props
## study
## fare_cat server1 server2
## 0 0.5084485 0.4915515
## 1 0.4370370 0.5629630
## 2 0.5142857 0.4857143
## NA NaN NaN
##
## $output.list$TABLE_rvar.by.study_col.props
## study
## fare_cat server1 server2
## 0 0.7454955 0.7158837
## 1 0.1328829 0.1700224
## 2 0.1216216 0.1140940
## NA 0.0000000 0.0000000
##
## $output.list$TABLE_rvar.by.study_counts
## study
## fare_cat server1 server2
## 0 331 320
## 1 59 76
## 2 54 51
## NA 0 0
##
## $output.list$TABLES.COMBINED_all.sources_proportions
## fare_cat
## 0 1 2 NA
## 0.731 0.152 0.118 0.000
##
## $output.list$TABLES.COMBINED_all.sources_counts
## fare_cat
## 0 1 2 NA
## 651 135 105 0
##
##
## $validity.message
## [1] "Data in all studies were valid"
The object we have created has value 0 if none of the condtions are met (ie for passengers with a fare <30), 1 if one condition is met (>= to but not 70), or 2 if both conditions are met (>= 30 & >= 70).
See ds.case_when
for a more efficient and fleixble option if you have access to dsTidyverse
.
To bind any number of data frames by column you can use ds.cbind()
. Note that this function simply combines two datasets based on the current order; to ensure that ids are linked use ds.merge()
.
ds.cbind(c("titanic$Fare", "titanic$Pclass"), newobj = "titanic_two_variables")
## $is.object.created
## [1] "A data object <titanic_two_variables> has been created in all specified data sources"
##
## $validity.check
## [1] "<titanic_two_variables> appears valid in all sources"
ds.colnames("titanic_two_variables")
## $server1
## [1] "Fare" "Pclass"
##
## $server2
## [1] "Fare" "Pclass"
This is not possible using dsBase
, see ds.arrange
in the dsTidyverse
package.
This is not possible using dsBase
, see ds.distinct
in the dsTidyverse
package.
This tutorial covers the main data manipulation operations but is not exaustive: run ls(package:dsBaseClient)
to see all available functions.
datashield.logout(conns)