This tutorial demonstrates how to perform data manipulation using the dsTidyverse
and dsTidyverseClient
packages. These are community packages which implement selected functions from the tidyverse
collection of R packages.
install.packages("dsTidyverseClient")
load(dsTidyverseClient)
ds.rename
renames columns within a server-side data frame. Column names are passed unquoted to tidy_expr
and are evaluated as column names within df.name
.
ds.colnames("mtcars")
#> $server_1
#> [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
ds.rename(df.name = "mtcars", tidy_expr = list(effiency = mpg, power = hp), newobj = "new_df")
ds.colnames("new_df")
#> $server_1
#> [1] "effiency" "cyl" "disp" "power" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
ds.select
creates a subset of columns in a server-side data frame. Column names are passed unquoted to tidy_expr
and are evaluated as column names within df.name
.
ds.colnames("mtcars")
#> $server_1
#> [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
ds.select(df.name = "mtcars", tidy_expr = list(mpg, disp, wt), newobj = "new_df")
ds.colnames("new_df")
#> $server_1
#> [1] "mpg" "disp" "wt"
ds.select(df.name = "mtcars", tidy_expr = list(mpg:drat), newobj = "new_df")
ds.colnames("new_df")
#> $server_1
#> [1] "mpg" "cyl" "disp" "hp" "drat"
ds.select(df.name = "mtcars", tidy_expr = list(!mpg), newobj = "new_df")
ds.colnames("new_df")
#> $server_1
#> [1] "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
Subset set a data frame flexibly, retaining all rows that satisfy the condition supplied to tidy_expr
ds.dim("mtcars")[[1]]
#> [1] 32 11
ds.filter(
df.name = "mtcars",
tidy_expr = list(cyl == 4 & mpg > 20),
newobj = "filtered",
datasources = conns)
ds.dim("filtered")[[1]]
#> [1] 11 11
ds.mutate
creates new columns in a server-side dataframe. These new columns are normally transformations of existing columns. Again, column names are passed unquoted to tidy_expr
and are evaluated as column names within df.name
.
ds.mutate(
df.name = "mtcars",
tidy_expr = list(mpg_trans = cyl*1000),
newobj = "new_df")
ds.mean("mtcars$cyl")$Mean.by.Study
#> EstimatedMean Nmissing Nvalid Ntotal
#> server_1 6.1875 0 32 32
ds.mean("new_df$mpg_trans")$Mean.by.Study
#> EstimatedMean Nmissing Nvalid Ntotal
#> server_1 6187.5 0 32 32
ds.if_else
allows you to specify a condition, and return values based on whether the condition is 'true' or 'false'
ds.if_else(
condition = list(mtcars$mpg > 20),
"high",
"low",
newobj = "mpg_cat")
ds.table("mpg_cat")$output.list$TABLE_rvar.by.study_counts
#>
#> Data in all studies were valid
#>
#> Study 1 : No errors reported from this study
#> study
#> mpg_cat server_1
#> high 14
#> low 18
#> NA 0
An extension of ds.if_else
which allows the specification of multiple conditions. Extremely
useful for recoding variables.
ds.case_when(
tidy_expr = list(
mtcars$mpg < 20 ~ "low",
mtcars$mpg >= 20 & mtcars$mpg < 30 ~ "medium",
mtcars$mpg >= 30 ~ "high"),
newobj = "recoded",
datasources = conns)
ds.table("recoded")$output.list$TABLE_rvar.by.study_counts
#>
#> Data in all studies were valid
#>
#> Study 1 : No errors reported from this study
#> study
#> recoded server_1
#> high 4
#> low 18
#> medium 10
#> NA 0
Bind any number of data frames by row, making a longer result. This is similar to do.call(rbind, dfs), but the output will contain all columns that appear in any of the inputs.
ds.dim("mtcars")[[1]]
#> [1] 32 11
ds.bind_rows(
to_combine = list(mtcars, mtcars),
newobj = "df_bound",
datasources = conns
)
ds.dim("df_bound")[[1]]
#> [1] 64 11
Bind any number of data frames by column, making a wider result. This is similar to do.call(cbind, dfs).
ds.colnames("mtcars")
#> $server_1
#> [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
ds.bind_cols(
to_combine = list(mtcars, mtcars),
newobj = "df_bound",
.name_repair = "minimal",
datasources = conns
)
ds.colnames("df_bound")
#> $server_1
#> [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb" "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs"
#> [20] "am" "gear" "carb"
arrange
is a more flexible version of base::sort
, which orders the rows of a data frame by the values of selected columns.
ds.arrange(
df.name = "mtcars",
tidy_expr = list(mpg, cyl),
newobj = "arranged_df",
datasources = conns
)
Many data operations are performed on groups defined by variables. group_by
takes an existing tibble or data frame and converts it into a grouped tibble. Subsequent operations can then be performed 'by group' rather than to the whole dataframe.
To group a data frame:
ds.group_by(
df.name = "mtcars",
tidy_expr = list(mpg, cyl),
newobj = "grouped"
)
ds.class("grouped")[[1]]
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
To ungroup a data frame:
ds.ungroup("grouped", "no_longer_grouped")
To return to clientside details of the groups, use ds.group_keys
. Note that this is subject to disclosure controls and will return an error if the number of groups is too high.
my_groups <- ds.group_keys("mtcars_group")
my_groups
#> $server_1
#> # A tibble: 3 × 1
#> cyl
#> <dbl>
#> 1 4
#> 2 6
#> 3 8
ds.distinct(
df.name = "mtcars",
newobj = "distinct_df"
)
ds.dim("distinct_df")[[1]]
#> [1] 32 11
Merging two datasets requires the use of ds.merge
.