When we set out to develop a new function, we often aim to do something quite complicated that involves multiple steps. If you are new to developing, the natural way to write this may be to include all the steps sequentially within one function. Whilst your function may work, if it contains many lines of code this can cause problems: the function becomes harder for other people to read and understand, and harder to test.
The solution to this is refactoring. Refactoring consists of breaking a longer function into smaller components. In this tutorial, we'll walk through a simple example of a long function and see how we can improve it by breaking it into smaller, more focused functions.
library(cli)
library(dplyr)

clean_calculate_and_print <- function(data, n_rows, cols, new_names, message) {
  # Validate the arguments
  if (is.null(data)) {
    cli_abort("`data` must not be NULL")
  }
  if (is.null(n_rows)) {
    cli_abort("`n_rows` must not be NULL")
  }
  if (is.null(cols)) {
    cli_abort("`cols` must not be NULL")
  }
  if (is.null(new_names)) {
    cli_abort("`new_names` must not be NULL")
  }
  if (length(new_names) != length(cols)) {
    cli_abort(
      c(
        x = "Length of `new_names` must be the same as `cols`",
        i = "`cols` has length {length(cols)}",
        i = "`new_names` has length {length(new_names)}"
      )
    )
  }

  # Clean: keep only the requested columns and rows
  new_data <- data |> dplyr::select(any_of(cols))
  new_data <- new_data[n_rows, ]

  # Transform: cube every column
  new_data <- new_data |> mutate(across(everything(), function(x) {x^3}))

  # Summarise: mean of each transformed column, ignoring missing values
  means <- new_data |>
    summarise(across(everything(), function(x) mean(x, na.rm = TRUE)))

  # Report the means alongside the caller's message
  cli_inform(
    c(
      "i" = "{message}",
      ">" = "{means}"
    )
  )
}
This function cleans some data, performs a transformation, calculates the means of the transformed columns and prints these values with a message. Here we can see what happens when we run it:
> clean_calculate_and_print(
+   data = mtcars,
+   n_rows = 1:5,
+   cols = c("mpg", "cyl", "wt"),
+   new_names = c("miles_per_gallon", "cylinder", "weight"),
+   message = "Here are your mean values"
+ )
ℹ Here are your mean values
→ 20.090625, 6.1875, and 3.21725
Whilst the function works, it could be daunting for someone who didn't write it to understand. It would also be difficult to write a unit test for it, as we would be trying to test many things at once.
clean_calculate_and_print <- function(data, n_rows, cols, new_names, message) {
  .check_args(data, n_rows, cols, new_names)
  data_cleaned <- .clean_data(data, cols, n_rows)
  data_transformed <- .transform_data(data_cleaned)
  means <- .calculate_means(data_transformed)
  .print_means_with_message(message, means)
}

# Abort with an informative error if any argument is missing or inconsistent
.check_args <- function(data, n_rows, cols, new_names) {
  if (is.null(data)) {
    cli_abort("`data` must not be NULL")
  }
  if (is.null(n_rows)) {
    cli_abort("`n_rows` must not be NULL")
  }
  if (is.null(cols)) {
    cli_abort("`cols` must not be NULL")
  }
  if (is.null(new_names)) {
    cli_abort("`new_names` must not be NULL")
  }
  if (length(new_names) != length(cols)) {
    cli_abort(
      c(
        x = "Length of `new_names` must be the same as `cols`",
        i = "`cols` has length {length(cols)}",
        i = "`new_names` has length {length(new_names)}"
      )
    )
  }
}

# Keep only the requested columns and rows
.clean_data <- function(data, cols, n_rows) {
  new_data <- data |> dplyr::select(any_of(cols))
  new_data <- new_data[n_rows, ]
  return(new_data)
}

# Cube every column
.transform_data <- function(data) {
  return(data |>
    mutate(across(everything(), function(x) {x^3})))
}

# Mean of each column, ignoring missing values
.calculate_means <- function(data) {
  means <- data |>
    summarise(across(everything(), function(x) mean(x, na.rm = TRUE)))
  return(means)
}

# Report the means alongside the caller's message
.print_means_with_message <- function(message, means) {
  cli_inform(
    c(
      "i" = "{message}",
      ">" = "{means}"
    )
  )
}
In the refactored version, we have taken each component and put it in a separate function. A good rule of thumb in refactoring is one task, one function. We have also tried to give each function a name that clearly describes what it does. To someone coming to the code for the first time, it should now be clearer what each step does. We can also write simpler unit tests, each of which focuses on one element of the functionality.
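To illustrate the point about simpler unit tests, here is a minimal sketch of what focused tests for two of the helpers might look like. It assumes the testthat and dplyr packages are installed; the test descriptions and the small input data frames are made up for illustration, and the two helpers are repeated so that the block runs on its own.

```r
library(dplyr)
library(testthat)

# Copies of two helpers from the refactored version,
# repeated here so the example is self-contained
.transform_data <- function(data) {
  data |>
    mutate(across(everything(), function(x) x^3))
}

.calculate_means <- function(data) {
  data |>
    summarise(across(everything(), function(x) mean(x, na.rm = TRUE)))
}

# Each test checks one task only, on a tiny hand-made input
test_that(".transform_data() cubes every column", {
  input <- data.frame(a = c(1, 2), b = c(3, 4))
  result <- .transform_data(input)
  expect_equal(result$a, c(1, 8))
  expect_equal(result$b, c(27, 64))
})

test_that(".calculate_means() ignores missing values", {
  input <- data.frame(a = c(1, 3, NA), b = c(2, 4, 6))
  result <- .calculate_means(input)
  expect_equal(result$a, 2)
  expect_equal(result$b, 4)
})
```

Because each helper does one thing, a failing test points directly at the step that broke, which would not be the case if we could only test the long function as a whole.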