Why dplyover?

Introduction

{dplyover} extends {dplyr}’s functionality by building a function family around dplyr::across().

The goal of this over-across function family is to provide a concise and uniform syntax which can be used to create columns by applying functions to vectors and/or sets of columns in {dplyr}. Ideally, this will:

  • reduce the amount of code needed to create variables derived from existing columns, which is especially helpful in exploratory data analysis (e.g. lagging, collapsing or recoding many variables in a similar way).
  • provide a clean {dplyr} approach to creating many variables that are calculated from two or more existing variables.
  • improve our mental model so that it is easier to tackle problems where the solution is based on creating new columns.

To illustrate the points above it is helpful to look at a couple of use cases. This vignette will (i) look at three examples, (ii) present some alternatives that do not rely on {dplyover} and (iii) elaborate on the pros and cons of each.

Use cases and workarounds

In this section we will look at three different use cases of {dplyover}:

  1. Creating columns derived from a comma-separated character vector
  2. Creating several lagged variables for a set of columns
  3. Applying functions to a set of variable pairs

Before we begin, here is our setup:

library(dplyr)
library(tidyr)
library(purrr)
library(dplyover)
library(ggplot2)
library(bench)

Creating columns derived from a comma-separated character vector

{dplyover} helps us to think in terms of the columns we want to create. A good example of when this kind of thinking is useful is dealing with a comma-separated character vector. In the example below the column ‘csat_open’ contains one or several comma-separated reasons why a good or bad satisfaction rating was given. We want to transform the data so that each unique reason in ‘csat_open’ gets its own column in the form of a dummy variable: taking 1 when the reason was given and 0 otherwise. To do this we use over together with dist_values, a wrapper around unique that splits comma-separated strings when a separator is provided in .sep:

csat %>%
  select(csat_open) %>% # for better printing
  mutate(over(dist_values(csat_open, .sep = ", "),
              ~ as.integer(grepl(.x, csat_open)),
              .names = "rsp_{x}",
              .names_fn = ~ gsub(" ", "_", .x))) %>%
  glimpse # for better printing
#> Rows: 150
#> Columns: 7
#> $ csat_open          <chr> "good service", "unfriendly", "too expensive, unfri~
#> $ rsp_friendly_staff <int> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, ~
#> $ rsp_good_service   <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, ~
#> $ rsp_great_product  <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, ~
#> $ rsp_no_response    <int> 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, ~
#> $ rsp_too_expensive  <int> 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, ~
#> $ rsp_unfriendly     <int> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, ~

over makes it easy to think about the problem we are dealing with: we just iterate over all unique values in csat_open and check with grepl whether each value is present. The first two arguments alone would suffice to produce a usable result. Here, we additionally use the .names and .names_fn arguments to create nice column names by adding a prefix (‘rsp_’) and replacing the white spaces with underscores.
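
To see what over is iterating over, we can call dist_values on its own. The output below is reconstructed from the column names in the glimpse above, so treat it as illustrative:

dist_values(csat$csat_open, .sep = ", ")
#> [1] "friendly staff" "good service"   "great product"  "no response"
#> [5] "too expensive"  "unfriendly"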

A tidyverse approach that comes close to over’s functionality is to use purrr::map_dfc inside mutate. The main difference is that we first have to create a named vector which holds the new column names as names and the unique values of ‘csat_open’ as values. This makes the approach less pipe-friendly.

var_nms <- strsplit(csat$csat_open, ", ") %>% unlist %>% unique %>%
  set_names(., paste0("rsp_", gsub(" ", "_", .)))

csat %>%
  transmute(map_dfc(var_nms,
              ~ as.integer(grepl(.x, csat_open))))

While purrr::map_dfc is a valid workaround, it is less well known and therefore not often used in the wild (at least anecdotally, judging by answers to similar questions on Stack Overflow).

A more common way to tackle this problem within the tidyverse is to use {tidyr}. Here we first use separate_rows to make the data longer. Then we use pivot_wider to generate the columns we are interested in. Although the code is not much longer, we have to do some thinking about how to set up pivot_wider to yield our desired result. And although we make use of a lot of pivot_wider’s arguments, the column names still require a finishing touch.

csat %>%
  separate_rows(csat_open, sep = ", ") %>% 
  pivot_wider(names_from = csat_open,
              names_prefix = "rsp_",
              names_repair = "universal",
              values_from = csat_open,
              values_fn = function(x) 1,
              values_fill = 0) %>%
  select(starts_with("rsp_")) %>% # for better printing
  glimpse # for better printing
#> New names:
#> * `rsp_good service` -> `rsp_good.service`
#> * `rsp_too expensive` -> `rsp_too.expensive`
#> * `rsp_no response` -> `rsp_no.response`
#> * `rsp_friendly staff` -> `rsp_friendly.staff`
#> * `rsp_great product` -> `rsp_great.product`
#> Rows: 150
#> Columns: 6
#> $ rsp_good.service   <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, ~
#> $ rsp_unfriendly     <dbl> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, ~
#> $ rsp_too.expensive  <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, ~
#> $ rsp_no.response    <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, ~
#> $ rsp_friendly.staff <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, ~
#> $ rsp_great.product  <dbl> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, ~

Another approach is to (i) get all unique values in ‘csat_open’, (ii) use them to create empty columns filled with zeros, and (iii) use cur_column together with grepl inside dplyr::across to test whether the current column name is present in ‘csat_open’. This approach has two drawbacks. First, it is not pipe-friendly. Second, creating nice column names isn’t easy. This may come as a surprise, since dplyr::across has a .names argument; however, while it lets us add a prefix, it does not let us replace white spaces with underscores. Further, once we make use of the .names argument, new copies of the input columns are generated, so we need to get rid of the old columns.

var_nms <- strsplit(csat$csat_open, ", ") %>% unlist %>% unique

# instead of the set_names below we could do the following in base R:
# csat[, var_nms] <- 0
csat %>% 
  transmute(!!! set_names(rep(0, length(var_nms)), var_nms), 
            across(all_of(var_nms), 
                   ~ grepl(cur_column(), csat_open) %>% as.integer,
                   .names = "rsp_{col}")) %>%
  select(!all_of(var_nms)) %>% 
  glimpse # for better printing
#> Rows: 150
#> Columns: 6
#> $ `rsp_good service`   <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0~
#> $ rsp_unfriendly       <int> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1~
#> $ `rsp_too expensive`  <int> 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1~
#> $ `rsp_no response`    <int> 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1~
#> $ `rsp_friendly staff` <int> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0~
#> $ `rsp_great product`  <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0~

If we look at the performance of each approach, we can see that map_dfc is the fastest, while pivot_wider is the slowest. across and over range somewhere in between.
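
The timings were measured with {bench}, which we loaded in the setup. A minimal sketch of how such a comparison can be set up is shown below; the f_*() wrappers are hypothetical functions, each running one of the four pipelines above:

bench::mark(
  over        = f_over(),        # hypothetical wrapper around the over() pipeline
  map_dfc     = f_map_dfc(),     # ... and likewise for the other three approaches
  pivot_wider = f_pivot_wider(),
  across      = f_across(),
  check = FALSE                  # the approaches differ in column names and order
)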

#> # A tibble: 4 x 3
#>   expression    median mem_alloc
#>   <bch:expr>  <bch:tm> <bch:byt>
#> 1 over          4.32ms    25.9KB
#> 2 map_dfc       3.66ms    24.6KB
#> 3 pivot_wider  10.49ms   139.8KB
#> 4 across         6.7ms    61.7KB

Creating several lagged variables for a set of columns

In exploratory data analysis, creating a range of lagged variants of several variables is a common task. Let’s take the diamonds data set from {ggplot2} and create five lagged versions of the columns ‘x’, ‘y’ and ‘z’, lagged by positions 1 to 5. crossover allows us to iterate a set of columns over a vector:

diamonds %>% 
  mutate(crossover(c(x,y,z),
                   1:5,
                   list(lag = ~ lag(.x, .y)),
                   .names = "{xcol}_{fn}{y}"))
#> # A tibble: 53,940 x 25
#>   carat cut     color clarity depth table price     x     y     z x_lag1 x_lag2
#>   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>  <dbl>
#> 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43  NA     NA   
#> 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31   3.95  NA   
#> 3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31   3.89   3.95
#> 4  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63   4.05   3.89
#> # i 53,936 more rows
#> # i 13 more variables: x_lag3 <dbl>, x_lag4 <dbl>, x_lag5 <dbl>, y_lag1 <dbl>,
#> #   y_lag2 <dbl>, y_lag3 <dbl>, y_lag4 <dbl>, y_lag5 <dbl>, ...

While this was easy, there are of course other ways to approach this problem. We could use dplyr::across and just insert the call to lag five times manually in the .fns argument (this variant is sketched further below). However, in this vignette we only consider approaches that save us from repeating code patterns. One option is using {purrr}’s map_dfc inside dplyr::across:

diamonds %>% 
  mutate(across(c(x,y,z),
                ~ map_dfc(set_names(1:5, paste0("lag", 1:5)),
                          function(y) lag(.x, y))
                )) %>% 
  do.call(data.frame, .)

This approach is pretty close to crossover, but we need to take care of two things. First, we have to create the variable names using purrr::set_names, otherwise our new variables would be named x...1, x...2 and so on. Second, calling purrr::map_dfc within dplyr::across on three columns originally yields three tibbles, each containing the lagged variants of one column. To turn each tibble into five proper columns of our original data we need to pipe the final result into do.call(data.frame, .). All in all, although we stay in the tidyverse, the map call nested in across, together with set_names and the final do.call, makes this approach less readable.
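
For contrast, the repetitive variant mentioned at the start of this section, with the call to lag hard-coded five times in across’s .fns argument, would look roughly like this (a sketch of the repetition the approaches here avoid):

diamonds %>%
  mutate(across(c(x, y, z),
                list(lag1 = ~ lag(.x, 1),
                     lag2 = ~ lag(.x, 2),
                     lag3 = ~ lag(.x, 3),
                     lag4 = ~ lag(.x, 4),
                     lag5 = ~ lag(.x, 5))))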

Finally, another approach would be to build a custom function and use purrr::reduce2 to chain several dplyr::mutate calls, each taking the previous call’s output as input.

create_lags <- function(df, .x, .y) {
    mutate(df, "{.x}_lag{.y}" := lag(!! sym(.x), .y))
}

diamonds %>% 
  reduce2(rep(c("x", "y", "z"), 5),
          rep(1:5, 3),
          create_lags,
          .init = .)

This approach is very clever, and {purrr}’s reduce function is a pipe-friendly approach that works great with {dplyr}’s one-table verbs. However, here too we have to take care of two things. First, using NSE in the custom function requires some advanced knowledge of the tidyverse: (i) how do we create nice names on the left-hand side of the walrus operator :=, and (ii) how do we evaluate strings as column names? Second, we have to figure out how to set up reduce2 to work with our custom function: our original data goes into the .init argument, and the arguments we want to loop over need to be repeated in a way that ensures that each combination of elements is created.
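
The second point is easier to see when we print the two rep calls on their own (a quick check, not part of the original code):

rep(c("x", "y", "z"), 5)
#> [1] "x" "y" "z" "x" "y" "z" "x" "y" "z" "x" "y" "z" "x" "y" "z"
rep(1:5, 3)
#> [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
# Because 3 and 5 share no common divisor, pairing the two vectors
# element-wise yields each of the 15 (column, lag) combinations exactly once.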

Finally, we compare the performance of each approach. The table below shows that using map_dfc within across is the most performant approach, while using a custom function with reduce2 is the least performant. crossover is not too far off in terms of speed compared to the map_dfc approach. The big gain of using crossover is that it reduces the cognitive load of figuring out how to use advanced tidyverse workarounds (like reduce2 or a map call nested in across).

#> # A tibble: 3 x 3
#>   expression          median mem_alloc
#>   <bch:expr>        <bch:tm> <bch:byt>
#> 1 crossover           18.7ms      26MB
#> 2 across_map_dfc      15.6ms    15.4MB
#> 3 custom_fct_reduce     50ms    15.6MB

Applying functions to a set of variable pairs

Some data sets contain many variables with similar patterns. A ‘tiny’ example is the iris data set, which contains columns combining ‘Sepal’ and ‘Petal’ with ‘Length’ and ‘Width’. In the example below we want to calculate the product and the sum of both pairs of ‘Length’ and ‘Width’ variables. For problems like this across2 is the perfect tool. It works just like dplyr::across, but lets us choose two sets of columns, .xcols and .ycols. across2 also allows us to use a special glue specification, {pre}, in its .names argument, which evaluates to the common prefix of each variable pair.

iris %>%
  mutate(across2(ends_with("Length"),
                 ends_with("Width"),
                 .fns = list(product = ~ .x * .y,
                             sum = ~ .x + .y),
                 .names = "{pre}_{fn}")) %>%
  glimpse
#> Rows: 150
#> Columns: 9
#> $ Sepal.Length  <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4~
#> $ Sepal.Width   <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3~
#> $ Petal.Length  <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1~
#> $ Petal.Width   <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0~
#> $ Species       <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, ~
#> $ Sepal_product <dbl> 18, 15, 15, 14, 18, 21, 16, 17, 13, 15, 20, 16, 14, 13, ~
#> $ Sepal_sum     <dbl> 8.6, 7.9, 7.9, 7.7, 8.6, 9.3, 8.0, 8.4, 7.3, 8.0, 9.1, 8~
#> $ Petal_product <dbl> 0.28, 0.28, 0.26, 0.30, 0.28, 0.68, 0.42, 0.30, 0.28, 0.~
#> $ Petal_sum     <dbl> 1.6, 1.6, 1.5, 1.7, 1.6, 2.1, 1.7, 1.7, 1.6, 1.6, 1.7, 1~

While in this case a simple call to mutate (defining the four new variables directly, as spelled out below) would have done the job, across2 really comes in handy when we are dealing with data sets with many variables that need to be computed in a similar way.
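
For comparison, this is the spelled-out mutate call for the two variable pairs; it yields the same four new columns as the across2 call above:

iris %>%
  mutate(Sepal_product = Sepal.Length * Sepal.Width,
         Sepal_sum     = Sepal.Length + Sepal.Width,
         Petal_product = Petal.Length * Petal.Width,
         Petal_sum     = Petal.Length + Petal.Width)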

Again, there are different workarounds to produce the same result. One option is to use dplyr::across in combination with dplyr::cur_column. First we shorten the names of ‘Sepal.Length’ and ‘Petal.Length’ to their stem using dplyr::rename_with. This allows us to use paste0 to construct the names of the corresponding ‘Width’ variables. Below we use get on the names constructed this way to access the underlying values (alternatively we could have used eval(sym())). While this comes close to our desired result, we probably want to restore the original names. Unfortunately, this requires additional setup, which is why we rename the variables manually below.

iris %>%
  rename_with(.fn = ~ gsub(".Length", "", .x),
              .cols = ends_with("Length")) %>% 
  mutate(across(c(Sepal, Petal), 
                list(product = ~ .x * get(paste0(cur_column(), ".Width")),
                     sum = ~ .x + get(paste0(cur_column(), ".Width"))))) %>% 
  rename(Sepal.Length = Sepal,
         Petal.Length = Petal)

Another approach is, again, {purrr}’s reduce2 in combination with a custom function. Apart from the data, df, our custom function takes two strings as arguments: .x, the common stem of our variable pairs, and .y, the name of the operation we want to perform. The downsides of this approach were already mentioned above: (i) it requires advanced NSE knowledge and (ii) we have to figure out how to use reduce2, especially building the inputs using rep.

create_vars <- function(df, .x, .y) {
    mutate(df, "{.x}_{.y}" := switch(.y,
                                     sum = !! sym(paste0(.x, ".Width")) +
                                           !! sym(paste0(.x, ".Length")),
                                     product = !! sym(paste0(.x, ".Width")) *
                                               !! sym(paste0(.x, ".Length")))
    )
}

iris %>% 
  reduce2(rep(c("Sepal", "Petal"), each = 2),
          rep(c("product", "sum"), 2),
          create_vars,
          .init = .) 

Comparing the three approaches above in terms of speed shows that dplyr::across is the most performant. This time, reduce2 follows closely and across2 is the least performant. Despite being slower, across2 has several advantages:

  1. No knowledge of NSE or {purrr}’s reduce is needed.
  2. It is the most readable approach.
  3. It allows us to produce nice names, including a common pre- or suffix.
  4. It does not depend on the variables having to share a common stem.

#> # A tibble: 3 x 3
#>   expression          median mem_alloc
#>   <bch:expr>        <bch:tm> <bch:byt>
#> 1 across2            13.75ms   596.8KB
#> 2 across              5.34ms    72.3KB
#> 3 custom_fct_reduce   8.05ms      76KB
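
To illustrate the last advantage: across2 accepts any two column selections, so the pairs do not need a shared stem. The sketch below assumes that across2 supports {xcol} and {ycol} in .names, analogous to the {xcol} specification of crossover shown earlier:

# a hypothetical pairing of two unrelated sets of columns
iris %>%
  mutate(across2(c(Sepal.Length, Sepal.Width),
                 c(Petal.Length, Petal.Width),
                 .fns = list(ratio = ~ .x / .y),
                 .names = "{xcol}_{ycol}_{fn}"))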

Conclusion

Above we have discussed three use cases of {dplyover}’s over-across function family. We saw that each use case could be solved with other tidyverse functions or workarounds. However, each alternative approach had its own drawback: either creating nice column names wasn’t straightforward, or we had to get our head around NSE and advanced functions such as reduce. Most of the alternative approaches were also less readable and less pipe-friendly. But most of all, we saw that there is no uniform approach to a common problem: creating columns based on vectors and/or sets of columns.

For this specific problem the over-across function family:

  • provides a concise and uniform {dplyr} syntax similar to dplyr::across
  • helps to write clean, readable, and pipe-friendly code without repetitions
  • improves our mental model so that we can focus more on functions to create columns (and less on data rectangling)