{dplyover} extends {dplyr}'s functionality by building a function family around dplyr::across().
The goal of this over-across function family is to provide a concise and uniform syntax which can be used to create columns by applying functions to vectors and/or sets of columns in {dplyr}. Ideally, this will:

- save us from repeating code patterns
- keep our code readable and pipe-friendly
- make it easy to create nice column names
- reduce the cognitive load compared to more advanced workarounds
To illustrate the points above it is helpful to look at a couple of use cases. This vignette will (i) look at three examples, (ii) present some alternatives that do not rely on {dplyover} and (iii) elaborate on the pros and cons.

In this section we will look at three different use cases of {dplyover}:

1. creating dummy variables from the unique values of a character column
2. creating a range of lagged variants of several columns
3. applying functions to pairs of related columns
Before we begin, here is our setup:
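A minimal sketch of the setup chunk, assuming that the csat example data ships with {dplyover}:

library(dplyr)    # mutate(), across(), cur_column()
library(tidyr)    # separate_rows(), pivot_wider()
library(purrr)    # map_dfc(), reduce2(), set_names()
library(ggplot2)  # the diamonds data set
library(dplyover) # over(), crossover(), across2(); assumed to include the csat data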
{dplyover} helps us to think in terms of the columns that we want to create. A good example of when this kind of thinking is useful is dealing with a comma-separated character vector. In the example below the column 'csat_open' contains one or several comma-separated reasons why a good or bad satisfaction rating was given. We want to transform the data so that each unique reason in 'csat_open' gets its own column in the form of a dummy variable: taking 1 when the reason was given and 0 otherwise. To do this we use over() together with dist_values(), which is a wrapper around unique() that splits comma-separated strings when a separator is provided in .sep:
csat %>%
  select(csat_open) %>% # for better printing
  mutate(over(dist_values(csat_open, .sep = ", "),
              ~ as.integer(grepl(.x, csat_open)),
              .names = "rsp_{x}",
              .names_fn = ~ gsub(" ", "_", .x))) %>%
  glimpse # for better printing
#> Rows: 150
#> Columns: 7
#> $ csat_open <chr> "good service", "unfriendly", "too expensive, unfri~
#> $ rsp_friendly_staff <int> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, ~
#> $ rsp_good_service <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, ~
#> $ rsp_great_product <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, ~
#> $ rsp_no_response <int> 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, ~
#> $ rsp_too_expensive <int> 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, ~
#> $ rsp_unfriendly <int> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, ~
over() makes it easy to think about the problem we are dealing with. We just iterate over all unique values in csat_open and check with grepl() if the value is present. The first two arguments would suffice to produce a usable result. Here, we additionally use the .names and .names_fn arguments to create nice column names by appending a prefix ('rsp_') and replacing the white spaces with underscores.
A tidyverse approach that comes close to over()'s functionality is to use purrr::map_dfc() inside mutate(). The only difference is that we have to create a named vector, which holds the new column names as names and the unique values of 'csat_open' as values. This makes this approach less pipe-friendly.
var_nms <- strsplit(csat$csat_open, ", ") %>% unlist %>% unique %>%
  set_names(., paste0("rsp_", gsub(" ", "_", .)))

csat %>%
  transmute(map_dfc(var_nms,
                    ~ as.integer(grepl(.x, csat_open))))
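Note that the tibble returned by map_dfc() is left unnamed inside transmute(), so dplyr automatically splices its columns into the result, one column per unique value.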
While purrr::map_dfc() is a valid workaround, it is less well known and therefore not often used in the wild (at least this is my anecdotal evidence from looking at answers to similar questions on Stack Overflow).
A more common way to tackle this problem within the tidyverse is to use {tidyr}. Here we first use separate_rows() to make the data longer. Then we use pivot_wider() to generate the columns we are interested in. Although the code is not much longer, we have to do some thinking about how to set up pivot_wider() to yield our desired result. And although we make use of a lot of pivot_wider()'s arguments, the column names still require a finishing touch.
csat %>%
  separate_rows(csat_open, sep = ", ") %>%
  pivot_wider(names_from = csat_open,
              names_prefix = "rsp_",
              names_repair = "universal",
              values_from = csat_open,
              values_fn = function(x) 1,
              values_fill = 0) %>%
  select(starts_with("rsp_")) %>% # for better printing
  glimpse # for better printing
#> New names:
#> * `rsp_good service` -> `rsp_good.service`
#> * `rsp_too expensive` -> `rsp_too.expensive`
#> * `rsp_no response` -> `rsp_no.response`
#> * `rsp_friendly staff` -> `rsp_friendly.staff`
#> * `rsp_great product` -> `rsp_great.product`
#> Rows: 150
#> Columns: 6
#> $ rsp_good.service <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, ~
#> $ rsp_unfriendly <dbl> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, ~
#> $ rsp_too.expensive <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, ~
#> $ rsp_no.response <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, ~
#> $ rsp_friendly.staff <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, ~
#> $ rsp_great.product <dbl> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, ~
Another approach is to (i) get all unique values in 'csat_open' and use them to (ii) create empty columns filled with zeros. Then (iii) we can use cur_column() with grepl() inside dplyr::across() to test if the current column name is present in 'csat_open'. This approach has two negative aspects. First, it is not pipe-friendly. Second, creating nice column names isn't easy. This comes as a surprise, since dplyr::across() has a .names argument. This lets us append a prefix, but it does not let us replace whitespaces with underscores. Further, once we make use of the .names argument, new copies of the input columns are generated, so we need to get rid of the old columns.
var_nms <- strsplit(csat$csat_open, ", ") %>% unlist %>% unique

# instead of the set_names below we could do the following in base R:
# csat[, var_nms] <- 0
csat %>%
  transmute(!!! set_names(rep(0, length(var_nms)), var_nms),
            across(all_of(var_nms),
                   ~ grepl(cur_column(), csat_open) %>% as.integer,
                   .names = "rsp_{col}")) %>%
  select(!all_of(var_nms)) %>%
  glimpse # for better printing
#> Rows: 150
#> Columns: 6
#> $ `rsp_good service` <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0~
#> $ rsp_unfriendly <int> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1~
#> $ `rsp_too expensive` <int> 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1~
#> $ `rsp_no response` <int> 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1~
#> $ `rsp_friendly staff` <int> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0~
#> $ `rsp_great product` <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0~
If we look at the performance of each approach, we can see that map_dfc() is fastest, while pivot_wider() is slowest. across() and over() range somewhere in between.
#> # A tibble: 4 x 3
#> expression median mem_alloc
#> <bch:expr> <bch:tm> <bch:byt>
#> 1 over 4.32ms 25.9KB
#> 2 map_dfc 3.66ms 24.6KB
#> 3 pivot_wider 10.49ms 139.8KB
#> 4 across 6.7ms 61.7KB
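A sketch of how such timings can be produced with bench::mark(); the exact expressions and settings behind the table above are assumptions, and the remaining approaches are benchmarked in the same way:

bench::mark(
  over = mutate(csat, over(dist_values(csat_open, .sep = ", "),
                           ~ as.integer(grepl(.x, csat_open)))),
  map_dfc = transmute(csat, map_dfc(var_nms,
                                    ~ as.integer(grepl(.x, csat_open)))),
  check = FALSE # the approaches differ in column names and types
)[, c("expression", "median", "mem_alloc")]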
In exploratory data analysis, creating a range of lagged variants of several variables is not an uncommon use case. Let's take the diamonds data set from {ggplot2} and create five lagged versions of columns 'x', 'y' and 'z', lagged by positions 1 to 5. crossover() allows us to iterate a set of columns over a vector:
diamonds %>%
  mutate(crossover(c(x, y, z),
                   1:5,
                   list(lag = ~ lag(.x, .y)),
                   .names = "{xcol}_{fn}{y}"))
#> # A tibble: 53,940 x 25
#> carat cut color clarity depth table price x y z x_lag1 x_lag2
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 NA NA
#> 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3.95 NA
#> 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3.89 3.95
#> 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 4.05 3.89
#> # i 53,936 more rows
#> # i 13 more variables: x_lag3 <dbl>, x_lag4 <dbl>, x_lag5 <dbl>, y_lag1 <dbl>,
#> # y_lag2 <dbl>, y_lag3 <dbl>, y_lag4 <dbl>, y_lag5 <dbl>, ...
While this was easy, there are of course other ways to approach this problem. We could use dplyr::across() and just insert the call to lag() five times manually in the .fns argument. However, in this vignette we only consider approaches that save us from repeating code patterns. One option is using {purrr}'s map_dfc() inside dplyr::across():
diamonds %>%
  mutate(across(c(x, y, z),
                ~ map_dfc(set_names(1:5, paste0("lag", 1:5)),
                          function(y) lag(.x, y)))) %>%
  do.call(data.frame, .)
This approach is pretty close to crossover(), but we need to take care of two things. First, we have to create the variable names using purrr::set_names(), otherwise our new variables would be named x...1, x...2 and so on. Second, the original output of calling purrr::map_dfc() within dplyr::across() on three columns is three tibbles, each containing the lagged variants of one column. To turn each tibble into five proper columns of our original data we need to pipe the final result into do.call(data.frame, .). All in all, although staying in the tidyverse, the map call nested in across(), together with set_names() and the final do.call(), makes this approach less readable.
Finally, another approach would be to build a custom function and use purrr::reduce2() to chain several dplyr::mutate() calls, each taking the output of the previous call as input.
create_lags <- function(df, .x, .y) {
  mutate(df, "{.x}_lag{.y}" := lag(!!sym(.x), .y))
}

diamonds %>%
  reduce2(rep(c("x", "y", "z"), 5),
          rep(1:5, 3),
          create_lags,
          .init = .)
This approach is very clever, and {purrr}'s reduce() is a pipe-friendly function which works great with {dplyr}'s one-table verbs. However, here too we have to take care of two things. First, using NSE in the custom function requires some advanced knowledge of the tidyverse: (i) How do we create nice names on the left-hand side of the walrus operator := ? (ii) How do we evaluate strings as column names? Second, we have to figure out how to set up reduce2() to work with our custom function: our original data goes into the .init argument, and the arguments we want to loop over need to be repeated in a way that ensures that each combination of elements is created.
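To see why the two rep() calls cover every combination, it helps to line up the inputs element-wise (a small illustration, not part of the original code):

rep(c("x", "y", "z"), 5) # "x" "y" "z" "x" "y" "z" ... (length 15)
rep(1:5, 3)              #  1   2   3   4   5   1  ... (length 15)
# pairing the vectors element-wise gives (x,1), (y,2), (z,3), (x,4), ...
# because 3 and 5 have no common divisor, these 15 pairs are exactly all
# column/lag combinations, so each lagged column is created exactly once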
Finally, we compare the performance of each operation. The table below shows that using map_dfc() within across() is the most performant approach, while using a custom function with reduce() is the least performant. crossover() is not too far off in terms of speed compared to the map_dfc() approach. The big gain of using crossover() is that it reduces the cognitive load of figuring out how to use advanced tidyverse workarounds (like reduce() or a map call nested in across()).
#> # A tibble: 3 x 3
#> expression median mem_alloc
#> <bch:expr> <bch:tm> <bch:byt>
#> 1 crossover 18.7ms 26MB
#> 2 across_map_dfc 15.6ms 15.4MB
#> 3 custom_fct_reduce 50ms 15.6MB
Some data sets contain many variables with similar patterns. A 'tiny' example is the iris data set, containing columns with combinations of 'Sepal' and 'Petal' with 'Length' and 'Width'. In the example below we want to calculate the product and the sum of both pairs of 'Length' and 'Width' variables. For problems like this, across2() is the perfect tool. It works just like dplyr::across(), but lets us choose two sets of columns - .xcols and .ycols. across2() allows us to use a special glue specification, {pre}, in its .names argument, which will evaluate to the common prefix of each variable pair.
iris %>%
  mutate(across2(ends_with("Length"),
                 ends_with("Width"),
                 .fns = list(product = ~ .x * .y,
                             sum = ~ .x + .y),
                 .names = "{pre}_{fn}")) %>%
  glimpse
#> Rows: 150
#> Columns: 9
#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4~
#> $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3~
#> $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1~
#> $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0~
#> $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, ~
#> $ Sepal_product <dbl> 18, 15, 15, 14, 18, 21, 16, 17, 13, 15, 20, 16, 14, 13, ~
#> $ Sepal_sum <dbl> 8.6, 7.9, 7.9, 7.7, 8.6, 9.3, 8.0, 8.4, 7.3, 8.0, 9.1, 8~
#> $ Petal_product <dbl> 0.28, 0.28, 0.26, 0.30, 0.28, 0.68, 0.42, 0.30, 0.28, 0.~
#> $ Petal_sum <dbl> 1.6, 1.6, 1.5, 1.7, 1.6, 2.1, 1.7, 1.7, 1.6, 1.6, 1.7, 1~
While in this case a simple call to mutate() (defining two new variables) would have done the job, across2() really comes in handy when we are dealing with data sets with many variables that need to be computed in a similar way.
Again, there are different workarounds to produce the same result. One option is to use dplyr::across() and dplyr::cur_column(). First we shorten the names of 'Sepal.Length' and 'Petal.Length' to their stem using dplyr::rename_with(). This allows us to use paste0() to construct the names of the 'Width' variables. Below we use get() on the names constructed this way to access the underlying values (alternatively we could have used eval(sym())). While this comes close to our desired result, we probably want to restore the original names. Unfortunately, this requires an additional setup, which is why we rename the variables manually below.
iris %>%
  rename_with(.fn = ~ gsub(".Length", "", .x),
              .cols = ends_with("Length")) %>%
  mutate(across(c(Sepal, Petal),
                list(product = ~ .x * get(paste0(cur_column(), ".Width")),
                     sum = ~ .x + get(paste0(cur_column(), ".Width"))))) %>%
  rename(Sepal.Length = Sepal,
         Petal.Length = Petal)
Another approach is again {purrr}'s reduce2() in combination with a custom function. Apart from the data, df, our custom function takes two strings as arguments: .x, the common stem of our variable pairs, and .y, the name of the operation we want to perform. The downsides of this approach were already mentioned above: (i) advanced NSE knowledge and (ii) figuring out how to use reduce2(), especially building the inputs using rep().
create_vars <- function(df, .x, .y) {
  mutate(df, "{.x}_{.y}" := switch(.y,
    sum     = !!sym(paste0(.x, ".Width")) + !!sym(paste0(.x, ".Length")),
    product = !!sym(paste0(.x, ".Width")) * !!sym(paste0(.x, ".Length"))
  ))
}
iris %>%
  reduce2(rep(c("Sepal", "Petal"), each = 2),
          rep(c("product", "sum"), 2),
          create_vars,
          .init = .)
Comparing the three operations above in terms of speed shows that dplyr::across() is the most performant. This time, reduce2() follows closely and across2() is the least performant. Although it is slower, across2() has several advantages:

- it is pipe-friendly and easy to read,
- nice column names can be created on the fly via the {pre} glue specification, and
- no advanced knowledge of NSE or functions like reduce() is needed.

#> # A tibble: 3 x 3
is needed.#> # A tibble: 3 x 3
#> expression median mem_alloc
#> <bch:expr> <bch:tm> <bch:byt>
#> 1 across2 13.75ms 596.8KB
#> 2 across 5.34ms 72.3KB
#> 3 custom_fct_reduce 8.05ms 76KB
Above we have discussed three use cases of {dplyover}'s over-across function family. We saw that each use case could be solved with other tidyverse functions or workarounds. However, each alternative approach had its own drawback: either creating nice column names wasn't straightforward, or we had to get our head around NSE and advanced functions such as reduce(). Most of the alternative approaches were also less readable and less pipe-friendly. But most of all, we saw that there is no uniform approach to a common problem: creating columns based on vectors and/or sets of columns.
For this specific problem the over-across function family offers a uniform, pipe-friendly syntax, makes it easy to create nice column names, and keeps the cognitive load low - even if it cannot always match the raw performance of dplyr::across().