---
title: "Why dplyover?"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Why dplyover?}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteDepends{tidyr}
%\VignetteDepends{bench}
%\VignetteDepends{ggplot2}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
options(
tibble.print_min = 4,
tibble.max_extra_cols = 8,
digits = 2,
crayon.enabled = FALSE,
cli.unicode = FALSE
)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
timings <- readRDS("why_bench.rds")
```
## Introduction
{dplyover} extends {dplyr}'s functionality by building a function family
around `dplyr::across()`.
The goal of this *over-across function family* is to provide a concise and
uniform syntax which can be used to create columns by applying functions to
vectors and/or sets of columns in {dplyr}. Ideally, this will:
- **reduce the amount of code** to create variables derived from existing colums,
which is especially helpful when doing explanatory data analysis (e.g. lagging,
collapsing, recoding etc. many variables in a similar way).
- **provide a clean {dplyr} approach** to create many variables which are
calculated based on two or more variables.
- **improve our mental model** so that it is easier to tackle problems where the
solution is based on creating new columns.
To illustrate the points above it is helpful to look at a couple of use cases.
This vignette will (i) look at four examples, (ii) present some alternatives
that do not rely on {dplyover} and (iii) elaborate on the pros and cons.
## Use cases and workarounds
In this section we will look at three different use cases of {dplyover}:
1. Create columns derived from a comma separated character vector
1. Creating several lagged variables for a set of columns
1. Applying functions to a set of variable pairs
Before we begin, here is our setup:
```{r setup, warning = FALSE, message = FALSE}
library(dplyr)
library(tidyr)
library(purrr)
library(dplyover)
library(ggplot2)
library(bench)
```
#### Create columns derived from a comma separated character vector
{dplyover} helps us to think in terms of columns that we want to create. A good
example, when this kind of thinking is useful, is when dealing with a comma
separated character vector. In the example below the column 'csat_open' contains
one or several comma separated reasons why a good or bad satisfaction rating was
given. We want to transform the data so that each unique reason in 'csat_open'
gets its own column in form of a dummy variable: taking `1` when the reason was
given and `0` otherwise. To do this we use `over` together with `dist_values`
which is a wrapper around `unique` that splits comma separated strings when a
separator is provided in `.sep`:
```{r, eval = TRUE}
csat %>%
select(csat_open) %>% # for better printing
mutate(over(dist_values(csat_open, .sep = ", "),
~ as.integer(grepl(.x, csat_open)),
.names = "rsp_{x}",
.names_fn = ~ gsub(" ", "_", .x))) %>%
glimpse # for better printing
```
`over` makes it easy to think about the problem we are dealing with. We just
iterate over all unique values in `csat_open` and check with `grepl` if the
value is present. The first two arguments would suffice to produce a useable
result. Here, we additionally use the `.names` and `.names_fn` arguments to
create nice column names by appending a prefix ('rsp_') and replacing the white
spaces with underscores.
A tidyverse approach that comes close to `over`'s functionality is to use
`purrr::map_dfc` inside `mutate`. The only difference is that we have to create
a named vector, which holds the new column names as names, and the unique
values of "csat_open" as values. This makes this approach less pipe-friendly.
```{r, eval = FALSE}
var_nms <- strsplit(csat$csat_open, ", ") %>% unlist %>% unique %>%
set_names(., paste0("rsp_", gsub(" ", "_", .)))
csat %>%
transmute(map_dfc(var_nms,
~ as.integer(grepl(.x, csat_open))))
```
While `purrr::map_dfc` is a valid workaround, it is less known and therefore not
often used in the wild (at least this is my anecdotal evidence that comes from
looking at answers to similar questions on Stackoverflow).
A more common way to tackle this problem within the tidyverse is to use
{tidyr}. Here we first use `separate_rows` and make the data longer. Then we use
`pivot_wider` to generate the columns we are interested in. Although the code is
not much longer, we have to do some thinking about how to set up `pviot_wider`
to yield our desired result. And although we make use of a lot of `pivot_wider`'s
arguments, the column names still require a finishing touch.
```{r, eval = TRUE}
csat %>%
separate_rows(csat_open, sep = ", ") %>%
pivot_wider(names_from = csat_open,
names_prefix = "rsp_",
names_repair = "universal",
values_from = csat_open,
values_fn = function(x) 1,
values_fill = 0) %>%
select(starts_with("rsp_")) %>% # for better printing
glimpse # for better printing
```
Another approach is to (i) get all unique values in "csat_open" and use them to
(ii) create empty columns filled with zeros. Then (iii) we can use `cur_column`
with `grepl` inside `dplyr::across` to test if the current column name is present
in 'csat_open'. This approach has two negative aspects. First, it is not
pipe-friendly. Second, creating nice column names isn't easy. This comes as a
surprise, since `dplyr::across` has a `.names` argument. This lets us append a
prefix, but it does not let us replace whitespaces with underscores. Further,
once we make use of the `.names` argument, new copies of the input columns are
generated, so we need to get rid of the old columns.
```{r, eval = TRUE}
var_nms <- strsplit(csat$csat_open, ", ") %>% unlist %>% unique
# instead of the set_names below we could do the following in base R:
# csat[, var_nms] <- 0
csat %>%
transmute(!!! set_names(rep(0, length(var_nms)), var_nms),
across(all_of(var_nms),
~ grepl(cur_column(), csat_open) %>% as.integer,
.names = "rsp_{col}")) %>%
select(!all_of(var_nms)) %>%
glimpse # for better printing
```
If we look at the performance of each approach, we can see that `map_dfc` is
fastest, while is `pivot_wider` is slowest. `across` and `over` range somewhere
in between.
```{r, echo = FALSE, eval = TRUE, warning = FALSE, message = FALSE}
csat2 <- csat
# over2_bench <- bench::mark(iterations = 50L, check = FALSE,
# over = {
# csat %>%
# transmute(over(dist_values(csat_open, .sep = ", "),
# ~ as.integer(grepl(.x, csat_open)),
# .names = "rsp_{x}",
# .names_fn = ~ gsub(" ", "_", .x)))
# },
# map_dfc = {
# local({var_nms <- strsplit(csat$csat_open, ", ") %>% unlist %>% unique %>%
# set_names(., paste0("rsp_", gsub(" ", "_", .)))
#
# csat %>%
# transmute(map_dfc(var_nms,
# ~ as.integer(grepl(.x, csat_open))))
# })},
# pivot_wider = {
# csat %>%
# separate_rows(csat_open, sep = ", ") %>%
# pivot_wider(names_from = csat_open,
# names_prefix = "rsp_",
# names_repair = "universal",
# values_from = csat_open,
# values_fn = function(x) 1,
# values_fill = 0)
# },
# across = {
# local({var_nms2 <- strsplit(csat$csat_open, ", ") %>% unlist %>% unique;
# csat2[, var_nms2] <- 0
# csat2 %>%
# mutate(across(all_of(var_nms2),
# ~ grepl(cur_column(), csat_open) %>% as.integer,
# .names = "rsp_{col}")) %>%
# select(!all_of(var_nms2))
# })}
# )
# over2_bench %>%
timings$over %>%
select(expression, median, mem_alloc)
```
#### Creating several lagged variables for a set of columns
In exploratory data analysis creating a range of lagged variants of several
variables is not an uncommon use case. Let's take the `diamonds` data set from
{ggplot2} and create five lagged versions of columns 'x', 'y' and 'z' by
positions `1` to `5`. `crossover` allows us to iterate a set of columns over a
vector:
```{r}
diamonds %>%
mutate(crossover(c(x,y,z),
1:5,
list(lag = ~ lag(.x, .y)),
.names = "{xcol}_{fn}{y}"))
```
While this was easy, there are of course other ways to approach this problem. We
could use `dplyr::across` and just insert the call to `lag` five times manually
in the `.fns` argument. However, in this vignette we only consider approaches
that save us from repeating code patterns. One option is using {purrr}'s
`map_dfc` inside `dplyr::across`:
```{r, eval = FALSE}
diamonds %>%
mutate(across(c(x,y,z),
~ map_dfc(set_names(1:5, paste0("lag", 1:5)),
function(y) lag(.x, y))
)) %>%
do.call(data.frame, .)
```
This approach is pretty close to `crossover`, but we need to take care of two
things. First, we have to create the variable names using `purrr::set_names`,
otherwise our new variables would be named `x...1`, `x...2` and so on. Second,
the original output of calling `purrr::map_dfc` within `dplyr::across` on three
columns are three `tibble`s, each containing the lagged variants of one column.
To turn each `tibble` into five proper columns of our original data we need to
pipe the final result into a `do.call(data.frame, .)`. All in all, although
staying in the tidyverse, the `map` call nested in `across`, together
with `set_names` and the final `do.call` make this approach less readable.
Finally, another approach would be to build a custom function and use
`purrr::reduce2` to call several `dplyr::mutate` calls in a row, each with the
input of the last call's output.
```{r, eval = FALSE}
create_lags <- function(df, .x, .y) {
mutate(df, "{.x}_lag{.y}" := lag(!! sym(.x), .y))
}
diamonds %>%
reduce2(rep(c("x", "y", "z"), 5),
rep(1:5, 3),
create_lags,
.init = .)
```
This approach is very clever, and {purrr}'s `reduce` function is a pipe-friendly
approach which works great with {dplyr}'s one-table verbs. However, here too we
have to take care of two things: First, using NSE in the custom function requires
some advanced knowledge of the tidyverse. (i) How do we create nice names on the
lefthand side of the walrus operator `:=`? (ii) How do we evaluate strings as
column names? Second, figuring out how to setup `reduce2` to work with our custom
function. Our original data goes in to `.init` argument, the arguments we want
to loop over need to be repeated in a way that ensures that each combination of
elements is created.
Finally, we compare the performance of each operation. The table below shows that
using `map_dfc` within `across` is the most performant approach, while using a
custom function with `reduce` is the least performant. `crossover` is not too far
off in terms of speed compared to the `map_dfc` approach. The big gain of using
`crossover` is that it reduces the cognitive load of figuring out how to use
advanced tidyverse workarounds (like `reduce` or a `map` call nested in `across`).
```{r, echo = FALSE, eval = TRUE, warning = FALSE, message = FALSE}
# create_lags <- function(df, .x, .y) {
# mutate(df, "{.x}_lag{.y}" := lag(!! sym(.x), .y))
# }
#
# crossover_bench <- bench::mark(iterations = 50L, check = FALSE,
# crossover = {
# diamonds %>%
# mutate(crossover(c(x,y,z),
# 1:5,
# list(lag = ~ lag(.x, .y)),
# .names = "{xcol}_{fn}{y}"))
# },
# across_map_dfc = {
# diamonds %>%
# mutate(across(c(x,y,z),
# ~ map_dfc(set_names(1:5, paste0("lag", 1:5)),
# function(y) lag(.x, y)))) %>%
# do.call(data.frame, .)
# },
#
# custom_fct_reduce = {
# diamonds %>%
# reduce2(rep(c("x", "y", "z"), 5),
# rep(1:5, 3),
# create_lags,
# .init = .)
# }
# )
# crossover_bench %>%
timings$crossover %>%
select(expression, median, mem_alloc)
```
#### Applying functions to a set of variable pairs
Some data sets contain many variable with similar patterns. A 'tiny' example is
the `ìris` data set, containing columns with combinations of 'Sepal' and 'Petal'
with 'Length' and 'Width'. In the example below we want to calculate the product
and the sum of both pairs of 'Length' and 'Width' variables. For problems like
like this `across2` is the perfect tool. It works just like `dplyr::across`,
but let's us choose *two* sets of columns - `.xcols` and `.ycols`. `across2`
allows us to use a special glue specification, `{pre}`, in its `.names` argument,
which will evaluate to the common prefix of each variable pair.
```{r}
iris %>%
mutate(across2(ends_with("Length"),
ends_with("Width"),
.fns = list(product = ~ .x * .y,
sum = ~ .x + .y),
.names = "{pre}_{fn}")) %>%
glimpse
```
While in this case a simple call to `mutate` (defining two new varialbes) would
have done the job, `across2` really comes in handy when we are dealing with data
sets with many variables that need to be computed in a similar way.
Again, there are different workarounds to produce the same result. One option is
to use `dplyr::across` and `dplyr::cur_column`. First we shorten the names of
'Sepal.Length' and 'Petal.Length' to their stem using `dplyr::rename_with`.
This allows us to use `paste0` to construct the names of the 'Width' variables.
Below we use `get` on the so constructed names to get the underlying values
(alternatively we could have used `eval(sym())`). While this comes close to our
desired result, we probably want to restore the original names. Unfortunately,
this requires an additional setup, which is why we rename the variables manually
below.
```{r, eval = FALSE}
iris %>%
rename_with(.fn = ~ gsub(".Length", "", .x),
.cols = ends_with("Length")) %>%
mutate(across(c(Sepal, Petal),
list(product = ~ .x * get(paste0(cur_column(), ".Width")),
sum = ~ .x + get(paste0(cur_column(), ".Width"))))) %>%
rename(Sepal.Length = Sepal,
Petal.Length = Petal)
```
Another approach is again {purrr}'s `reduce2` in combination with a custom
function. Apart from the data, `df`, our custom function takes two strings as
arguments: `.x` is the common stem of our variable pairs, and `.y` which is the
name of the operation we want to perform. The downside of this approach was
already mentioned above: (i) advanced NSE knowledge and (ii) figuring out how
to use `reduce2`, especially building the inputs using `rep`.
```{r, eval = FALSE}
create_vars <- function(df, .x, .y) {
mutate(df, "{.x}_{.y}" := switch(.y,
sum = !! sym(paste0(.x, ".Width")) +
!! sym(paste0(.x, ".Length")),
product = !! sym(paste0(.x, ".Width")) *
!! sym(paste0(.x, ".Length")))
)
}
iris %>%
reduce2(rep(c("Sepal", "Petal"), each = 2),
rep(c("product", "sum"), 2),
create_vars,
.init = .)
```
Comparing the three operations above in terms of speed shows that
`dplyr::across` is the most performant. This time, `reduce2` follows
closely and `across2` is the least performant. Although being slower, `across2`
has several advantages:
1. No knowledge of NSE or {purrr}'s `reduce` is needed.
1. Its the most readable approach.
1. It allows us to produce nice names, including a common pre- or suffix.
1. It does not depend on the variables having to share a common stem.
```{r, echo = FALSE, eval = TRUE, warning = FALSE, message = FALSE}
# create_vars <- function(df, .x, .y) {
# mutate(df, "{.x}_{.y}" := switch(.y,
# sum = !! sym(paste0(.x, ".Width")) + !! sym(paste0(.x, ".Length")),
# product = !! sym(paste0(.x, ".Width")) * !! sym(paste0(.x, ".Length")))
# )
# }
#
# across2_bench <- bench::mark(iterations = 50L, check = FALSE,
# across2 = {
# iris %>%
# mutate(across2(ends_with("Length"),
# ends_with("Width"),
# .fns = list(product = ~ .x * .y,
# sum = ~ .x + .y),
# .names = "{pre}_{fn}"))
# },
# across = {
# iris %>%
# rename_with(.fn = ~ gsub(".Length", "", .x),
# .cols = ends_with("Length")) %>%
# mutate(across(c(Sepal, Petal),
# list(product = ~ .x * get(paste0(cur_column(), ".Width")),
# sum = ~ .x + get(paste0(cur_column(), ".Width"))))) %>%
# rename(Sepal.Length = Sepal,
# Petal.Length = Petal)
# },
# custom_fct_reduce = {
# iris %>%
# reduce2(rep(c("Sepal", "Petal"), each = 2),
# rep(c("product", "sum"), 2),
# create_vars,
# .init = .)
# }
# )
# across2_bench %>%
# saveRDS(list(over2_bench, crossover_bench, across2_bench), "vignettes/why_bench.rds")
timings$across2 %>%
select(expression, median, mem_alloc)
```
## Conclusion
Above we have discussed three use cases of {dplyover}'s
*over-across function family*. We saw that each use case could be solved with
other tidyverse functions or workarounds. However, each alternative approach
had it's own drawback: Either creating nice column names wasn't straight forward,
or we had to get our head around NSE and advanced functions such as `reduce`.
Most of the alternative approaches were also less readable and less pipe-friendly.
But most of all, we saw that there is no uniform approach to a common problem:
creating columns based on vectors and/or sets of columns.
For this specific problem the *over-across function family*:
- provides a concise and uniform {dplyr} syntax similar to `dplyr::across`
- helps to write clean, readable, and pipe-friendly code without repetitions
- improves our mental model so that we can focus more on functions to create columns (and less on data rectangling)