Column-wise and row-wise operations in dplyr and tidyverse
Published:
With the development of dplyr or its umbrella package tidyverse, it becomes quite straightforward to perform operations over columns or rows in R. These column- or row-wise methods can also be directly integrated with other dplyr verbs like select
, mutate
, filter
and summarise
, making them more comparable with other functions in apply
or map
families. In this blog, I will briefly cover some useful manipulations over rows or columns.
1. Column-wise operation
Example 1: select those string columns with less than 5 levels in the dataset of starwars.
# anonymous function
starwars %>%
select_if(~ is.character(.x) & length(unique(.x)) <= 5)
We can use a convenient function select_if
to identify certain columns by multiple criteria. This is essentially equivalent to the following expanded format.
# expanded form
starwars %>%
select_if(function(x) is.character(x) & length(unique(x)) <= 5)
It is worth noting that we are using tilde ($\sim$) to define an anonymous function, and thus we should use $.x$ to refer to the selected columns. See this link for detailed illustration of tilde ($\sim$), dot ($.$) and dot x ($.x$) in dplyr. If there is only one single argument in the anonymous function (in most cases), you can use dot (.) or placeholder (…1) as alternatives. In other words, the dot (.), placeholder (…1) and dot x (.x) are equivalents in a single argument anonymous function. For details, see the underlying function specifications viaas_mapper
.
as_mapper(~ length(unique(.x)))
# <lambda>
# function (..., .x = ..1, .y = ..2, . = ..1)
# length(unique(.x))
# attr(,"class")
# [1] "rlang_lambda_function" "function"
If you want to calculate the levels of those selected columns, you can try across
function and summarise
the number of levels by column.
# solution 1
starwars %>%
summarise(across(where(is.character), ~ length(unique(.x))))
# solution 2 (recommended)
starwars %>%
summarise_if(is.character, ~ length(unique(.x)))
Alternatively, you can make use of the map
or map_dbl
function in purrr by the following command. Note that when a map
function is applied to a data.frame, it will operate over columns by default.
# solution 3
starwars %>%
select_if(~ is.character(.x)) %>%
map_dbl(~length(unique(.x)))
Example 2: select those numeric columns and calculate the means and sds across columns in the dataset of starwars.
# solution 1
starwars %>%
summarise(across(where(~ is.numeric(.x)),
list(Mean = ~ mean(.x, na.rm = T),
Sd = ~ sd(.x, na.rm = T))))
This example provides us a good illustration of the use of dot x (.x) in dplyr style syntax, since we have some missing values (NAs) in certain columns. Thus, we need to specify the parameter with na.rm = T
inside the functions.
There is indeed a more convenient and elegant way of solving this by using the function of summarise_if
. It allows us to select certain columns and operate by columns like this:
# solution 2 (recommended)
starwars %>%
summarise_if(is.numeric,
list(Sum = sum, Mean = mean, Sd = sd),
na.rm = T)
If you are more comfortable with the map
function in purrr, you can resolve this issue in a different way. Note: we are mapping three functions to a data.frame, and combining them as named vectors. This will return a list of named vectors.
# solution 3
starwars %>%
select_if(is.numeric) %>%
map(., ~ c(Sum = sum(.x, na.rm = T),
Mean = mean(.x, na.rm = T),
Sd = sd(.x, na.rm = T)))
2. Row-wise operation
Example 3: calculate the sums, means and sds for each row for the dataset of iris.
iris %>%
rowwise() %>%
mutate(
Rowsum = sum(c_across(Sepal.Length:Petal.Width)),
Rowsd = sd(c_across(Sepal.Length:Petal.Width)),
Rowmean = mean(c_across(Sepal.Length:Petal.Width))
) %>%
ungroup()
Here the function c_across
is specifically designed to work with rowwise
operations. Note: rowwise
groups your data by row (class: rowwise_df), and it is best to ungroup
immediately. Of course, if you are more comfortable with apply
function, you can also use the following command.
iris %>%
select(Sepal.Length:Petal.Width) %>%
apply(., 1, function(x) c(sum(x), sd(x), mean(x))) %>%
as.tibble() %>% t()
Useful links