(Much) faster unnesting with data.table

Today I was struggling with a relatively simple operation: unnest() from the tidyr package. What it’s supposed to do is pretty simple. When you have a data.frame where one or multiple columns are lists, you can unlist these columns while duplicating the information in other columns if the length of an element is larger than 1.

library(tibble)
df <- tibble(
  a = LETTERS[1:5],
  b = LETTERS[6:10],
  list_column = list(c(LETTERS[1:5]), "F", "G", "H", "I")
)
df
## # A tibble: 5 x 3
##   a     b     list_column
##   <chr> <chr> <list>     
## 1 A     F     <chr [5]>  
## 2 B     G     <chr [1]>  
## 3 C     H     <chr [1]>  
## 4 D     I     <chr [1]>  
## 5 E     J     <chr [1]>
library(tidyr)
unnest(df, list_column)
## # A tibble: 9 x 3
##   a     b     list_column
##   <chr> <chr> <chr>      
## 1 A     F     A          
## 2 A     F     B          
## 3 A     F     C          
## 4 A     F     D          
## 5 A     F     E          
## 6 B     G     F          
## 7 C     H     G          
## 8 D     I     H          
## 9 E     J     I

I came across this a lot while working on data from Twitter since individual tweets can contain multiple hashtags, mentions, URLs and so on, which is why they are stored in lists. unnest() is really helpful and very flexible in my experience since it makes creating, for example, a table of top 10 hashtags a piece of cake.

However, on large datasets, unnest() has its limitations (as I found out today). On a set with 1.8 million tweets, I was barely able to unnest the URL column and it would take forever on my laptop or simply crash at some point. In a completely new environment, unnesting the data took half an hour.

So let’s cut this time down to 10 seconds with data.table. In data.table, you would unlist like this1:

library(data.table)
dt <- as.data.table(df)
dt[, list(list_column = as.character(unlist(list_column))), by = list(a, b)]
##    a b list_column
## 1: A F           A
## 2: A F           B
## 3: A F           C
## 4: A F           D
## 5: A F           E
## 6: B G           F
## 7: C H           G
## 8: D I           H
## 9: E J           I

This is quite a bit longer than the tidyr code. So I wrapped it in a short function (note, that most of the code deals with quasiquotation so we can use it the same way as the original unnest()):

library(rlang)
unnest_dt <- function(tbl, col) {

  tbl <- as.data.table(tbl)

  col <- ensyms(col)

  clnms <- syms(setdiff(colnames(tbl), as.character(col)))

  tbl <- as.data.table(tbl)

  tbl <- eval(
    expr(tbl[, as.character(unlist(!!!col)), by = list(!!!clnms)])
  )

  colnames(tbl) <- c(as.character(clnms), as.character(col))

  tbl
}

On the surface, it does the same as unnest:

unnest_dt(df, list_column)
##    a b list_column
## 1: A F           A
## 2: A F           B
## 3: A F           C
## 4: A F           D
## 5: A F           E
## 6: B G           F
## 7: C H           G
## 8: D I           H
## 9: E J           I

But the function is extremely fast and lean. To show this, I do some benchmarking on a larger object. I scale the example ‘data.frame’ up from 5 to 50,000 rows since the overhead of loading a function will influence runtime much stronger on small-n data.

library(bench)
df_large <- dplyr::sample_frac(df, 10000, replace = TRUE)
res <- mark(
  tidyr = unnest(df_large, list_column),
  dt = unnest_dt(df_large, list_column)
)
res
## # A tibble: 2 x 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 tidyr         52.4s    52.4s    0.0191   16.77GB     6.38
## 2 dt           14.3ms   18.5ms   50.0       9.56MB    10.00
summary(res, relative = TRUE)
## # A tibble: 2 x 6
##   expression   min median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
## 1 tidyr      3666.  2832.        1      1796.     1   
## 2 dt            1      1      2617.        1      1.57

As you can see, data.table is 3666 times faster. That is pretty insane. But what is often even more important, the memory consumption is negligible with the data.table function compared to tidyr. When trying to unnest my Twitter dataset with 1.8 million tweets, my computer would choke on the memory issue and even throw an error if I had some other large objects loaded.

Admittedly the function is not perfect. It is far less flexible than unnest, especially since it only runs on one variable at the time. However, this covers 95% of my usage of unnest and I would only consider including it in a script if performance is key. Or if the great people at data.table write their own unnest some day.

Update 28/10/2019

As cuttlefish44 pointed out in the comments, the rather extreme performance difference was partly caused by an issue specific to tidyr version 1.0.0, which has been partly resolved already2. When running the same benchmarks as above with tidyr 0.8.3, data.table was only around 13 times faster; when running with the development version of vctrs, data.table was around 250 times faster.

Update 20/11/2019

ismirsehregal pointed out that one of the shortcomings of the original function (that it can only deal with one column at a time), can be solved more easily than I thought:

unnest_dt2 <- function(tbl, ...) {

  tbl <- as.data.table(tbl)

  col <- ensyms(...)

  clnms <- syms(setdiff(colnames(tbl), as.character(col)))

  tbl <- as.data.table(tbl)

  tbl <- eval(
    expr(tbl[, lapply(.SD, unlist), by = list(!!!clnms), .SDcols = as.character(col)])
  )
  
  colnames(tbl) <- c(as.character(clnms), as.character(col))

  tbl
}

This function can deal with multiple columns at a time:

DT <- data.table(
  a = LETTERS[1:5],
  b = LETTERS[6:10],
  list_column1 = list(c(LETTERS[1:5]), "F", "G", "H", "I"),
  list_column2 = list("A", "B", "C", "D", c(LETTERS[5:9]))
)

unnest_dt2(DT, list_column1, list_column2)
##     a b list_column1 list_column2
##  1: A F            A            A
##  2: A F            B            A
##  3: A F            C            A
##  4: A F            D            A
##  5: A F            E            A
##  6: B G            F            B
##  7: C H            G            C
##  8: D I            H            D
##  9: E J            I            E
## 10: E J            I            F
## 11: E J            I            G
## 12: E J            I            H
## 13: E J            I            I

  1. Source: this answer from @akrun: https://stackoverflow.com/a/40420690/5028841, which I think should be added to data.table’s documentation somewhere.

  2. See issue #751

R data.table

Related