
Why I use data.table. Part 1


Preface

The R package data.table showed up in my site-library in the summer of 2013. The problem I was working on at the time can be broadly described as a binary classification task, very similar to fraud detection. It was supposed to run overnight as part of the data warehouse loading jobs. Modeling took me a while, but I ended up with a surprisingly accurate model built on a handful of meaningful predictors. Everything was looking great until I realized that my model would not scale. To achieve acceptable predictive accuracy I had constructed a rather complex set of features, and regular R data frames were just too slow to build them; the transformations needed to create the features took a very long time to run even on moderately sized data. Looking for alternatives led to my discovery of data.table. Since then I have routinely used data.table in all my work.

Basics

The speed advantage of data.table over, say, plain data.frame or the dplyr package is well known (see, for example, data.table vs dplyr, A speed test comparison of plyr, data.table, and dplyr and Comparison Packages for Data Manipulation).

The extremely fast data manipulations of data.table are achieved in part by performing all operations on data.table objects “by reference”, i.e. by avoiding expensive in-memory copies. This is worth keeping in mind. As an illustration, let’s say you want to copy a data.table object to a temporary one to play with it. The intuitive way of doing so would be to use the assignment operator and type something like this:

> dt.temp <- dt # Create a temp copy of "dt" to play with.

Then you do something really stupid to dt.temp, eventually give up and want to forget the whole thing. Going back to dt brings a surprise: everything you did to the poor little dt.temp is “replicated” in dt. You have just turned your working data set into a mess.

The name of a data.table object is in fact a pointer to the data structure sitting in memory. Consequently, the assignment operator just creates another pointer without copying the object itself. How do you create a “true” copy then? You need to explicitly call copy() on the data.table. Let me illustrate.

Get the data first

> library(data.table,quietly = TRUE) # load package
> # Load Car data set from UCI ML repository
> url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
> d <- read.table(url, sep = ',', header = FALSE)

> class(d)
[1] "data.frame"

> dt <- data.table(d) # convert data.frame to data.table 
> class(dt)           # class "data.table" inherits class "data.frame" 
[1] "data.table" "data.frame"

> # Notice how data.table displays only the first five 
> # and the last five records. Compare to what happens
> # when you type in the name of a data.frame object 
> dt 
         V1    V2    V3   V4    V5   V6    V7
   1: vhigh vhigh     2    2 small  low unacc
   2: vhigh vhigh     2    2 small  med unacc
   3: vhigh vhigh     2    2 small high unacc
   4: vhigh vhigh     2    2   med  low unacc
   5: vhigh vhigh     2    2   med  med unacc
  ---                                        
1724:   low   low 5more more   med  med  good
1725:   low   low 5more more   med high vgood
1726:   low   low 5more more   big  low unacc
1727:   low   low 5more more   big  med  good
1728:   low   low 5more more   big high vgood
> 

Now let’s try the assignment operator "<-" and the function copy().

> tracemem(dt)
[1] "<0x0000000030c52e20>"
> dt1 <- dt      # assignment 
> tracemem(dt1)
[1] "<0x0000000030c52e20>"
> dt2 <- copy(dt) # explicit copy
tracemem[0x0000000030c52e20 -> 0x00000000102fedf8]: copy 
> tracemem(dt2)
[1] "<0x0000000030d0af18>"
> untracemem(dt)
> untracemem(dt1)
> untracemem(dt2)
>

Indeed, the addresses of dt and dt1 are the same whereas the address of dt2 is different.

By the way, the useful function tables() of the data.table package lists all data tables currently sitting in memory.

> tables()
     NAME  NROW NCOL MB COLS                 KEY
[1,] dt   1,728    7  1 V1,V2,V3,V4,V5,V6,V7    
[2,] dt1  1,728    7  1 V1,V2,V3,V4,V5,V6,V7    
[3,] dt2  1,728    7  1 V1,V2,V3,V4,V5,V6,V7    
Total: 3MB
> 

Similarly to data.frame and matrix, the functions nrow(), ncol() and dim() return the dimensions of a data.table object. The functions names(), rownames(), colnames(), summary() and str() also work as expected, in the same way as they do for data frames.
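A minimal sketch of this, on a small made-up table rather than the Car data, so it stands alone:

```r
library(data.table)

# A tiny illustrative data.table; names and values are made up
dt.demo <- data.table(x = 1:4, y = letters[1:4])

dim(dt.demo)    # 4 2
nrow(dt.demo)   # 4
ncol(dt.demo)   # 2
names(dt.demo)  # "x" "y"
```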

It must be the magic of inheritance from data.frame that passing a data table to any function whose data parameter expects a data.frame always works, be it lm(), glm(), gbm(), randomForest(), plot(), ggplot() and probably many others. There must be some exceptions, just by virtue of Murphy's Law, but I do not recall ever stumbling upon one (when in doubt, use data.frame(DT) to coerce a data table DT to a data frame).
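For instance, a data table can go straight into a model-fitting call; a minimal sketch using the built-in mtcars data (the formula here is arbitrary, purely for illustration):

```r
library(data.table)

dt.cars <- data.table(mtcars)             # a built-in data set as a data.table
fit <- lm(mpg ~ wt + hp, data = dt.cars)  # lm() accepts it where a data.frame is expected
class(fit)                                # "lm"
```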

The familiar syntax for row subsetting is valid, too:

> dt[3:5,]
      V1    V2 V3 V4    V5   V6    V7
1: vhigh vhigh  2  2 small high unacc
2: vhigh vhigh  2  2   med  low unacc
3: vhigh vhigh  2  2   med  med unacc
>

Column selection by position does not work as expected, however.

> dt[3:5,4:7]
[1] 4 5 6 7
>

To achieve the same effect of column subsetting as in data frames, use the syntax DT[i, j, with=FALSE]:

> dt[3:5,4:7,with=FALSE]
   V4    V5   V6    V7
1:  2 small high unacc
2:  2   med  low unacc
3:  2   med  med unacc
>

The reasons why it is done this way (as well as the reasons why referencing columns by position is generally a bad idea) are explained in the answer to the first question of the data.table FAQ.

As if to compensate for this "inconvenience", column subsetting by name does not require quotes:

> dt[3:5,list(V4, V5, V6, V7)]
   V4    V5   V6    V7
1:  2 small high unacc
2:  2   med  low unacc
3:  2   med  med unacc
>

Columns (or fields)

The column names of the data table we created were set to their default values:

> names(dt)
[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7"

In principle, there is nothing wrong with setting them to some meaningful values using the assignment operator, i.e.

names(dt) <- LETTERS[1:7]

However, data.table issues a big fat warning saying that this syntax copies the whole object, which is not good. Instead, we should use setnames() (see the field names of the Car data set at the UCI ML repository):

setnames(dt, c( "buying", "maint"
              , "doors", "persons"
              , "lug_boot", "safety", "class") )

In addition to the performance benefit, setnames() allows you to replace only a subset of the column names.
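A minimal sketch of such a partial rename, on a made-up table (old and new are the standard setnames() arguments):

```r
library(data.table)

dt.x <- data.table(V1 = 1:3, V2 = 4:6, V3 = 7:9)
setnames(dt.x, old = "V2", new = "value")  # rename V2 only, by reference
names(dt.x)  # "V1" "value" "V3"
```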

Now, let's see what we got

> summary(dt)
   buying      maint       doors     persons     lug_boot    safety      class     
 high :432   high :432   2    :432   2   :576   big  :576   high:576   acc  : 384  
 low  :432   low  :432   3    :432   4   :576   med  :576   low :576   good :  69  
 med  :432   med  :432   4    :432   more:576   small:576   med :576   unacc:1210  
 vhigh:432   vhigh:432   5more:432                                     vgood:  65  
>

All fields are factors, which is not surprising because doors and persons contain character values ('5more' and 'more'). I want to remove these records from the data and have the fields doors and persons as integers. Data table subsetting will be covered later, but for the moment I am just going to do it without further explanation. Notice, however, the elegance of this syntax.

> dt1 <- dt[doors != '5more' & persons != 'more', ]
> tables()
     NAME  NROW NCOL MB COLS                                             KEY
[1,] dt   1,728    7  1 buying,maint,doors,persons,lug_boot,safety,class    
[2,] dt1    864    7  1 buying,maint,doors,persons,lug_boot,safety,class    
[3,] dt2  1,728    7  1 V1,V2,V3,V4,V5,V6,V7                                
Total: 3MB
> 

The data table dt1 that used to be a pointer to dt itself is now overwritten by the result of this operation. Let's take a closer look

> summary(dt1)
   buying      maint       doors     persons     lug_boot    safety      class    
 high :216   high :216   2    :288   2   :432   big  :288   high:288   acc  :147  
 low  :216   low  :216   3    :288   4   :432   med  :288   low :288   good : 27  
 med  :216   med  :216   4    :288   more:  0   small:288   med :288   unacc:670  
 vhigh:216   vhigh:216   5more:  0                                     vgood: 20  
>

OK, it looks alright, but the values '5more' and 'more' are still shown, even though no records contain them. This issue is a common source of frustration when dealing with factors in R. We need to rebuild the fields doors and persons. Since data.table is a subclass of data.frame, I suspect that its fields can be accessed directly with the $ operator, so something like x <- factor(x) may work.

> dt1$doors <- factor(dt1$doors)
> dt1$persons <- factor(dt1$persons)

And it does:

> summary(dt1)
   buying      maint     doors   persons  lug_boot    safety      class    
 high :216   high :216   2:288   2:432   big  :288   high:288   acc  :147  
 low  :216   low  :216   3:288   4:432   med  :288   low :288   good : 27  
 med  :216   med  :216   4:288           small:288   med :288   unacc:670  
 vhigh:216   vhigh:216                                          vgood: 20
>

Encouraged by the similarities with data.frame, let's finally convert dt1$doors and dt1$persons to integers using the same trick, i.e. accessing a data table's fields as if they were fields of a data frame.

> dt1$doors <- as.integer(as.character(dt1$doors))
> dt1$persons <- as.integer(as.character(dt1$persons))
> summary(dt1)
   buying      maint         doors      persons   lug_boot    safety      class    
 high :216   high :216   Min.   :2   Min.   :2   big  :288   high:288   acc  :147  
 low  :216   low  :216   1st Qu.:2   1st Qu.:2   med  :288   low :288   good : 27  
 med  :216   med  :216   Median :3   Median :3   small:288   med :288   unacc:670  
 vhigh:216   vhigh:216   Mean   :3   Mean   :3                          vgood: 20  
                         3rd Qu.:4   3rd Qu.:4                                     
                         Max.   :4   Max.   :4      
>

Now we have these two fields as integers. Converting the factor levels to characters first is necessary for this operation to work. I covered my somewhat painful experience working with R factors in another post.

Finally, the story wouldn't be complete without the function setcolorder(), with no word "color" in the middle. This function sets the column order of the data table. The beauty of it is that passing column positions as numbers is acceptable; this is one of the few rare instances where referencing columns by position could be justified.

> names(dt1)
[1] "buying"   "maint"    "doors"    "persons"  "lug_boot" "safety"   "class"   
> setcolorder(dt1,neworder=ncol(dt1):1)
> names(dt1)
[1] "class"    "safety"   "lug_boot" "persons"  "doors"    "maint"    "buying"  
> 

Just keep in mind that the result is the same data table, only with a different column ordering. Both setcolorder() and setnames() described above work on data tables directly, by reference, making all changes "in place"; they do not return anything.

Key

The key is the second secret weapon behind data.table's speedy performance, in addition to the already discussed data access by reference. A key is the direct equivalent of an index on a database table. Like a regular database table, a data.table supports a composite key that includes more than one field. The command setkey() creates a key on the specified columns, and key() displays the fields included in the key.

> setkey(dt,class,safety)
> key(dt)
[1] "class"  "safety"
> 

Creating a key is like sorting. In other words, after the key is created, the data table dt is sorted by class first, and then by safety. You can type dt at the command line to check. Every operation that uses these fields for a lookup now benefits from them being sorted. Obviously, having the right key makes a huge difference when working with large data.table objects. Sorting is always performed in ascending order.

I suspect that in reality these fields are not literally "sorted", and nobody reshuffles the records in memory every time setkey() is issued. For all intents and purposes, however, setkey() works like a sort() operation. I often use setkey() just to sort a data table, i.e. to put records in a certain order. Needless to say, the speeds of sorting a data frame using order() and a data table using setkey() are very different. You can guess who the winner is.
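A quick sketch of setkey() used purely as a sort, on a made-up table:

```r
library(data.table)

dt.s <- data.table(grp = c("b", "a", "b", "a"), val = c(2, 1, 4, 3))
setkey(dt.s, grp, val)  # sorts by grp, then val, ascending, by reference

dt.s$grp  # "a" "a" "b" "b"
dt.s$val  # 1 3 2 4
```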

Keys are also essential components of subsetting and merging (joining) of data tables, making these operations extremely fast. Every DBA knows that to join two tables efficiently, the fields used for the join should be indexed. That's exactly what setkey() does.
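A minimal sketch of a keyed join using the X[Y] syntax, with two made-up tables (the table and column names are arbitrary):

```r
library(data.table)

sales   <- data.table(id = c(1L, 2L, 2L), amount = c(10, 20, 30))
clients <- data.table(id = c(1L, 2L, 3L), name = c("ann", "bob", "cat"))
setkey(sales, id)    # index the join field on both sides
setkey(clients, id)

# X[Y]: for each row of clients, look up matching rows in sales
res <- sales[clients]
res  # id 2 matches twice; id 3 has no sale, so its amount is NA
```

By default X[Y] keeps every row of Y, so clients with no matching sale come back with NA in the sales columns.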

Oh, and I almost forgot to mention: to change the number of displayed records, use print() with the second parameter specifying how many records at the top and at the bottom of the data table to display. If you think that head() and tail() work in the same way, you are correct.

> print(dt,3)
      buying maint doors persons lug_boot safety class
   1:  vhigh   med     2       4    small   high   acc
   2:  vhigh   med     2       4      med   high   acc
   3:  vhigh   med     2       4      big   high   acc
  ---                                                 
1726:    low   low 5more       4      big   high vgood
1727:    low   low 5more    more      med   high vgood
1728:    low   low 5more    more      big   high vgood
> head(dt,3)
   buying maint doors persons lug_boot safety class
1:  vhigh   med     2       4    small   high   acc
2:  vhigh   med     2       4      med   high   acc
3:  vhigh   med     2       4      big   high   acc
> tail(dt,3)
   buying maint doors persons lug_boot safety class
1:    low   low 5more       4      big   high vgood
2:    low   low 5more    more      med   high vgood
3:    low   low 5more    more      big   high vgood
> 

These are some of the reasons why I use data.table for all my work. Just to recap: the speed, the intuitive ease of use (assuming you are comfortable with data.frame, of course), and the graceful fallback to data.frame when needed are its three major advantages. There are other things about this package that I really like, and they will be covered in subsequent posts.

Vlad.R
