The R package data.table showed up in my site-library in the summer of 2013. The problem I was working on at that time can be broadly described as a binary classification task, very similar to fraud detection. It was supposed to run overnight as part of the data warehouse loading jobs. Modeling took me a while, but I ended up with a surprisingly accurate model with a handful of meaningful predictors. So everything was looking great until I realized that my model would not scale. To achieve acceptable predictive accuracy I had constructed a rather complex set of features. Regular R data frames were just too slow to build them; the transformations needed to create the features took a very long time to run even on moderately sized data. Looking for alternatives resulted in my discovery of data.table. Since then, I have routinely used data.table in all my work.
The speed advantage of data.table over, say, plain data.frame or the dplyr package is well known (see, for example, data.table vs dplyr; A speed test comparison of plyr, data.table, and dplyr; and Comparison Packages for Data Manipulation).
data.table achieves its extremely fast data manipulation in part by doing all operations on data.table objects “by reference”, i.e. by avoiding expensive in-memory copies. Keeping this in mind is useful. As an illustration, let’s say you want to copy a data.table object to a temporary one to play with it. The intuitive way of doing so would be to use the assignment operator and type in something like this:
> dt.temp <- dt # Create a temp copy of "dt" to play with.
Then you do something really stupid to dt.temp, eventually give up and want to forget the whole thing. Going back to dt ends in a surprise. Everything you did to the poor little dt.temp is “replicated” in dt. You have just turned your working data set into a mess.
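A minimal sketch of this trap (the column name y and the toy data are my own invention for illustration): dropping a column from the "copy" by reference also drops it from the original.

```r
library(data.table)

dt <- data.table(x = 1:3, y = c("a", "b", "c"))
dt.temp <- dt        # no copy is made; both names point to the same object

dt.temp[, y := NULL] # the "really stupid" thing: drop a column by reference

names(dt)            # "x" -- column y is gone from dt as well
```

The `:=` operator modifies the underlying object in place, and since dt and dt.temp are the same object, dt suffers too.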
The name of a data.table object is in fact a pointer to the data structure sitting in the memory. Consequently, the assignment operator just creates another pointer without copying the object itself. How to create a “true” copy then? We need to explicitly call
copy() on the data.table for that. Let me illustrate.
Get the data first
> library(data.table, quietly = TRUE)  # load package
> # Load Car data set from UCI ML repository
> url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
> d <- read.table(url, sep = ',', header = F)
> class(d)
[1] "data.frame"
> dt <- data.table(d)  # convert data.frame to data.table
> class(dt)            # class "data.table" inherits class "data.frame"
[1] "data.table" "data.frame"
> # Notice how data.table displays only the first five
> # and the last five records. Compare to what happens
> # when you type in the name of a data.frame object
> dt
         V1    V2    V3   V4    V5   V6    V7
   1: vhigh vhigh     2    2 small  low unacc
   2: vhigh vhigh     2    2 small  med unacc
   3: vhigh vhigh     2    2 small high unacc
   4: vhigh vhigh     2    2   med  low unacc
   5: vhigh vhigh     2    2   med  med unacc
  ---
1724:   low   low 5more more   med  med  good
1725:   low   low 5more more   med high vgood
1726:   low   low 5more more   big  low unacc
1727:   low   low 5more more   big  med  good
1728:   low   low 5more more   big high vgood
>
Now let’s try the assignment operator "<-" and the function copy(), using tracemem() to watch the memory addresses
> tracemem(dt)
[1] "<0x0000000030c52e20>"
> dt1 <- dt        # assignment
> tracemem(dt1)
[1] "<0x0000000030c52e20>"
> dt2 <- copy(dt)  # explicit copy
tracemem[0x0000000030c52e20 -> 0x00000000102fedf8]: copy
> tracemem(dt2)
[1] "<0x0000000030d0af18>"
> untracemem(dt)
> untracemem(dt1)
> untracemem(dt2)
>
Indeed, the addresses of dt and dt1 are the same whereas the address of dt2 is different.
By the way, the useful function
tables() of the data.table package lists all data tables currently sitting in memory. (Note that it requires the RMySQL package to work... weird.)
> tables()
     NAME NROW NCOL MB COLS                 KEY
[1,] dt  1,728    7  1 V1,V2,V3,V4,V5,V6,V7
[2,] dt1 1,728    7  1 V1,V2,V3,V4,V5,V6,V7
[3,] dt2 1,728    7  1 V1,V2,V3,V4,V5,V6,V7
Total: 3MB
>
As with data.frame and matrix, the function dim() returns the dimensions of a data.table object. Functions like str() also work as expected, in the same way they work for data frames.
It must be the magic of inheritance from data.frame that passing a data table to almost any function whose 'data' parameter expects a data.frame just works, be it ggplot() or many others. There must be some exceptions, just by virtue of Murphy's Law, but I do not recall ever stumbling upon one (when in doubt, use data.frame(DT) to coerce a data table DT to a data frame).
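As a quick sanity check of this inheritance (the toy data and model are mine, not from the post), a data.table can be fed straight into a model-fitting function like lm(), with data.frame(DT) as the explicit fallback:

```r
library(data.table)

set.seed(42)
DT <- data.table(x = 1:10, y = 2 * (1:10) + rnorm(10))

# lm() expects a data.frame-like 'data' argument; the data table passes for one
fit <- lm(y ~ x, data = DT)
coef(fit)          # intercept and slope, close to 0 and 2

# If some function does choke on the extra class, coerce explicitly:
df <- data.frame(DT)
class(df)          # "data.frame"
```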
The familiar syntax for row subsetting is valid, too
> dt[3:5, ]
      V1    V2 V3 V4    V5   V6    V7
1: vhigh vhigh  2  2 small high unacc
2: vhigh vhigh  2  2   med  low unacc
3: vhigh vhigh  2  2   med  med unacc
>
Column selection by position does not work as expected, however.
> dt[3:5, 4:7]
[1] 4 5 6 7
>
To achieve the same effect of column subsetting as in data frames, use the syntax
> dt[3:5, 4:7, with = FALSE]
   V4    V5   V6    V7
1:  2 small high unacc
2:  2   med  low unacc
3:  2   med  med unacc
>
The reasons why it's done this way (as well as reasons why referencing columns by position is generally a bad idea) are explained in the answer to the first question of the data.table FAQ.
As if to compensate for this "inconvenience", column subsetting by column names does not require quotes:
> dt[3:5, list(V4, V5, V6, V7)]
   V4    V5   V6    V7
1:  2 small high unacc
2:  2   med  low unacc
3:  2   med  med unacc
>
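When the column names live in a character vector (common in programmatic use), with = FALSE accepts names as well as positions. A small sketch, with toy data of my own:

```r
library(data.table)

dt <- data.table(V1 = 1:5, V2 = letters[1:5], V3 = LETTERS[1:5])

# Unquoted names inside list() work interactively:
dt[1:2, list(V2, V3)]

# A character vector of names needs with = FALSE, just like numeric positions:
cols <- c("V2", "V3")
dt[1:2, cols, with = FALSE]
```

Both calls return the same two-column subset; the second form is handy inside functions where the column set is not known in advance.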
Columns (or fields)
Column names of the data table we created were set to their default values:
> names(dt)
[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7"
In principle, there is nothing wrong with setting them to some meaningful values by using the assignment operator, i.e.
names(dt) <- LETTERS[1:7]
However, data.table sends us a big fat warning saying that this syntax copies the whole object, which is not good. Instead, we should use
setnames() (the names below are the field names of the Car data set at the UCI ML repository)
setnames(dt, c("buying", "maint", "doors", "persons", "lug_boot", "safety", "class"))
In addition to performance,
setnames() allows you to replace only a subset of the column names.
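A sketch of the partial rename, using toy column names of my own; setnames(x, old, new) touches only the columns listed in old and makes no copy:

```r
library(data.table)

dt <- data.table(V1 = 1:3, V2 = 4:6, V3 = 7:9)

# Rename only V2, by reference; V1 and V3 are left alone
setnames(dt, old = "V2", new = "middle")

names(dt)   # "V1" "middle" "V3"
```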
Now, let's see what we got
> summary(dt)
   buying      maint       doors      persons    lug_boot    safety      class
 high :432   high :432   2    :432   2   :576   big  :576   high:576   acc  : 384
 low  :432   low  :432   3    :432   4   :576   med  :576   low :576   good :  69
 med  :432   med  :432   4    :432   more:576   small:576   med :576   unacc:1210
 vhigh:432   vhigh:432   5more:432                                     vgood:  65
>
All fields are factors, which is not surprising because doors and persons have values of the character type ('5more' and 'more'). I want to remove these records from the data and have the fields doors and persons as integers. Data table subsetting will be covered later, but for the moment I am just going to do it without further explanation. Notice, however, the elegance of this syntax.
> dt1 <- dt[doors != '5more' & persons != 'more', ]
> tables()
     NAME NROW NCOL MB COLS                                             KEY
[1,] dt  1,728    7  1 buying,maint,doors,persons,lug_boot,safety,class
[2,] dt1   864    7  1 buying,maint,doors,persons,lug_boot,safety,class
[3,] dt2 1,728    7  1 V1,V2,V3,V4,V5,V6,V7
Total: 3MB
>
The data table
dt1 that used to be a pointer to
dt itself is now overwritten by the result of this operation. Let's take a closer look
> summary(dt1)
   buying      maint       doors      persons    lug_boot    safety      class
 high :216   high :216   2    :288   2   :432   big  :288   high:288   acc  :147
 low  :216   low  :216   3    :288   4   :432   med  :288   low :288   good : 27
 med  :216   med  :216   4    :288   more:  0   small:288   med :288   unacc:670
 vhigh:216   vhigh:216   5more:  0                                     vgood: 20
OK, it looks alright, but I still have the values '5more' and 'more' shown, even though there are no records that contain them. This issue is a common source of frustration when dealing with factors in R. We need to rebuild the fields doors and persons. Since data.table is a subclass of data.frame, I suspect that its fields can be accessed directly by the
$ operator. So something like
x <- factor(x) may work.
> dt1$doors <- factor(dt1$doors)
> dt1$persons <- factor(dt1$persons)
And it does:
> summary(dt1)
   buying      maint     doors   persons   lug_boot    safety      class
 high :216   high :216   2:288   2:432    big  :288   high:288   acc  :147
 low  :216   low  :216   3:288   4:432    med  :288   low :288   good : 27
 med  :216   med  :216   4:288            small:288   med :288   unacc:670
 vhigh:216   vhigh:216                                           vgood: 20
>
Encouraged by similarities with data.frame, let's finally convert dt1$doors and dt1$persons to integers using the same trick, i.e. accessing the data table's fields as if they were fields of a data frame.
> dt1$doors <- as.integer(as.character(dt1$doors))
> dt1$persons <- as.integer(as.character(dt1$persons))
> summary(dt1)
   buying      maint          doors       persons    lug_boot    safety      class
 high :216   high :216   Min.   :2   Min.   :2    big  :288   high:288   acc  :147
 low  :216   low  :216   1st Qu.:2   1st Qu.:2    med  :288   low :288   good : 27
 med  :216   med  :216   Median :3   Median :3    small:288   med :288   unacc:670
 vhigh:216   vhigh:216   Mean   :3   Mean   :3                           vgood: 20
                         3rd Qu.:4   3rd Qu.:4
                         Max.   :4   Max.   :4
>
Now we have these two fields as integers. Converting the factor levels to characters first is necessary for this operation to work: as.integer() applied directly to a factor returns the underlying level codes, not the labels. I covered my somewhat painful experience working with R factors in another post.
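In the spirit of data.table, the same conversion can also be done by reference with the := operator, avoiding the copy that $<- assignment can trigger. A sketch with toy data of my own:

```r
library(data.table)

dt1 <- data.table(doors   = factor(c("2", "3", "4")),
                  persons = factor(c("2", "4", "4")))

# Convert the columns in place, by reference -- no copy of dt1 is made
dt1[, doors   := as.integer(as.character(doors))]
dt1[, persons := as.integer(as.character(persons))]

sapply(dt1, class)   # both columns are now "integer"
```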
Finally, the story wouldn't be complete without function
setcolorder(), which has no word "color" in the middle. This function sets the column order of a data table. The beauty of it is that passing column positions as numbers is acceptable. This is one of the few rare instances when referencing columns by their positions can be justified.
> names(dt1)
[1] "buying" "maint" "doors" "persons" "lug_boot" "safety" "class"
> setcolorder(dt1, neworder = ncol(dt1):1)
> names(dt1)
[1] "class" "safety" "lug_boot" "persons" "doors" "maint" "buying"
>
Just keep in mind that the result is the same data table, only with a different column ordering. Both setnames() and setcolorder() work on data tables directly, by reference, making all the changes "in place". They do not return anything.
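setcolorder() also accepts a character vector of column names, which is usually the safer choice. A sketch with toy columns of my own:

```r
library(data.table)

dt <- data.table(a = 1, b = 2, c = 3)

# Reorder by name, by reference; nothing is returned
setcolorder(dt, c("c", "a", "b"))

names(dt)   # "c" "a" "b"
```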
The key is the second secret weapon behind data.table's speedy performance, in addition to the already discussed data access by reference. A key is a direct equivalent of a database table index. Like a regular database table, a data.table supports a composite key that includes more than one field. The command
setkey() creates the key on specified columns and
key() displays fields included in the key.
> setkey(dt, class, safety)
> key(dt)
[1] "class" "safety"
>
Creation of a key is like sorting. In other words, after the key is created, the data table dt becomes sorted by class first, and then by safety. You can type
dt in the command line to check. Every operation that uses these fields for a lookup now benefits from them being sorted. Obviously, having the right key makes a huge difference when working with large data table objects. Sorting is always performed in ascending order.
I suspect that in reality these fields are not exactly "sorted", and nobody reshuffles the records in memory every time
setkey() is issued. For all intents and purposes, however,
setkey() works like
sort() operation. I often use
setkey() just to sort a data table, i.e. to put records in a certain order. Needless to say, sorting a data frame with order() and sorting a data table with setkey() take very different amounts of time. You can guess who the winner is.
Keys are also essential components of subsetting and merging (joining) of data tables, making these operations extremely fast. Every DBA knows that to join two tables efficiently, the fields used for the join should be indexed. That's exactly what setkey() does for a data table.
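A minimal keyed-join sketch (the table names sales and accounts are my own toy example): once both tables are keyed on the join field, X[Y] looks up each row of Y in X using the key.

```r
library(data.table)

sales    <- data.table(id = c(1L, 2L, 2L, 3L), amount = c(10, 20, 5, 7))
accounts <- data.table(id = 1:3, name = c("ann", "bob", "cal"))

setkey(sales, id)
setkey(accounts, id)

# Keyed join: for each row of accounts, find the matching rows in sales
sales[accounts]
```

The result carries one row per match (id 2 appears twice), which mirrors how an indexed join behaves in a database.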
Oh, and I almost forgot to mention... to change the count of displayed records, use
print() with the second parameter specifying how many records at the top and at the bottom of the data table to display. If you think that
head() and tail() would work in a similar way, you are correct.
> print(dt, 3)
      buying maint doors persons lug_boot safety class
   1:  vhigh   med     2       4    small   high   acc
   2:  vhigh   med     2       4      med   high   acc
   3:  vhigh   med     2       4      big   high   acc
  ---
1726:    low   low 5more       4      big   high vgood
1727:    low   low 5more    more      med   high vgood
1728:    low   low 5more    more      big   high vgood
> head(dt, 3)
   buying maint doors persons lug_boot safety class
1:  vhigh   med     2       4    small   high   acc
2:  vhigh   med     2       4      med   high   acc
3:  vhigh   med     2       4      big   high   acc
> tail(dt, 3)
   buying maint doors persons lug_boot safety class
1:    low   low 5more       4      big   high vgood
2:    low   low 5more    more      med   high vgood
3:    low   low 5more    more      big   high vgood
>
These are some of the reasons why I use data.table for all my work. Just to recap: the speed, the intuitive ease of use (assuming you are comfortable with data.frame, of course), and the graceful fallback to data.frame when needed are three major advantages. There are other things about this package that I really like, and they will be covered in subsequent posts.