Why I use data.table. Part 1

Preface

The R package data.table showed up in my site-library in the summer of 2013. The problem I was working on at that time can be broadly described as a binary classification task, very similar to fraud detection. It was supposed to run overnight as part of the data warehouse loading jobs. Modeling took me a while, but I ended up with a surprisingly accurate model with a handful of meaningful predictors. Everything was looking great until I realized that my model would not scale. To achieve acceptable predictive accuracy I had constructed a rather complex set of features, and regular R data frames were just too slow to build them: the transformations needed to create the features took a very long time to run even on moderately sized data. Looking for alternatives led me to data.table. Since then I have routinely used data.table in all my work.

Just give me the average!

When you hear this phrase from your boss, just give her the average. You should stop pushing for the “right” way to measure an important business metric. Do not keep trying to convince her that averages are misleading, do not tell scary stories about averages that you read in textbooks, do not say that another metric is a better choice or that there are confounders we still need to account for, and do not bring up the many other things that you think you know better than she does. She will not listen anyway. Just give her the average and consider it done.

Playing with factors

I have always struggled with R factors. What are they? How do you manipulate them? More importantly, how should you think about them? Today I finally sat down and spent a few hours playing around, trying to understand them better. The results are below.

Verdict: dangerous