Last week I worked on a seemingly simple, almost trivial problem: mapping IP addresses to countries. Free services that return full geographical location data for a given IP address are well known, and some of them have APIs that can be called programmatically. Nothing out of the ordinary; people have been doing that for a long time. My problem was that I needed to do it fast and for a big data set, effectively adding a country to the stream of IP addresses coming from the online service. And I wanted to do it in R.
I had been aware of this problem for a long time, but had always managed to circumvent it somehow. It popped up again last week, and this time I couldn’t dodge it because another, more important problem required a solution to this old issue. It was time to tackle it head-on. The topic of today’s post is the definition of a user session. Continue reading User session – what is it?
The R package data.table showed up in my site-library in the summer of 2013. The problem I was working on at that time can be broadly described as a binary classification task, very similar to fraud detection. The model was supposed to run overnight as part of the data warehouse loading jobs. Modeling took me a while, but I ended up with a surprisingly accurate model with a handful of meaningful predictors. Everything was looking great until I realized that my model would not scale. To achieve acceptable predictive accuracy I had constructed a rather complex set of features. Regular R data frames were just too slow to build them; the transformations needed to create the features took a very long time to run even on moderately sized data. Looking for alternatives led me to data.table. Since then I have routinely used data.table in all my work. Continue reading Why I use data.table. Part 1