Useful findInterval() function

Last week I worked on a seemingly simple, almost trivial problem: mapping IP addresses to countries. Free services that return full geographical location data for a given IP address are well known. Some of them have APIs that can be called programmatically. Nothing out of the ordinary; people have been doing that for a long time. My problem was that I needed to do it fast and for a big data set, effectively adding a country to a stream of IP addresses coming from an online service. And I wanted to do it in R.

User session – what is it?

I had been aware of this problem for a long time but always managed to circumvent it somehow. It popped up again last week, and this time I couldn’t dodge it: another, more important problem required a solution to this old issue. It was time to tackle it head-on. The topic of today’s post is the definition of a user session.

Why I use data.table. Part 1

The R package data.table showed up in my site-library in the summer of 2013. The problem I was working on at that time can be broadly described as a binary classification task, very similar to fraud detection. It was supposed to run overnight as part of the data warehouse loading jobs. Modeling took me a while, but I ended up with a surprisingly accurate model with a handful of meaningful predictors. Everything was looking great until I realized that my model would not scale. To achieve acceptable predictive accuracy I had constructed a rather complex set of features. Regular R data frames were just too slow to build them; the transformations needed to create the features took a very long time to run even on moderately sized data. Looking for alternatives led me to data.table. Since then I have routinely used data.table in all my work.

Just give me the average!

When you hear this phrase from your boss, just give her the average. Stop pushing for the “right” way to measure an important business metric. Do not try to convince her that averages are misleading, do not tell scary stories about averages that you read in textbooks, do not say that another metric is a better choice or that there are confounders we still need to account for, and do not bring up the many other things you think you know better than she does. She will not listen anyway. Just give her the average and consider it done.

Playing with factors

I have always struggled with R factors. What are they? How to manipulate them? More importantly, how to think about them? Today I finally sat down and spent a few hours playing around trying to understand them better. The results are below.

Verdict: dangerous