Skip to content

Useful findInterval() function

Useful findInterval() function published on No Comments on Useful findInterval() function

Last week I worked on a seemingly simple, almost trivial problem – the mapping from IP addresses to country. Free services out there that return the full geographical location data given an IP address are well known. Some of them have API that could be called programmatically. Nothing out of ordinary; people were doing that for a long time. My problem was that I needed to do it fast and for a big data set, effectively adding a country to the stream of IP addresses coming from the online service. And I wanted to do it in R.

I downloaded the free IP-to-country data from maxmind.com and realized that first of all, I need to understand how IP addresses and network masks work. It took me a while but I ended up in a deep appreciation for the very smart design put together more than two decades ago.

The downside, of course, was that the bitmask manipulations, while are easily available in C, are not really what R is supposed to do. So my first implementation based on bitops package that replicated the bits operations found in C code was working, but it was too slow to keep up with the incoming stream. The problem was not the performance of the package itself, although there is always some room for improvements, but the way the algorithm works.

Each record given in the mapping file describes the block of IP addresses and the mask that represents the number of bits allocated to the block description. For example, the record 123.100.0.0/19 describes the IP block ranging from 123.100.0.0 to 123.100.31.255 because the mask 19 leaves 32 – 19 = 13 bits to describe “the hosts”. These addresses start with 123.100 and use 13 bits to distinguish themselves. Thirteen bits is 2^13 = 2^5 * 2^8 = 32 * 256, which gives us the ranges 0-31 and 0-255 if we recall that IP addresses are zero based. There are 32 * 256 = 8129 addresses in this block.

To find what IP block a given IP address belongs to, one needs to convert the address to its binary representation in the form of 32 bits. Then look through all IP blocks in their binary form and find the one that is equal to the results of ANDing the block’s bitmask with the binary representation of IP address.

You can envision it as a loop over IP blocks. Each iteration applies the mask of the current record and compares the result to the block itself. Once the match is found, we are done.

This mechanism works well for each individual address, but it breaks when there are many addresses to resolve. The double loop, one over the IP blocks, another over the IP addresses surely will not perform well. If I were in the database world, I would want to join my table with IP addresses to the table of IP blocks. The problem is that it is not a regular join, but something different. If the IP block are in the master table with the primary key “block IP” and the user IP is the child table with IP being the foreign key, then in order to join I would need the field “mask” from the master table applied to the IP in the child table and then join the result with the primary key in the master. No database can do that.

I know the solution; I saw it somewhere. Our database team came up with the idea to expand all IP blocks into individual IPs and then do the join. This is a pretty natural way any decent DBA should come up with after a short pondering. The master table would be much bigger since it would include all IP addresses available on the Internet, but it’s still conceivable, given enough CPU cycles and a fast storage (the records count: 2^32 = 4,294,967,296).

Not an option for me though.

All these experiments eventually led me to discover the R function findInterval(x, vec, ...). It does exactly what I needed to do. Its description says “Given a vector of non-decreasing breakpoints in vec, find the interval containing each element of x;“. The main idea is to store ranges of IP block addresses in vec and pass the vector of IP addresses to resolve as x. The returned values then will be cbinded to x and eventually to the country. It is not absolutely necessary, but seems more natural to convert the IP addresses to the their numeric form and operate on integers rather than strings.

One caveat: If the blocks have gaps between then, the IPs falling inside them will be matched with the end address of the preceding block. These unmatched records should be associated with the ‘Unknown’ country and then inspected for validity. For simplicity sake, the code below ignores these gaps assuming that the IP space present in the geoIP data is contiguous.

It is a downside of this simple approach, but hopefully, there are not too many gaps and the solution with findInterval(x, vec, ...) is adequate.

Utility function to convert IP address to its numeric form

convIP <- function(s)
{
  ## Checks are omitted 
  L <- strsplit(ip, split = '.', fixed=T)
  o <- do.call(rbind,L)  
  as.numeric(o[,4]) + 256   * as.numeric(o[,3]) 
                    + 256^2 * as.numeric(o[,2]) 
                    + 256^3 * as.numeric(o[,1])
}

Download and extract data files by URL

library(data.table)

dir <- 'GeoLite2-Country-CSV_20151103'
locationsfile <- 'GeoLite2-Country-Locations-en.csv'
blocksfile <- 'GeoLite2-Country-Blocks-IPv4.csv'

tempFile <- tempfile()
download.file("http://geolite.maxmind.com/download/geoip/database/GeoLite2-Country-CSV.zip",tempFile)
unzip(zipfile = tempFile, c(paste0(dir,'/',locationsfile), paste0(dir,'/',blocksfile)), junkpaths=T)
unlink(tempFile)

locations <- fread(locationsfile,sep=',')
blocks <- fread(blocksfile,sep=',')[,1:2,with=F]

unlink(locationsfile)
unlink(blocksfile)

## Merge locations and blocks
geoIP <- merge(blocks,locations,by="geoname_id")

## Get blocks' first address. Notice that strsplit() returns the list.    
geoIP$ip0 <- do.call(rbind, strsplit(geoIP$network, split='/', fixed=T))[,1]

Generate some fake IP addresses

IPs <- data.table(do.call(rbind,
                lapply(1:10000, 
                  function(x){  
                    paste(ceiling(runif(4)*254) ## ceiling() to make sure addresses don't start with 0
                         , collapse='.')})))
setnames(IPs,"ip")

Add numeric representations

geoIP[, IPnum := convIP(ip0)]
IPs[, IPnum := convIP(ip)]
setkey(geoIP,IPnum)
setkey(IPs,IPnum)

Leave a Reply

Your email address will not be published. Required fields are marked *