I had been aware of this problem for a long time but always managed to circumvent it somehow. It popped up again last week, and this time I couldn’t dodge it: another, more important problem required a solution to this old issue. It was time to tackle it head-on. The topic of today’s post is the definition of a user session.

A user session is typically described by resorting to the concept of a web session. The simplest and most commonly accepted definition of a web session assumes that after 30 minutes of inactivity the user starts a new session. The 30 minutes is a de facto standard, but as far as I understand it is completely arbitrary and not supported by any relevant studies. Why 30? Why not 45 or 15?

Depending on the specific needs of a web site this inactivity interval can be adjusted. However, I am not aware of any guidelines to help choose the “right” value for it.
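For concreteness, the standard gap-based rule can be sketched in a few lines of R (the function `split_sessions` and its default threshold are mine, purely for illustration):

```r
## Split one user's event timestamps (in seconds) into sessions:
## any gap longer than the threshold starts a new session.
split_sessions <- function(times, inactivity_sec = 30*60) {
  times <- sort(times)
  gaps  <- c(0, diff(times))          # seconds since the previous event
  cumsum(gaps > inactivity_sec) + 1   # session number for each event
}

split_sessions(c(0, 100, 200, 4000, 4100))
## the first three events form session 1, the last two form session 2
```

Everything that follows is about choosing that threshold from the data rather than fixing it at 30 minutes.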

Web sessions are tracked with session cookies. These cookies are also conveniently used as an identification/authentication token when security is a concern. This double duty – tracking user activity and serving authentication purposes – poses a bit of a challenge.

The problem is that logging a user out because of her inactivity erases her session cookie, effectively ending the session. Even if she logs back in immediately after the session expires, all her activities after the new login become associated with a new session. User sessions are therefore fully determined by authentication sessions.

That by itself is not a big deal. The big deal for me was the session definition as a measure of user activity: longer sessions are naturally associated with better engagement and greater user satisfaction with the content we offer.

But how should the session be defined? I don’t want it to be tied to authentication (in fact, all our sessions are secure) or to an arbitrarily chosen 30-minute interval between consecutive events generated by the same user. I want my session definition to be based on the data I observe in the user population. Simply put, to be data driven.

After some pondering a sensible approach emerged. I illustrate it below by applying it to the DOBBS data set from the Online Browsing Behavior Study. Only the file dobbsSessionEvents.dat is used.

### Main idea

I look at various values of the “inactivity” parameter (session_break_sec in the code below) and select the value that results in the minimal variance of the session length across all users.

Reducing variance is a sensible way to take into account what the data represent. There is a catch though. As the allowed interval between user activities increases, the sessions obviously become longer, but the variance of the session length grows too. One way to address this difficulty is to use the coefficient of variation, defined as the ratio of the standard deviation to the mean ($CV=\sigma/\mu$), and look for its minimum. The inactivity interval that minimizes the coefficient of variation is the value I use for the session definition.

To make things more reliable I also compute a robust analog of the coefficient of variation called the quartile coefficient of dispersion (QCD), defined as $QCD=\frac{Q_3-Q_1}{Q_3+Q_1}$, where $Q_1$ and $Q_3$ are the first and third quartiles of the data set. See the corresponding Wikipedia articles for details on CV and QCD.
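To see how the two measures behave, here is a toy computation in base R (the durations are made up; note how the single large value inflates CV much more than QCD):

```r
durations <- c(40, 55, 60, 80, 300)  # hypothetical session lengths, seconds

cv  <- sd(durations) / mean(durations)        # coefficient of variation
q   <- quantile(durations, c(0.25, 0.75))     # Q1 and Q3
qcd <- unname((q[2] - q[1]) / (q[2] + q[1]))  # quartile coefficient of dispersion

round(c(cv = cv, qcd = qcd), 3)
## cv comes out close to 1, qcd below 0.2
```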

### Results

The following chart shows (normalized) CV and QCD of the session length as functions of the inactivity interval in seconds.

The values minimizing the CV and QCD of the session length are 108 and 175 seconds, respectively. Both are far below the 30 minutes commonly accepted in practice. Which one to choose in the end is a matter of taste or perhaps other considerations.

I selected the value suggested by QCD; I usually prefer robust methods.

### The code

The code computes CV and QCD for inactivity interval values between 1 and 3600 seconds. The function create_sessions() builds sessions given the inactivity interval as a parameter. I rely heavily on the properties of data.table objects here. The plot and the optimal values of the inactivity interval conclude the computations.

It’s a lot to take in, and I do use some tricks that may look obscure. Still, I hope it can be useful.

library(data.table)

dataFile <- "dobbsSessionEvents.dat"
## Extract dobbsSessionEvents.dat from the downloaded DOBBS archive first,
## e.g. with unzip(), so the file is available in the working directory

## Read file, select time and user_id. Convert time to POSIX time and set the key
tSet <- fread(dataFile, colClasses = "character" )[, list(V1, V3)]
tSet[, timestamp :=  as.POSIXct(strptime(V1, "%Y%m%d%H%M%S") ) ]
tSet[, V1 := NULL]
setnames(tSet, c("user", "timestamp"))
setkey(tSet, user, timestamp )  # Order by user and timestamp

## delete the .dat file if needed

## Create additional fields for each user
tSet[, `:=`( page_view_no    = 1:.N                    # page view number
           , page_view_count = .N                      # total page views
           , unix_time       = as.numeric(timestamp)   # UNIX time
             # seconds to the next page view; NA for the user's last record
           , time_diff       = c(as.numeric(diff(timestamp)), NA)
           ), by = user]

##  Function to create sessions
create_sessions <- function(data, session_break_sec = 30*60)
{
  ## Parameters:
  ##    - data:              the data table prepared above
  ##    - session_break_sec: inactivity interval in the session definition, seconds

  ## Utility function to generate session_id in increasing order
  session_ID <- function(d)
  {
    if (length(d) == 1)
      return(d)
    else {
      d0 <- d[d != 0]
      d1 <- c(d0[1], diff(d0))
      d2 <- 1:length(d1)
      rep(d2, d1)
    }
  }

  # Create fields
  data[, session_start := 0L]
  data[, session_end := 0L]

  # Mark the session end
  data[page_view_count == page_view_no, session_end := page_view_no]
  data[time_diff > session_break_sec, session_end := page_view_no, by = user]

  options(warn=-1)  # Suppress warnings
  # Infer the session start
  data[, session_start := c(0L, 1L + .SD$session_end[1:(.N-1)]), by = user]
  data[session_start == 1, session_start := 0L]
  data[page_view_no == 1, session_start := 1L]

  # Create session_id
  data[, session_id := session_ID(session_end), by = user]

  options(warn=0)
  data
}

## Compute CV and QCD of session duration for different values of the inactivity interval
CVdata <- data.table(do.call(rbind, lapply(1:3600, function(x) {
  d <- create_sessions(tSet, session_break_sec = x)
  sessions <- d[session_end == 0
                , list(session_duration = sum(time_diff))
                , by = list(user, session_id)]
  s_q <- sessions[, list( q25 = quantile(session_duration, 0.25)
                        , q75 = quantile(session_duration, 0.75))]
  sessions[, list( session_mean      = mean(session_duration)
                 , session_sd        = sd(session_duration)
                 , cv                = sd(session_duration)/mean(session_duration)
                 , qcd               = (s_q$q75 - s_q$q25)/(s_q$q75 + s_q$q25)
                 , session_break_sec = x
                 )]
})))

## Normalize to [0, 1]
CVdata[, cv := (cv - min(cv))/(max(cv) - min(cv))]

## Plotting
plot(x = CVdata$session_break_sec, y = CVdata$cv, type = "b", pch = 19,
     cex = 0.5, col = "red", xlab = "Inactivity interval, sec",
     ylab = "QCD, CV (normalized)", main = "CV and QCD vs inactivity interval")
lines(x = CVdata$session_break_sec, y = CVdata$qcd, type = "b", pch = 19,
      cex = 0.5, col = "blue")

# Find inactivity intervals that minimize CV and QCD of the session length
CVdata[which.min(cv)]
#    session_mean session_sd cv      qcd session_break_sec
# 1:     84.08016   91.10373  0 0.785124               108
CVdata[which.min(qcd)]
#    session_mean session_sd         cv       qcd session_break_sec
# 1:     169.2424   193.1865 0.04188886 0.5902778               175