
Just give me the average!


When you hear this phrase from your boss, just give her the average. Stop pushing for the “right” way to measure an important business metric. Do not keep trying to convince her that averages are misleading, do not tell scary stories about averages you read in textbooks, do not argue that another metric is a better choice or that there are confounders still to account for, along with all the other things you think you know better than she does. She would not listen anyway. Just give her the average and consider it done.

A captain steering the ship through the darkness of a stormy night does not need to know all the little details of how GPS works. Our ancestors became successful sailors by relying on stars fixed to the surface of a celestial sphere and on hard-earned knowledge of the winds and currents of the sea. This is the picture your boss has in her head – a captain sailing the open sea, relying on her awareness, courage and intuition. Don’t burst her bubble by telling her that the Earth is not flat and that some stars are actually much bigger than the Sun. She has been using averages for her whole career and they have worked every time. There is no need for anything else.

And this is OK. After all, averages and the 80/20 rule have been, and still are, widely used to determine a company’s future. Indeed, they seem to have worked almost every time in the past.

On the other hand, I know someone who argues that each request for averages coming from the top of the corporate ladder has to be answered with seven numbers: the mean, standard deviation, minimum, maximum, median, and the lower and upper quartiles (i.e. the 25th and 75th percentiles). In some cases, including a histogram and a kernel density estimate with the complete specification of the kernel function is also desirable. I am not kidding. I respect the guy, but I disagree with his approach. It’s overkill for sure and hardly adds any value. Quite the opposite: it obscures the captain’s vision with a clutter of utterly unnecessary details.

In this post I am going to discuss a few methods I found useful in answering questions about “the average”. This is not to say that I actually report each of them back to the boss; I use them to draw my own conclusions and to be more confident in my analyses.

Nasty outliers

By definition, outliers are observations that are poorly described by the model. The model can take different forms; most often it is a distribution used to describe the data. As such, whether a particular observation is treated as an outlier depends on the model of choice, or ultimately on the point of view. Consider two ways to display the same data, produced by the following script:

> set.seed(123)
> s0 <- c(rgamma(2000, 4,5), rgamma(3000,4,1) )
> par(mfrow=c(2,2))
> boxplot(s0,main="boxplot of s0")
> hist(s0)
> boxplot(log1p(s0),main="Boxplot of log1p(s0)")
> hist(log1p(s0))
> par(mfrow=c(1,1))

[Figure: boxplots and histograms of s0, on the original and log1p scales]

The boxplot in the first row shows some outliers in the upper tail of the sample. The boxplot in the second row displays the same data on the logarithmic scale and does not show any outliers at all. The histogram confirms that log1p(s0) is nicely confined to the small interval [0, 3], and it also correctly suggests that the distribution is bimodal – a fact that is difficult to spot on the charts in the first row.

But these are not the outliers I am looking for…

What I mean by “nasty outliers” are observations that are most likely data errors or outright impossible things: negative values for counts, or web sessions lasting more minutes than anybody could plausibly spend browsing. In the latter case, however, it is conceivable that the extremely long sessions were generated by some kind of robot or crawler. They could be of practical interest in their own right, but they should be treated as nasty outliers in the context of real people’s activity on the web site.

The problem with nasty outliers is that even a single one can pull the average far away from the bulk of the data:

> mean(c(s0))
[1] 2.722553
> mean(c(s0, -50000))
[1] -7.275992
> 
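Before reaching for robust statistics, the crudest defense is a sanity filter that simply drops impossible values. Here is a minimal sketch with made-up data and hand-picked cut-offs (the variable names and thresholds are purely hypothetical):

> ## hypothetical session lengths in minutes, polluted with two impossible values
> sessions <- c(rexp(1000, rate = 1/10), -2, 60000)
> ## keep only non-negative sessions shorter than a day
> clean <- sessions[sessions >= 0 & sessions <= 24*60]
> mean(sessions)  ## dragged up by the 60000-minute “session”
> mean(clean)     ## close to the typical ~10 minutes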

Median

One way to reduce the influence of any type of outliers on the location estimate is to change the point of view. The boss is asking for the average because she is looking for a value that is “correct on average”. In fact, she is looking for a number that “sits in the middle”, i.e. for a typical representative point given all the data. This is exactly how the sample median is defined: order all values in increasing order and pick the middle one as “typical”. Notice that the median is not an estimate of the distribution mean; it is just the number “sitting in the middle”. In statistical parlance, both the median and the sample average are estimates of the distribution location. Statisticians say that the median is a robust estimate of the location, meaning that no matter what is going on in the tails (i.e. where all the outliers are), the median stays the same. In other words, the median is a stable estimate, insensitive to outliers and to other data properties that do not change the location.

Here it is:

> median(s0)
[1] 2.110506
> median(c(s0,-50000))
[1] 2.108144
> median(c(s0,-500000000))
[1] 2.108144
> median(c(s0,500000000))
[1] 2.112868
> 

Note: Interestingly, both the mean and the median of a sample are solutions to optimization problems: the sample mean is the value of c that minimizes the sum of squared deviations Σ(xᵢ − c)², while the median minimizes the sum of absolute deviations Σ|xᵢ − c|.
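A quick way to see this in R is to minimize both criteria numerically (a minimal sketch; up to the optimizer’s tolerance the results should land near 2.72 and 2.11, the mean and median of s0 computed above):

> ## the mean minimizes the sum of squared deviations
> optimize(function(c) sum((s0 - c)^2), range(s0))$minimum
> ## the median minimizes the sum of absolute deviations
> optimize(function(c) sum(abs(s0 - c)), range(s0))$minimum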

Trimmed mean

There is another simple way to reduce the influence of the tails – ignore them. Knowing that the bad stuff that drags the mean up or down lives in the tails, let’s simply get rid of them. On closer inspection this approach is not as simple as it sounds, and it takes quite a bit of math to work out how much to cut from both ends and how to confirm that the procedure is effective. Long story short, a commonly used amount of cutting is 20% from each end of the sample. The common term for this operation is “trimming”, so we are dealing here with the so-called trimmed mean. It is widely used, and it is no coincidence that the mean() function in R takes a second parameter specifying how much to trim (from each side of the sample).

> mean(s0,trim=.2)
[1] 2.259698
> mean(c(s0, -50000),trim=.2)
[1] 2.254461
> 

Yes, that’s right. With trim=0.2 only the middle 60% of the sample is used, i.e. almost half of the data is entirely ignored. No need to worry though: 20% trimming is a commonly accepted practice, and there is deep theory supporting this choice.

Notice also that mean(x, trim=0) = mean(x), i.e. trim=0 means no trimming at all, and mean(x, trim=0.5) is equal to median(x).
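Both identities are easy to check directly; in current R implementations a trim of 0.5 or more falls back to the median exactly, so each comparison below should print TRUE:

> mean(s0, trim = 0)   == mean(s0)    ## no trimming: the plain mean
> mean(s0, trim = 0.5) == median(s0)  ## maximal trimming: exactly the median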

Confidence interval

A really discouraging and confusing fact is that all our estimates so far are different. Which one is “right”? Even worse, different amounts of trimming give different estimates. It is hard to have confidence in our results when they all tell different stories.

One way to address this discomfort is to realize that any estimate taken by itself is meaningless. Let’s say it is 42. Now what? The point is that any number only makes sense in a given context. For example, when we want to compare two data sets, we compute two estimates and evaluate the magnitude of their difference. It follows that as long as we use the same estimation method for both data sets, our comparison has a chance to be valid (there is more to say about comparisons, but I’ll stop here).

Another way to deal with uncertainty is to quantify it. We build more confidence in our results by knowing how confident we are. It sounds weird, but this is exactly how it works. We are going to discuss confidence intervals that cover the “true” value of the mean, which is one way to quantify the confidence… or the uncertainty. There are several ways of doing this, and unfortunately the exact meaning of the results is not quite intuitive (as opposed to the similar concept of a credible interval, which enjoys a straightforward interpretation). Without going into too much fine detail, I am going to explain how to use the bootstrapping technique to compute a confidence interval for the distribution mean. I use bootstrapping a lot because its use and validity do not depend on strong distributional assumptions, and its applicability extends far beyond confidence intervals for the sample mean.

Warning: Bootstrapping is a powerful weapon, but it is not a silver bullet. It is very easy to use the bootstrap incorrectly and get invalid results. Make sure you know what you are doing before writing the code.

Bootstrapping for confidence intervals

The idea of the bootstrap is to repeatedly resample the data with replacement, collecting the statistic of interest each time. For the estimate of the mean, the following script creates a new data set sample.boot of size B=10000. Each point of this set is the mean of one resampled version of s0 created by the function sample() with the parameter replace=TRUE.

> B <- 10000 ## How many bootstrap samples
> sample.boot <- sapply(1:B, FUN = function(x) mean(sample(s0, replace = TRUE)))
> hist(sample.boot)
> 

[Figure: histogram of sample.boot, the bootstrapped means]

Taking the mean of sample.boot does not really give us much. The real advantage of the method is that each observation in sample.boot is the mean of a data set just slightly different from s0. This “slight difference” is the key ingredient of bootstrapping: it provides just enough variability to make solid statistical inference about the mean of the population represented by the sample s0.
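The spread of sample.boot is useful in its own right: its standard deviation is the bootstrap estimate of the standard error of the mean. As a quick sanity check (a sketch, not part of the original output), it should be in the same ballpark as the classical formula:

> sd(sample.boot)            ## bootstrap estimate of the standard error of the mean
> sd(s0) / sqrt(length(s0))  ## classical estimate, should be close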

More precisely, and switching to statistical terminology, the set s0 is a sample drawn from a bigger population with an unknown mean M. We do not know M and want to think of it as if it were a random variable. In this interpretation, sample.boot is a sample drawn from the distribution of M, and the 95% confidence interval for M can be estimated as the 2.5th and 97.5th percentiles of sample.boot.

> quantile(sample.boot, c(0.025,0.975))
    2.5%    97.5% 
2.476150 2.847209 
> 

Therefore the estimate of the 95% confidence interval for M is [2.476, 2.847].

The R package boot has everything you would need for bootstrapping. It is worth exploring further, but for now I will just replicate the same calculation using the functions boot() and boot.ci().

> library(boot)
> sample.mean <- function(d, i) mean(d[i])  ## the statistic: mean of the resample selected by index vector i
> sample.boot <- boot(s0,sample.mean,R=B) 
> boot.ci(sample.boot)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = sample.boot)

Intervals : 
Level      Normal              Basic         
95%   ( 2.472,  2.853 )   ( 2.468,  2.848 )  

Level     Percentile            BCa          
95%   ( 2.476,  2.856 )   ( 2.482,  2.863 )  
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(sample.boot) :
  bootstrap variances needed for studentized intervals
> 

Multimodality and mixtures

In some cases averages just do not make any sense. For example, as an old joke goes, the average person has a single boob and one testicle. Albeit “the average”, such a person is clearly not a “common” sight. The problem with averages illustrated by this example stems from failing to take heterogeneity into account: it simply ignores the fact that the human population is a mixture of males and females. These two groups have different characteristics, and averaging over both, although mathematically sound, provides little practical value.

The same situation happens with continuous variables. For example, the sample mean of the bimodal distribution shown below is 0.5013. Although this number does represent “the average”, it clearly is not representative of the data.

> set.seed(121) 
> s1  <-  c( rnorm(3000,-0.5,0.2) , rnorm(3000,1.5,0.6) )
> hist (s1 ,breaks=40,main="Example of bimodal distribution",xlab="data")
>

[Figure: histogram of s1, an example of a bimodal distribution]

When a distribution clearly has more than a single mode, it is always beneficial to try to understand the source of the multimodality. Then the data components can be separated and studied individually, as in the case of males and females. When this is not possible, mixture modeling comes to the rescue. These models are useful in many other situations as well because of their universality: the main goal is to represent a complex distribution as a (finite) mixture of simpler ones. The most popular model is the Gaussian mixture, i.e. a model that represents the data as a few Gaussian components mixed together in some proportions. Real-life data are never exactly Gaussian, yet a representation in this form often offers a tremendous advantage. An alternative way to look at mixtures is to interpret them as finding clusters in the data. The R package I typically resort to is Rmixmod. Let’s use it on our bimodal distribution.

> Rmixmod::mixmodCluster(s1, 2)
****************************************
*** INPUT:
****************************************
* nbCluster =  2 
* criterion =  BIC 
****************************************
*** MIXMOD Models:
* list =  Gaussian_pk_Lk_C 
* This list includes only models with free proportions.
****************************************
* data (limited to a 10x10 matrix) =
 [1] -0.6857  -0.4116  0.01116  -0.5688  -0.5654  -0.4467  -0.2363  -0.09717 -0.4587  -0.4459 
* ... ...
****************************************
*** MIXMOD Strategy:
* algorithm            =  EM 
* number of tries      =  1 
* number of iterations =  200 
* epsilon              =  0.001 
*** Initialization strategy:
* algorithm            =  smallEM 
* number of tries      =  50 
* number of iterations =  5 
* epsilon              =  0.001 
* seed                 =  NULL 
****************************************


****************************************
*** BEST MODEL OUTPUT:
*** According to the BIC criterion
****************************************
* nbCluster   =  2 
* model name  =  Gaussian_pk_Lk_C 
* criterion   =  BIC(12607.9392)
* likelihood  =  -6282.2208 
****************************************
*** Cluster 1 
* proportion =  0.4996 
* means      =  -0.4983 
* variances  =  0.0423 
*** Cluster 2 
* proportion =  0.5004 
* means      =  1.5012 
* variances  =  0.3616 
****************************************
>

The output tells us that there are two components, or clusters, mixed in roughly 50/50 proportions; the means are estimated as -0.4983 and 1.5012, and the variances as 0.0423 and 0.3616 respectively. Compare this with the data generation mechanism, 0.5 * N(-0.5, 0.2) + 0.5 * N(1.5, 0.6), and recall that the variances are squared standard deviations: 0.2^2 = 0.04 and 0.6^2 = 0.36.
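Once the components are identified, even a crude split lets you report an average per group, which says much more than the overall mean of 0.5013. Here is a minimal sketch with a hand-picked cut point between the two fitted means (the threshold 0.5 is my own choice, not something the package reports):

> cut.point <- 0.5                  ## rough boundary between the two components
> tapply(s1, s1 > cut.point, mean)  ## per-component averages, near -0.5 and 1.5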

This case is obviously very simple, if not trivial, because exactly two well-separated Gaussians were used to generate the data. The package Rmixmod, however, is capable of much more sophisticated analyses, including mixtures of multivariate multinomial distributions and multivariate Gaussians with nontrivial covariance matrices.

In practice, when the distribution your boss is asking about is multimodal, it is a good idea to understand its components and tie them back to the business. Discovered multimodality is a great opportunity to tell the boss something she does not know yet. Just do not forget that, first of all, you have to give her “the average” she asked for!

Cheers,
–Vlad
