Playing with factors

I have always struggled with R factors. What are they? How to manipulate them? More importantly, how to think about them? Today I finally sat down and spent a few hours playing around trying to understand them better. The results are below.

Verdict: dangerous

Create

Let’s start with a character vector and convert it into a factor:

 
> continents.vec <- c("Africa", "Europe", "Asia", "North America", "South America", "Antarctica", "Australia") 
> continents.vec
[1] "Africa"        "Europe"        "Asia"          "North America" "South America" "Antarctica"    "Australia"    
> continents <- factor(continents.vec) 
> continents
[1] Africa        Europe        Asia          North America South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe North America South America
> 

The first thing to notice is that, in contrast with a vector, a factor carries a set of levels, and the levels are sorted alphabetically. The values themselves stay in their original order.

> levels(continents)
[1] "Africa"        "Antarctica"    "Asia"          "Australia"     "Europe"        "North America" "South America"
> 
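A minimal sketch of the same point, using a made-up vector x that is not from the session above: by default factor() appears to take the sorted unique values as levels and leaves the values in place.

```r
# Hypothetical example vector, chosen only for illustration
x <- c("pear", "apple", "fig", "apple")
f <- factor(x)

# Default levels are the sorted unique values of the input
default_levels <- levels(f)
same_as_sorted <- identical(default_levels, sort(unique(x)))

# The values themselves keep their original order
same_values <- identical(as.character(f), x)

print(default_levels)
print(c(same_as_sorted, same_values))
```

The sort order depends on the locale, so the comparison is made against sort() in the same session rather than against a hard-coded vector.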

To convert a factor back to a vector, I can simply use as.vector():

> as.vector(continents)
[1] "Africa"        "Europe"        "Asia"          "North America" "South America" "Antarctica"    "Australia"
>

It is possible to create a factor with the predetermined order of levels:

> o <- rev(levels(continents))  ## reversed order of levels(continents)
> o
[1] "South America" "North America" "Europe"        "Australia"     "Asia"          "Antarctica"    "Africa"
> continents2 <- factor(continents.vec, levels=o)  ## specify the levels ordering 
> continents2
[1] Africa        Europe        Asia          North America South America Antarctica    Australia    
Levels: South America North America Europe Australia Asia Antarctica Africa
> 

Notice that the order of the "underlying" vector of the factor continents2 is the same as in continents and continents.vec. Only the order of the levels has changed.
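One thing to watch when passing levels explicitly: any value that is missing from the levels argument silently becomes NA. A small sketch with a made-up vector, where "Asia" is left out on purpose:

```r
# Hypothetical input; "Asia" is deliberately absent from the levels
x <- c("Africa", "Europe", "Asia")
f <- factor(x, levels = c("Europe", "Africa"))

print(f)

# Count the values that did not match any level
n_missing <- sum(is.na(f))
print(n_missing)
```

No warning is raised here, so it may be worth checking anyNA(f) after building a factor from a hand-written level list.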

Update

The function levels() applied to a factor returns a vector that can be accessed by index. Accessing the factor itself by index results in something entirely different.

> levels(continents)[6]  ## The result is a character string
[1] "North America"
> continents[6]  ## The result is a factor with one element
[1] Antarctica
Levels: Africa Antarctica Asia Australia Europe North America South America
>

The last statement indexes the values of the factor, which are in the same order as continents.vec, the vector used to create continents.

> continents[1:7]
[1] Africa        Europe        Asia          North America South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe North America South America
> 
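The difference between the two kinds of indexing can be checked directly. This is a small sketch on a made-up three-element factor, not the continents factor itself:

```r
f <- factor(c("Africa", "Europe", "Asia"))   # levels: Africa Asia Europe

lev2 <- levels(f)[2]   # second level in sorted order: a character string
val2 <- f[2]           # second value in original order: a one-element factor

print(lev2)
print(val2)
print(c(is.character(lev2), is.factor(val2)))
```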

It looks like we should be able to modify an element of a factor f in two equivalent ways, by accessing it as levels(f)[x] where x is an index value, or directly as f[x]. This guess turns out to be wrong.

> levels(continents)[6]
[1] "North America"
> levels(continents)[6] <- "N.A." 
> continents
[1] Africa        Europe        Asia          N.A.          South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe N.A. South America
> continents[4]
[1] N.A.
Levels: Africa Antarctica Asia Australia Europe N.A. South America
> continents[4] <- "North America"  ## Change it back
Warning message:
In `[<-.factor`(`*tmp*`, 4, value = "North America") : invalid factor level, NA generated
> continents
[1] Africa        Europe        Asia          <NA>          South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe N.A. South America
>   

Not only did we fail to update the value, but we also lost data; our factor now contains NA. Notice that the levels remain intact: the sixth level is still "N.A.", so "North America" is not among the levels, and assigning a value that is not a level produces NA with a warning. Let’s restore the previous value. The right way to do it is to restore the level name first and then update the factor itself. This is documented behavior rather than a bug: an element of a factor can only be set to one of its existing levels.

> levels(continents)[6] <- "North America" 
> continents   
[1] Africa        Europe        Asia          <NA>          South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe North America South America
> continents[4] <- "North America" ## Now it works   
> continents     
[1] Africa        Europe        Asia          North America South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe North America South America
>   

Moral: To rename a level of a factor f, use levels(f)[x] <- "new value". Assigning to f[i] only works for values that are already levels.
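To avoid the accidental NA shown above, one option is to check the levels before assigning. The helper set_value() below is hypothetical (it is not part of base R); it appends the level when it is missing and then assigns the value.

```r
# Hypothetical helper: set f[i] to value, adding the level first if needed
set_value <- function(f, i, value) {
  if (!(value %in% levels(f))) {
    # Append at the end so that existing values keep their labels
    levels(f) <- c(levels(f), value)
  }
  f[i] <- value
  f
}

f <- factor(c("Africa", "Europe", "Asia"))
g <- set_value(f, 2, "Oceania")
print(g)
```

Note that "Europe" stays in levels(g) as an unused level after the replacement.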

Insert

It turns out that in order to add a new value to a factor f, we need to add this value to the vector levels(f) first and then add the new value to f itself. An important detail is that the new value has to be added to the end of levels(f), because levels are matched to the values by position.

> levels(continents) <- c(levels(continents),'Other')
> continents
[1] Africa        Europe        Asia          North America South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe North America South America Other
> continents[length(continents) + 1] <- 'Other'
> continents
[1] Africa        Europe        Asia          North America South America Antarctica    Australia     Other        
Levels: Africa Antarctica Asia Australia Europe North America South America Other
>    
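Why the new level has to go at the end can be seen on a small made-up factor: appending leaves the values alone, while putting the new level first shifts every label by one.

```r
f <- factor(c("Africa", "Europe", "Asia"))   # levels: Africa Asia Europe

good <- f
levels(good) <- c(levels(good), "Other")     # appended: values unchanged

bad <- f
levels(bad) <- c("Other", levels(bad))       # prepended: every label shifts

print(as.character(good))
print(as.character(bad))
```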

Removing a value from the factor f does not remove its level:

> continents <- continents[1:7]
> continents
[1] Africa        Europe        Asia          North America South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe North America South America Other
> 

In order to remove unused levels, rebuild the factor with f <- factor(f):

> continents <- factor(continents)
> continents
[1] Africa        Europe        Asia          North America South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe North America South America
> 
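Base R also provides droplevels() for this purpose. A brief sketch on a made-up factor, assuming the default alphabetical levels:

```r
f <- factor(c("Africa", "Europe", "Asia", "Other"))

g <- f[1:3]            # "Other" is gone from the values, but not from the levels
h <- droplevels(g)     # unused levels removed

print(levels(g))
print(levels(h))
```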

Merge

If a few levels have to be changed at the same time, the complete overwrite of the levels(f) is a viable option

> levels.old <- levels(continents) ## Save the old values 
> levels(continents) <- c(levels.old[1:5],"N.A", "S.A.") 
> continents
[1] Africa     Europe     Asia       N.A        S.A.       Antarctica Australia 
Levels: Africa Antarctica Asia Australia Europe N.A S.A.
> 

To merge two or more levels, overwrite them with the same value:

> levels(continents) <- c(levels.old[1:5],"Americas", "Americas") 
> continents
[1] Africa     Europe     Asia       Americas   Americas   Antarctica Australia 
Levels: Africa Antarctica Asia Australia Europe Americas
> 
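Overwriting by position requires knowing where each level sits. A variant that selects the levels by name may be less error-prone; the sketch below uses a made-up factor and a helper vector americas introduced only for this example.

```r
f <- factor(c("Africa", "North America", "South America", "Asia"))

# Names of the levels to merge (illustrative)
americas <- c("North America", "South America")

# Rename every matching level to the same label; duplicates are merged
levels(f)[levels(f) %in% americas] <- "Americas"

print(f)
print(nlevels(f))
```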

Here is an interesting case. We merged two levels, reducing the length of the vector levels(f) by one, but the length of the factor did not change.

> length(levels(continents))
[1] 6
> length(continents)
[1] 7
> 

In fact,

> data.frame(continents)
  continents
1     Africa
2     Europe
3       Asia
4   Americas
5   Americas
6 Antarctica
7  Australia
>

Moral: levels(f) is the set of distinct values that the factor f can take. Changing levels(f) does not affect the length of f.

Split

Levels of the factor continents are not sorted now because they were overwritten by an explicitly specified vector, not constructed implicitly from a vector as we did in the beginning: continents <- factor(continents.vec). It would be a mistake to do something like levels(continents) <- sort(levels(continents)). The assignment relabels the levels by position, so every value that pointed to a moved level silently changes.

> levels(continents)<- sort(levels(continents)) 
> continents
[1] Africa     Australia  Antarctica Europe     Europe     Americas   Asia      
Levels: Africa Americas Antarctica Asia Australia Europe
> 

Levels are sorted now, but the factor is entirely different. Notice how "Europe" became duplicated.

Restoring the original levels changes them as expected, but the factor still has a duplicated value "North America":

> levels(continents) <- levels.old 
> continents
[1] Africa        Europe        Asia          North America North America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe North America South America
> 

It has to be changed by accessing continents[5] directly.

> continents[5] <- "South America" 
> continents
[1] Africa        Europe        Asia          North America South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe North America South America
> 

Now everything seems to be back to normal.

Moral: Never use levels(f) <- ... to reorder levels or to assign a vector of a different length. The assignment relabels levels by position, so the consequences are hard to track.
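To reorder levels without relabeling the values, one option is to rebuild the factor with factor(f, levels = ...), the same pattern used earlier to create continents2. A small sketch on a made-up factor:

```r
# Made-up factor whose levels are not in alphabetical order
f <- factor(c("Africa", "Europe", "Asia"),
            levels = c("Europe", "Asia", "Africa"))

# Rebuild with sorted levels; values are matched by label, not by position
g <- factor(f, levels = sort(levels(f)))

print(levels(g))
print(as.character(g))
```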

Subsetting

Subsetting a factor results in another factor. The levels of the resulting factor remain the same, even if some of them are no longer used. To drop the unused levels, use drop = TRUE.

> continents[2:4]  ## Preserve levels
[1] Europe        Asia          North America
Levels: Africa Antarctica Asia Australia Europe North America South America
> continents[2:4, drop=TRUE] ## drop unused levels
[1] Europe        Asia          North America
Levels: Asia Europe North America
> 

Another way to drop unused levels is to rebuild the factor f <- factor(f). We used this pattern earlier.

> f1 <- continents[2:4] 
> f1
[1] Europe        Asia          North America
Levels: Africa Antarctica Asia Australia Europe North America South America
> f1 <- factor(f1) 
> f1
[1] Europe        Asia          North America
Levels: Asia Europe North America
> 
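Unused levels are not only cosmetic: functions such as table() report them with a zero count. A short sketch on a made-up factor:

```r
f <- factor(c("Africa", "Europe", "Asia", "Europe"))
sub <- f[f != "Africa"]            # "Africa" remains as an unused level

counts_kept <- table(sub)          # includes a zero count for Africa
counts_dropped <- table(factor(sub))

print(counts_kept)
print(counts_dropped)
```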

as.numeric()

The mapping between levels and factor values can be inspected with as.numeric(f), which returns the integer code of each value, that is, its position in levels(f).

> continents
[1] Africa        Europe        Asia          North America South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe North America South America
> as.numeric(continents)
[1] 1 5 3 6 7 2 4
> 

In other words, levels(f)[as.numeric(f)] returns the values of f in the "correct" order

> levels(continents)[as.numeric(continents)]
[1] "Africa"        "Europe"        "Asia"          "North America" "South America" "Antarctica"    "Australia"    
> continents
[1] Africa        Europe        Asia          North America South America Antarctica    Australia    
Levels: Africa Antarctica Asia Australia Europe North America South America
> as.vector(continents)
[1] "Africa"        "Europe"        "Asia"          "North America" "South America" "Antarctica"    "Australia"    
> 

This convention often leads to confusion if the factor values are numbers themselves.

> set.seed(123)
> nf <- factor(rpois(5,4)) 
> nf  ## factor of integers
[1] 3 6 3 6 7
Levels: 3 6 7
> as.numeric(nf)
[1] 1 2 1 2 3
>

The correct way to "extract numbers" or parse the numeric factor is to convert its values to the character vector first and then to numeric.

> as.numeric(as.character(nf))
[1] 3 6 3 6 7  
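The three conversions can be compared side by side. The sketch below uses fixed numbers rather than rpois() so that the result does not depend on the random number generator; the last form, as.numeric(levels(f))[f], is the alternative mentioned in the help page for factor.

```r
nf <- factor(c(3, 6, 3, 6, 7))

codes   <- as.numeric(nf)                 # integer codes, not the numbers
via_chr <- as.numeric(as.character(nf))   # the original numbers
via_lev <- as.numeric(levels(nf))[nf]     # same result, converts each level once

print(codes)
print(via_chr)
print(via_lev)
```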

Factors in data frames

Everything explained above remains valid when a factor is a column of a data frame. However, the data frame context sometimes leads to broader questions. As an illustration, let's take a look at the continents data frame with area given in thousands of square kilometers:

> continents.area
      continent    area
1        Africa 30370.0
2        Europe 10180.0
3          Asia 43820.0
4 North America 24490.0
5 South America 17840.0
6    Antarctica 13720.0
7     Australia  9008.5
> str(continents.area)
'data.frame':	7 obs. of  2 variables:
 $ continent: Factor w/ 7 levels "Africa","Antarctica",..: 1 5 3 6 7 2 4
 $ area     : num  30370 10180 43820 24490 17840 ...
> 
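The post does not show how continents.area was built. One way to reproduce it, as an assumption, is the call below. Since R 4.0.0 data.frame() no longer converts strings to factors by default, so stringsAsFactors = TRUE is set explicitly.

```r
# Assumed construction; the values are copied from the printout above
continents.area <- data.frame(
  continent = c("Africa", "Europe", "Asia", "North America",
                "South America", "Antarctica", "Australia"),
  area = c(30370, 10180, 43820, 24490, 17840, 13720, 9008.5),
  stringsAsFactors = TRUE   # make the continent column a factor
)

str(continents.area)
```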

Now let's say I want to display these data as a bar chart.

> library(lattice)
> barchart(area ~ continent, data=continents.area, main="Area, sq.km X 1000")
>

[Figure: continentsArea, the bar chart produced by the call above]

It does not look pretty. Continents are sorted alphabetically, which wastes one dimension. It would be better to sort them by area. The first and naive attempt does not do anything.

> barchart(area ~ continent
+         , data=continents.area[order(continents.area$area,decreasing = TRUE),]
+         , main="Area, sq.km X 1000")
>

Clearly, barchart() uses the levels ordering, not the records order of the data argument

> continents.area[order(continents.area$area,decreasing = TRUE),]
      continent    area
3          Asia 43820.0
1        Africa 30370.0
4 North America 24490.0
5 South America 17840.0
6    Antarctica 13720.0
2        Europe 10180.0
7     Australia  9008.5
>

The trick below rebuilds the factor continents.area$continent and sets levels in the decreasing order of continent's area. Notice that the record ordering of the data frame continents.area remains intact.

> continents.area$continent <- factor(continents.area$continent 
+        , levels=continents.area[order(continents.area$area,decreasing = TRUE)
+                                  , "continent"] 
+ ) 
> continents.area
      continent    area
1        Africa 30370.0
2        Europe 10180.0
3          Asia 43820.0
4 North America 24490.0
5 South America 17840.0
6    Antarctica 13720.0
7     Australia  9008.5
> continents.area$continent
[1] Africa        Europe        Asia          North America South America Antarctica    Australia    
Levels: Asia Africa North America South America Antarctica Europe Australia
> barchart(area ~ continent
+          , data=continents.area
+          , main="Area, sq.km X 1000")
>

[Figure: continentsAreaSorted, the bar chart produced by the call above]

This looks much better
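A shorter route to the same level ordering may be reorder() from the stats package, which sorts the levels of a factor by a summary of a second variable. The sketch below uses standalone vectors with the values from the table above rather than the data frame itself; negating area gives decreasing order.

```r
continent <- factor(c("Africa", "Europe", "Asia", "North America",
                      "South America", "Antarctica", "Australia"))
area <- c(30370, 10180, 43820, 24490, 17840, 13720, 9008.5)

# Levels ordered by increasing -area, i.e. by decreasing area
by_area <- reorder(continent, -area)

print(levels(by_area))
print(as.character(by_area))   # the values keep their original order
```

The result carries an extra "scores" attribute, which can be ignored for plotting.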

Cheers,
--Vlad
