Indexing matrix elements in R


I came to science with a background in engineering, but most of my scientist friends didn't. I often have a hard time articulating why I'm so annoyed by one-based indexing, which R and MATLAB use but most other programming languages don't. Here's a recent example that might help.

For a simulation I’m running, I use the values in several of the columns of a data frame as indexes into separate vectors. Here’s an example of using indexes in one vector (instead of a column from a data.frame) to access the elements in another:

valueVector <- c(2, 5, 8, 3, 0)
indexVector <- c(2, 3, 3, 1, 4, 5, 5)
valueVector[indexVector]
## [1] 5 8 8 2 3 0 0

R veterans will point out that you can use factors for this, and that’s definitely true here. However, when the values in the small vector are changing often but the indices are relatively stable, I prefer this approach. Either strategy works.
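
For anyone curious, a minimal sketch of the factor-based version might look like this (the letter labels are invented purely for illustration); the factor's integer codes play the role of the index vector:

indexFactor <- factor(c("b", "c", "c", "a", "d", "e", "e"),
                      levels = c("a", "b", "c", "d", "e"))
valueVector[as.integer(indexFactor)]  # the integer codes are 2 3 3 1 4 5 5
## [1] 5 8 8 2 3 0 0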

Unfortunately, things aren’t so easy when the data is in a matrix (a 2D vector) and you want to access its elements using two index vectors (i.e., one indexing the matrix’s rows, and the second indexing its columns). R’s default behavior might not be what you expect:

valueMatrix <- matrix(LETTERS[1:15], ncol = 3)
valueMatrix
##      [,1] [,2] [,3]
## [1,] "A"  "F"  "K" 
## [2,] "B"  "G"  "L" 
## [3,] "C"  "H"  "M" 
## [4,] "D"  "I"  "N" 
## [5,] "E"  "J"  "O"
rowIndices <- c(4, 2, 2, 4, 4, 4)
colIndices <- c(2, 3, 3, 2, 3, 2)
valueMatrix[rowIndices, colIndices]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "I"  "N"  "N"  "I"  "N"  "I" 
## [2,] "G"  "L"  "L"  "G"  "L"  "G" 
## [3,] "G"  "L"  "L"  "G"  "L"  "G" 
## [4,] "I"  "N"  "N"  "I"  "N"  "I" 
## [5,] "I"  "N"  "N"  "I"  "N"  "I" 
## [6,] "I"  "N"  "N"  "I"  "N"  "I"

Instead of returning the six values associated with the six row/column pairs, it returns a 6 × 6 matrix with the elements of interest along the diagonal. When the number of elements is this small, it's easy to wrap this in a call to diag(); that approach isn't appropriate for long index vectors, since the intermediate matrix grows with the square of the number of index pairs (10,000 pairs would mean a 10,000 × 10,000 intermediate matrix).
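
For the record, that wrapped version looks something like this; it simply pulls the diagonal out of the 6 × 6 matrix printed above:

diag(valueMatrix[rowIndices, colIndices])
## [1] "I" "L" "L" "I" "N" "I"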

Readers who’ve done data manipulation in C are probably familiar with pointer arithmetic. Zero-based numbering makes it easy to combine row and column indexes and make a one-dimensional array behave like a matrix:

#include <stdlib.h>  /* for malloc() and free() */

/* When the code has to deal with matrices of varying size, you can't allocate 
   an array right away. */
double *valueMatrix;

/* Stuff happened, and now you know how big your matrix will be (nrow x ncol), 
   so you allocate memory on the “heap” to store the data. */
valueMatrix = (double *)malloc(sizeof(double) * nrow * ncol);

/* This is how you’d access specific elements. */
oldValueAtRowCol = valueMatrix[row + nrow*col];
valueMatrix[row + nrow*col] = newValueAtRowCol; 

/* Can’t forget this when you’re done; I don’t miss C. */
free(valueMatrix);
valueMatrix = 0;

/* NB: All programs work this way, but most scripting languages take care of 
   this stuff for you. Progress! */

To its credit, R makes this easy too; everything is stored as a one-dimensional vector behind the scenes, and can be accessed as such. R’s use of one-based indexing just makes it look a bit more awkward:

valueMatrix[rowIndices + nrow(valueMatrix) * (colIndices - 1)]
## [1] "I" "L" "L" "I" "N" "I"

I hope this is useful for anybody who wants to access matrix elements using index vectors. More importantly, maybe some scientists-turned-programmers can gain a bit of insight into why zero-based numbering makes sense to those of us who cut our teeth on C (or languages like it).


Update

After I posted this, Andrie de Vries sent a tweet sharing an alternate syntax that’s more R-like:

valueMatrix[cbind(rowIndices, colIndices)]
## [1] "I" "L" "L" "I" "N" "I"

I definitely agree that it looks better, but there is a slight performance hit from the call to cbind(). If you're repeatedly accessing a matrix with the same pair of index vectors, it might be worth storing the bound pair as a variable and reusing it. Here's the benchmark performance of each of the approaches I've discussed:

library("microbenchmark")

rowIndices <- sample(1:5, 1e4, replace=TRUE)
colIndices <- sample(1:3, 1e4, replace=TRUE)
boundIndices <- cbind(rowIndices, colIndices)

op <- microbenchmark(diag_matrix = diag(valueMatrix[rowIndices, colIndices]),
                     pointer_math = valueMatrix[rowIndices + nrow(valueMatrix) * (colIndices - 1)], 
                     array_indexing = valueMatrix[cbind(rowIndices, colIndices)], 
                     array_indexing_prebound = valueMatrix[boundIndices], 
                     times = 100)
print(op)
## Unit: microseconds
##                     expr         min           lq         mean      median           uq         max neval
##              diag_matrix 1692950.428 1984553.2750 2024360.4989 2031899.941 2056263.5955 2201740.926   100
##             pointer_math     243.356     251.9710     265.7848     260.419     272.5130     331.070   100
##           array_indexing     332.864     352.2625     373.6505     367.539     387.1940     544.230   100
##  array_indexing_prebound     149.192     154.6295     165.4170     160.201     170.5105     277.498   100

Indexing with the pre-bound pair is fastest, using arithmetic on the indexes is a close second, and calling cbind() inside the brackets is in third place. Creating the n × n matrix and extracting its diagonal is excessively slow (and uses up a lot of RAM), so don’t ever do that.