Writing Faster R with Vectorization and the {apply} family

Gus Lipkin
Mar 14, 2022

One of my favorite things about R is that there are a lot of ways to do the same thing. Of course, this means that some ways are better than others depending on the use case. for loops, the apply family, and vectorization are all common ways to write code that processes large amounts of data in R, but it can be tricky to know when and how to use each one.

I’ve divided this post into two parts: how to use each method in R, and a few examples of when you might want to use each one. I close everything out with a short benchmark demonstration comparing the three.

What is a(n)…

`for` loop

If you’re familiar with programming, you can probably skip this section.

A for loop lets you run the same code a specified number of times. The structure generally follows for(x in y) where x represents an item in y. If we think about a shopping basket with some apples, bananas, and carrots, we could write for(food in basket) and food would represent each item in our basket. It would be apples the first time, bananas the second time, and carrots the third time. We could also write it as for(food in 1:length(basket)) where 1:length(basket) is a vector of numbers that counts the items in your basket. Rather than food representing an item in your basket, it represents an index in the vector. In this example, apples are at index 1, bananas at 2, and carrots at 3. for loops are also very flexible and can be used on many data types such as vectors, data.frames, and matrices.
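The two loop styles described above can be sketched as follows. The basket here is a plain character vector, and seq_along(basket) is used as a safer stand-in for 1:length(basket), since it produces an empty sequence (and so zero iterations) when the basket is empty.

```r
basket <- c("apples", "bananas", "carrots")

# Iterate over the items themselves: food is "apples", then "bananas", ...
for (food in basket) {
  print(food)
}

# Iterate over the positions: i is 1, then 2, then 3
for (i in seq_along(basket)) {
  print(paste(i, basket[i]))
}
```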

Let’s say you have a data.frame called basket with three columns: Food, the name of the food; PricePerUnit, the unit cost of each food; and Quantity, the number of units of each food in your basket. It looks like this:

And it can be recreated with this:
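The original table and code are not reproduced here, so this is only a minimal sketch. The column names come from the description above, but the prices and quantities are made-up values chosen for illustration.

```r
# Hypothetical values: only the column names come from the article
basket <- data.frame(
  Food = c("apples", "bananas", "carrots"),
  PricePerUnit = c(0.50, 0.25, 0.10),
  Quantity = c(4, 6, 10)
)

print(basket)
```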

If we wanted to get the total cost of everything in our basket, we could iterate over each row, multiplying the PricePerUnit by the Quantity and adding the result to a running total.
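One way that loop might look, using a basket with hypothetical prices and quantities (an assumption, not the original data):

```r
# Hypothetical basket: the values are invented for illustration
basket <- data.frame(
  Food = c("apples", "bananas", "carrots"),
  PricePerUnit = c(0.50, 0.25, 0.10),
  Quantity = c(4, 6, 10)
)

total <- 0
for (i in seq_len(nrow(basket))) {
  # Add this row's cost to the running total
  total <- total + basket$PricePerUnit[i] * basket$Quantity[i]
}

print(total)
```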

`apply` family

The apply family is part of base R and is very similar to a for loop. Rather than running a set number of times, an apply function runs a function on each item in a data.frame, list, vector, or other object that can be applied over. While the apply family has several other members (such as vapply, mapply, and tapply), I’m only going to talk about the three most common: apply, lapply, and sapply.

The biggest differences between the three are the types of input they accept and the types of output they return. apply takes a data.frame or matrix and has three main arguments. The first argument, X, is the object we’re passing to it. The second argument, MARGIN, is 1, 2, or c(1, 2), and says whether we want the function applied to rows, columns, or both rows and columns, respectively. The last argument, FUN, is the function to apply. lapply and sapply work the same way, except that they don’t have the second argument, because they take a vector or list, which has only one dimension to iterate over. lapply always returns a list, while sapply simplifies its result to a vector or matrix when it can. Generally speaking, the apply family will return a vector, list, or array of some kind.
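A small sketch of those differences, using a toy matrix and list that are not part of the original post:

```r
m <- matrix(1:6, nrow = 2)   # two rows, three columns, filled column by column

row_sums <- apply(m, 1, sum) # second argument 1: apply sum to each row
col_sums <- apply(m, 2, sum) # second argument 2: apply sum to each column

items <- list(a = 1:3, b = 4:6)
as_list <- lapply(items, sum)   # always returns a list
as_vector <- sapply(items, sum) # simplifies to a named vector when it can

print(row_sums)
print(col_sums)
print(as_list)
print(as_vector)
```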

If we go back to the shopping basket example, we can calculate the total with an apply function. Our first argument is the basket, the second is a 1 because we want to apply to every row, and the last is the function call. We can create the function in the apply call or we can create it earlier and then call it here.
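A minimal sketch of that call, assuming a basket with made-up values. One detail worth knowing: apply converts a data.frame to a matrix first, and a matrix can only hold one type, so passing the whole basket (including the text column Food) would turn the numbers into text. This sketch sidesteps that by passing only the two numeric columns; the function name row_cost is hypothetical.

```r
# Hypothetical basket: the values are invented for illustration
basket <- data.frame(
  Food = c("apples", "bananas", "carrots"),
  PricePerUnit = c(0.50, 0.25, 0.10),
  Quantity = c(4, 6, 10)
)

# Option A: define the function inside the apply call
row_totals <- apply(basket[, c("PricePerUnit", "Quantity")], 1, function(row) {
  row[["PricePerUnit"]] * row[["Quantity"]]
})

# Option B: define the function earlier, then pass it by name
row_cost <- function(row) row[["PricePerUnit"]] * row[["Quantity"]]
row_totals_b <- apply(basket[, c("PricePerUnit", "Quantity")], 1, row_cost)

total <- sum(row_totals)
print(total)
```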

A quick note on function calls in the apply family:

If the function you are calling only takes one argument, the call can be written in three ways. Here, f stands in for the name of a function.

  1. sapply(X, function(x) { ... }) if the function is not predefined
  2. sapply(X, f) if f is predefined
  3. sapply(X, function(x) f(x)) if f is predefined

Option two is most common for built-in functions such as sum or as.numeric, but can be used with any function.
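The three forms in the list above can be sketched with the built-in sum standing in for the function; the list X is a toy input. All three calls give the same result.

```r
X <- list(1:3, 4:6)

# 1. Anonymous function written inside the call
inline <- sapply(X, function(x) { sum(x) })

# 2. Existing function passed by name
by_name <- sapply(X, sum)

# 3. Existing function wrapped in an anonymous function
wrapped <- sapply(X, function(x) sum(x))

print(inline)
print(by_name)
print(wrapped)
```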

Vector Operations

Vector operations are not a function like the members of the apply family, or a control structure like a for loop, but rather a feature of the R language. Instead of operating on a vector one item at a time in your own code, R performs the operation on the entire vector in a single expression. Back to the basket example: the per-item total is PricePerUnit multiplied by Quantity, and the grand total is the sum of those values.
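As a sketch, again with a basket of made-up values, the whole calculation fits in two short expressions:

```r
# Hypothetical basket: the values are invented for illustration
basket <- data.frame(
  Food = c("apples", "bananas", "carrots"),
  PricePerUnit = c(0.50, 0.25, 0.10),
  Quantity = c(4, 6, 10)
)

# Multiply the two columns element by element, then add up the results
item_totals <- basket$PricePerUnit * basket$Quantity
total <- sum(item_totals)

print(item_totals)
print(total)
```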

When should I use a(n)…

These examples are not exhaustive and you may find some cases where one is better than the others even where it seems like it might not be.

`for` loop

for loops in R should usually be a last resort. They are often much slower than the apply family and vectorized code, especially when the loop grows its result one element at a time. They may be helpful when each iteration relies on the one before it, although then you might want to look into a recursive function if possible. You might also find a for loop useful if you need to run the same block of code multiple times or iterate over the elements of an object in a non-standard way, such as every other item. Any code that can be written with an apply function or a vector operation can also be written as a for loop.
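Two of those situations, sketched with toy data that is not from the original post: a loop where each step needs the previous result, and a loop over every other item.

```r
# Each value depends on the one before it, so the steps must run in order
n <- 10
balance <- numeric(n) # preallocate the result instead of growing it
balance[1] <- 100
for (i in 2:n) {
  balance[i] <- balance[i - 1] * 1.05
}
print(balance)

# Visit every other item by stepping through the positions in twos
basket <- c("apples", "bananas", "carrots", "dates", "eggs")
for (i in seq(1, length(basket), by = 2)) {
  print(basket[i])
}
```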

`apply` family

The apply family should be used when you want to operate on each element of an object, but treat the elements individually. Typical cases are a list whose items are vectors of differing lengths, or a situation where you need a specific type of output. Any vector operation can be written as an apply statement, but not all for loops can be converted.
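A short sketch of both cases, using a made-up list named scores. vapply is another member of the apply family that lets you state the expected output type up front.

```r
# A list whose items have different lengths
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

means <- sapply(scores, mean)               # named numeric vector
counts <- vapply(scores, length, integer(1)) # output must be one integer each
ranges <- lapply(scores, range)             # list of min/max pairs

print(means)
print(counts)
print(ranges)
```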

Vector Operations

Vector operations are the gold standard. They are fast and can be used in many cases, but not all. Most common use cases will be on vectors or columns of a data.frame. Many base functions such as sum and as.numeric are already vectorized. Many but not all for loops and apply functions can be written as vectorized operations.

Benchmarks

Building the input

Rather than use the simple shopping basket example from before, I’ve written a small function that takes a data.frame of red, green, and blue values and adds a new column with the corresponding hex code.

And the resulting data should look like this:

We’ve also created a vector of values that can go in a hex code, with numbers 0–9 and letters A–F.
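The original input-building code is not reproduced here. As a rough sketch, the input could be built like this; the names colors and hex, the seed, and the row count are all assumptions made for illustration.

```r
set.seed(42) # hypothetical seed so the sketch is repeatable
n <- 1000    # hypothetical number of rows

# One row per color, with each channel between 0 and 255
colors <- data.frame(
  red = sample(0:255, n, replace = TRUE),
  green = sample(0:255, n, replace = TRUE),
  blue = sample(0:255, n, replace = TRUE)
)

# The sixteen hex digits; the numbers are converted to text automatically
hex <- c(0:9, LETTERS[1:6])

print(head(colors))
print(hex)
```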

Creating the conversion function

I used this website for the math behind my functions. In essence, you divide each number by 16 and round down; the result corresponds to the position of the first hex digit. You then take the remainder of the division, which corresponds to the position of the second hex digit. If our value is 227, then 227/16 rounds down to 14 and the remainder is 3. Because vectors in R start at position 1, we add one to both, giving 15 and 4. The corresponding values in hex are E and 3, so the hex pair for 227 is E3.
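The worked example for 227 can be checked directly. The variable names here are hypothetical; %/% is R's integer division operator and %% is its remainder operator.

```r
hex <- c(0:9, LETTERS[1:6]) # "0" to "9", then "A" to "F"

value <- 227
first_digit <- value %/% 16 # divide by 16 and round down
second_digit <- value %% 16 # remainder of the division

# Add one to each because R vectors start at position 1
hex_pair <- paste0(hex[first_digit + 1], hex[second_digit + 1])
print(hex_pair)
```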

Implementing the conversion function

In a for loop
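The original implementation is not reproduced here. What follows is a minimal sketch of a loop-based version, with a hypothetical helper to_hex_pair and a three-row input invented for illustration.

```r
hex <- c(0:9, LETTERS[1:6])

# Hypothetical helper: turns a value from 0 to 255 into its two hex digits
to_hex_pair <- function(value) {
  paste0(hex[value %/% 16 + 1], hex[value %% 16 + 1])
}

# Small made-up input
colors <- data.frame(
  red = c(227, 0, 255),
  green = c(66, 128, 255),
  blue = c(52, 255, 255)
)

# Preallocate the new column, then fill it one row at a time
colors$hex <- character(nrow(colors))
for (i in seq_len(nrow(colors))) {
  colors$hex[i] <- paste0(
    "#",
    to_hex_pair(colors$red[i]),
    to_hex_pair(colors$green[i]),
    to_hex_pair(colors$blue[i])
  )
}

print(colors)
```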

In an apply function
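Again as a sketch rather than the original code, an apply-based version might process one row at a time like this. The input is a small made-up data.frame.

```r
hex <- c(0:9, LETTERS[1:6])

# Small made-up input
colors <- data.frame(
  red = c(227, 0, 255),
  green = c(66, 128, 255),
  blue = c(52, 255, 255)
)

# Each row arrives as a vector of three numbers (red, green, blue)
colors$hex <- apply(colors[, c("red", "green", "blue")], 1, function(row) {
  pairs <- paste0(hex[row %/% 16 + 1], hex[row %% 16 + 1])
  paste0("#", paste0(pairs, collapse = ""))
})

print(colors)
```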

In a vectorized function
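A vectorized sketch, under the same assumptions (hypothetical helper name, made-up input). Because indexing, %/%, %%, and paste0 all work on whole vectors, the helper can take an entire column at once.

```r
hex <- c(0:9, LETTERS[1:6])

# Hypothetical helper: works on a whole vector of values at once
to_hex_pair <- function(value) {
  paste0(hex[value %/% 16 + 1], hex[value %% 16 + 1])
}

# Small made-up input
colors <- data.frame(
  red = c(227, 0, 255),
  green = c(66, 128, 255),
  blue = c(52, 255, 255)
)

# No loop: each column is converted in a single call
colors$hex <- paste0(
  "#",
  to_hex_pair(colors$red),
  to_hex_pair(colors$green),
  to_hex_pair(colors$blue)
)

print(colors)
```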

The results

Running the benchmark

I’ve simplified the for loop and apply implementations a little bit to better match the vectorized function. This way we have a better comparison between the three. Your benchmark results may be a little different because they depend on your computer.

The important column is relative as that shows a comparison between the three with the quickest function given a value of 1. Using an apply function took roughly 20x longer and a for loop roughly 60x longer than using a vectorized function.
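If you want to try a comparison like this yourself, here is a rough sketch that uses only base R's system.time rather than a dedicated benchmarking package. The function names, seed, and row count are assumptions, and no timings are shown because they vary from machine to machine.

```r
hex <- c(0:9, LETTERS[1:6])
to_hex_pair <- function(value) {
  paste0(hex[value %/% 16 + 1], hex[value %% 16 + 1])
}

set.seed(1) # hypothetical seed
n <- 5000   # hypothetical row count
colors <- data.frame(
  red = sample(0:255, n, replace = TRUE),
  green = sample(0:255, n, replace = TRUE),
  blue = sample(0:255, n, replace = TRUE)
)

hex_loop <- function(df) {
  out <- character(nrow(df))
  for (i in seq_len(nrow(df))) {
    out[i] <- paste0("#", to_hex_pair(df$red[i]),
                     to_hex_pair(df$green[i]), to_hex_pair(df$blue[i]))
  }
  out
}

hex_apply <- function(df) {
  apply(df, 1, function(row) {
    paste0("#", paste0(to_hex_pair(row), collapse = ""))
  })
}

hex_vector <- function(df) {
  paste0("#", to_hex_pair(df$red), to_hex_pair(df$green), to_hex_pair(df$blue))
}

# Elapsed seconds for one run of each version
timings <- c(
  loop = system.time(hex_loop(colors))[["elapsed"]],
  apply = system.time(hex_apply(colors))[["elapsed"]],
  vector = system.time(hex_vector(colors))[["elapsed"]]
)
print(timings)
```

A single run of system.time is noisy, so repeating each call several times and comparing the totals gives a steadier picture.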

Image: a hand-drawn Sonic the Hedgehog saying “Gotta go fast”

All the code for this article is available here. If you want to see more from me, check out my GitHub or guslipkin.github.io. If you want to hear from me, I’m also on Twitter at guslipkin.

Gus Lipkin is a Data Scientist, Business Analyst, and occasional bike mechanic
