# Writing Faster R with Vectorization and the {apply} family

One of my favorite things about R is that there are a lot of ways to do the same thing. Of course, this means that some ways are better than others depending on the use case. `for` loops, the `apply` family, and vectorization are all common ways to write code for large amounts of data in R, but it can be tricky to know when to use each one and how to use them.

I’ve divided this post into two parts: how to use each method in R, followed by a few examples of when you might want to use each one. I close everything out with a short benchmark demonstration to compare the three.

# What is a(n)…

## `for` loop

*If you’re familiar with programming, you can probably skip this section.*

A `for` loop lets you run the same code a specified number of times. The structure generally follows `for(x in y)` where `x` represents an item in `y`. If we think about a shopping basket with some apples, bananas, and carrots, we could write `for(food in basket)` and `food` would represent each item in our basket. It would be apples the first time, bananas the second time, and carrots the third time. We could also write it as `for(food in 1:length(basket))` where `1:length(basket)` is a vector of numbers that counts the items in your basket. Rather than `food` representing an item in your basket, it represents an index in the vector. In this example, apples are at index 1, bananas at 2, and carrots at 3. `for` loops are also very flexible and can be used on many data types such as vectors, data.frames, and matrices.
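As a minimal sketch of the two styles described above, assuming `basket` is a plain character vector (the names `seen` and `labels` are only for illustration):

```r
basket <- c("apples", "bananas", "carrots")

# Style 1: `food` takes the value of each item in turn
seen <- character(0)
for (food in basket) {
  seen <- c(seen, food)
}

# Style 2: `food` is an index into the vector
labels <- character(length(basket))
for (food in 1:length(basket)) {
  labels[food] <- paste(food, basket[food])
}

seen    # "apples" "bananas" "carrots"
labels  # "1 apples" "2 bananas" "3 carrots"
```

If the vector might be empty, `seq_along(basket)` is a safer choice than `1:length(basket)`, because `1:0` counts down from 1 to 0 instead of producing an empty sequence.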

Let’s say you have a `data.frame` called `basket` that has three columns. It has the `Food` column with the name of the food, the `PricePerUnit` column with the unit cost for each food, and the `Quantity` column with the number of units of each food in your basket. It looks like this:

And it can be recreated with this:
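The exact values are not important. A minimal stand-in, with prices and quantities invented purely for illustration, could be built and printed like this:

```r
# Hypothetical values: the prices and quantities are made up for this sketch
basket <- data.frame(
  Food = c("apples", "bananas", "carrots"),
  PricePerUnit = c(0.50, 0.25, 0.10),
  Quantity = c(4, 6, 10)
)

print(basket)
```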

If we wanted to get the total cost of everything in our basket, we could iterate over each row, multiplying the `PricePerUnit` and `Quantity` and adding the result to a running total.
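A sketch of that running total, assuming a `basket` with made-up prices and quantities:

```r
# Hypothetical values for illustration
basket <- data.frame(
  Food = c("apples", "bananas", "carrots"),
  PricePerUnit = c(0.50, 0.25, 0.10),
  Quantity = c(4, 6, 10)
)

total <- 0
for (i in 1:nrow(basket)) {
  # Cost of row i, added to the running total
  total <- total + basket$PricePerUnit[i] * basket$Quantity[i]
}

total  # 4.5
```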

## `apply` family

The `apply` family is part of base R and very similar to a `for` loop. Rather than running a set number of times, an `apply` function runs a function on each item in a `data.frame`, `list`, `vector`, or other object that can be applied to. While there are several functions in the `apply` family, I’m only going to talk about the three most common: `apply`, `lapply`, and `sapply`.

The biggest differences between the three are the types of input they accept and the types of output they return. `apply` takes a `data.frame` or matrix and has three main arguments. The first argument, `X`, is the object we’re passing to it. The second argument, `MARGIN`, is either `1`, `2`, or `c(1, 2)`, and says whether we want the function applied to rows, columns, or both rows and columns, respectively. The last argument, `FUN`, is the function to apply. `sapply` and `lapply` work the same way, except they don’t have the second argument because they take a vector or list, which doesn’t have multiple dimensions. `lapply` always returns a list, while `sapply` simplifies its result to a vector or matrix when it can. Generally speaking, the `apply` family will return a vector, list, or array of some kind.

If we go back to the shopping basket example, we can calculate the total with an `apply` function. Our first argument is the `basket`, the second is a `1` because we want to `apply` to every row, and the last is the function call. We can create the function in the `apply` call or we can create it earlier and then call it here.
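Here is one way this might look, again with made-up values. Because `apply` converts a `data.frame` to a matrix first, this sketch passes only the numeric columns so the rows stay numeric; the helper name `row_total` is hypothetical.

```r
# Hypothetical values for illustration
basket <- data.frame(
  Food = c("apples", "bananas", "carrots"),
  PricePerUnit = c(0.50, 0.25, 0.10),
  Quantity = c(4, 6, 10)
)
numeric_cols <- basket[, c("PricePerUnit", "Quantity")]

# Option A: define the function inside the apply call
totals_inline <- apply(numeric_cols, 1, function(row) {
  row[["PricePerUnit"]] * row[["Quantity"]]
})

# Option B: define the function earlier and pass it by name
row_total <- function(row) row[["PricePerUnit"]] * row[["Quantity"]]
totals_named <- apply(numeric_cols, 1, row_total)

total <- sum(totals_named)
total  # 4.5
```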

**A quick note on function calls in the `apply` family:**

If a function takes only one argument, the call can be written in three ways. Here `my_function` is a placeholder for any function name.

1. `sapply(X, function(x) { ... })` if the function is not predefined
2. `sapply(X, my_function)` if `my_function` is predefined
3. `sapply(X, function(x) my_function(x))` if `my_function` is predefined

Option two is most common for built-in functions such as `sum` or `as.numeric`, but can be used with any function.
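To make the three forms concrete, here is a small sketch using the built-in `sum` on a made-up list; all three calls should return the same result:

```r
X <- list(c(1, 2, 3), c(4, 5), 6)

# 1. Anonymous function defined in the call
a <- sapply(X, function(x) { sum(x) })

# 2. Predefined function passed by name
b <- sapply(X, sum)

# 3. Predefined function wrapped in an anonymous function
c3 <- sapply(X, function(x) sum(x))

a  # 6 9 6
```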

## Vector Operations

Vector operations are not a function like the `apply` family or a `for` loop, but rather a feature of the R language. Instead of operating on a vector one item at a time, R is able to do an operation on the entire vector in one line of code. Back to the `basket` example again, we know that the per item total is the `PricePerUnit` and `Quantity` multiplied together, and then we get the grand total by summing all of those values.
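With the same made-up basket as a stand-in, the vectorized version could be as short as this:

```r
# Hypothetical values for illustration
basket <- data.frame(
  Food = c("apples", "bananas", "carrots"),
  PricePerUnit = c(0.50, 0.25, 0.10),
  Quantity = c(4, 6, 10)
)

# Element-wise multiplication of two whole columns
item_totals <- basket$PricePerUnit * basket$Quantity

# Sum of the per item totals
total <- sum(item_totals)
total  # 4.5
```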

# When Should I use a(n)…

These examples are not exhaustive, and you may find cases where one approach turns out to be better than the others even when it seems like it should not be.

## `for` loop

`for` loops in R should be a last resort. They tend to be much slower than the `apply` family and vectorized code. They may be helpful when each iteration relies on the iteration before it, although then you might want to look into a recursive function if possible. You might find a `for` loop useful if you need to run the same block of code multiple times or iterate over elements of an object in a non-standard way, such as every other item. Any code that can be written with an `apply` function or a vector operation can also be written as a `for` loop.
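Two small sketches of those cases, with made-up data: stepping over every other item, and a calculation where each step depends on the previous one (the 10% growth rate is an arbitrary assumption).

```r
basket <- c("apples", "bananas", "carrots", "dates", "eggs")

# Every other item: indices 1, 3, 5
picked <- character(0)
for (i in seq(1, length(basket), by = 2)) {
  picked <- c(picked, basket[i])
}
picked  # "apples" "carrots" "eggs"

# Each iteration relies on the one before it
balance <- numeric(5)
balance[1] <- 100
for (i in 2:5) {
  balance[i] <- balance[i - 1] * 1.1
}
```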

## `apply` family

The `apply` family should be used when you want to operate on each element of an object, but treat them individually. This might be a list whose items are vectors of differing lengths, or a case where you want a specific type of output. Any vector operation can be written as an `apply` statement, but not all `for` loops can be converted.
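As an illustration, assume a hypothetical list `orders` where each day has a different number of purchases. `lapply` and `sapply` handle each element on its own and differ only in the shape of what they return:

```r
# Hypothetical data: item costs per day, with differing lengths
orders <- list(
  monday = c(2.00, 1.50),
  tuesday = 3.25,
  wednesday = c(0.50, 0.50, 1.00)
)

totals_list <- lapply(orders, sum)  # a list with one total per day
totals_vec <- sapply(orders, sum)   # a named numeric vector

totals_vec[["monday"]]  # 3.5
```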

## Vector Operations

Vector operations are the gold standard. They are fast and can be used in many cases, but not all. Most common use cases will be on vectors or columns of a `data.frame`. Many base functions such as `sum` and `as.numeric` are already vectorized. Many but not all `for` loops and `apply` functions can be written as vectorized operations.
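For example, a loop that converts strings to numbers one at a time can be replaced by a single call, because `as.numeric` already works on whole vectors. The input values here are made up:

```r
prices_chr <- c("0.50", "0.25", "0.10")

# Loop version: convert one element at a time
looped <- numeric(length(prices_chr))
for (i in seq_along(prices_chr)) {
  looped[i] <- as.numeric(prices_chr[i])
}

# Vectorized version: convert the whole vector at once
vectorized <- as.numeric(prices_chr)

identical(looped, vectorized)  # TRUE
```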

# Benchmarks

## Building the input

Rather than use the simple shopping basket example from before, I’ve written a small function that takes a `data.frame` of red, green, and blue values and adds a new column with the corresponding hex code.

And the resulting data should look like this:

We’ve also created a vector of values that can go in a hex code, with numbers 0–9 and letters A–F.
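A minimal sketch of such an input might look like the following. The column names `r`, `g`, and `b`, the seed, and the number of rows are assumptions for illustration, not the original setup.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 5       # hypothetical number of colors

# Random red, green, and blue values between 0 and 255
colors <- data.frame(
  r = sample(0:255, n, replace = TRUE),
  g = sample(0:255, n, replace = TRUE),
  b = sample(0:255, n, replace = TRUE)
)

# The 16 characters that can appear in a hex code
hex <- c(0:9, LETTERS[1:6])
```

Combining `0:9` with letters coerces everything to character, so `hex` is a character vector of length 16.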

## Creating the conversion function

I used this website for the math behind my functions. In essence, you divide each number by 16 and round down, and the resulting number corresponds to a position in `hex`. You then take the remainder of the division and get the `hex` value that that number corresponds to. If our value is `227`, then `227/16` rounds down to `14` and the remainder is `3`. Because vectors in R start at position 1, we add one to both for `15` and `4`. The corresponding values in `hex` are `E` and `3`, and so the hex pair for `227` is `E3`.
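The arithmetic for `227` can be checked directly with R's integer division (`%/%`) and remainder (`%%`) operators. The variable names are only for illustration:

```r
hex <- c(0:9, LETTERS[1:6])
value <- 227

first_digit <- value %/% 16   # 14, the division rounded down
second_digit <- value %% 16   # 3, the remainder

# Add one to each because R vectors start at position 1
pair <- paste0(hex[first_digit + 1], hex[second_digit + 1])
pair  # "E3"
```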

## Implementing the conversion function

**In a `for` loop**
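The following is a sketch of how a row-by-row loop could look, not necessarily the original implementation. The helper `to_hex_pair`, the column names, and the three sample colors are assumptions.

```r
hex <- c(0:9, LETTERS[1:6])

# Hypothetical sample colors
colors <- data.frame(r = c(227, 0, 255), g = c(16, 128, 255), b = c(9, 255, 0))

# Convert one value from 0 to 255 into its two-character hex pair
to_hex_pair <- function(value) {
  paste0(hex[value %/% 16 + 1], hex[value %% 16 + 1])
}

colors$hex <- NA_character_
for (i in seq_len(nrow(colors))) {
  colors$hex[i] <- paste0(
    "#",
    to_hex_pair(colors$r[i]),
    to_hex_pair(colors$g[i]),
    to_hex_pair(colors$b[i])
  )
}

colors$hex  # "#E31009" "#0080FF" "#FFFF00"
```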

**In an `apply` function**
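One possible `apply` version, under the same assumptions about column names and sample data, processes each row as a numeric vector of three values:

```r
hex <- c(0:9, LETTERS[1:6])

# Hypothetical sample colors
colors <- data.frame(r = c(227, 0, 255), g = c(16, 128, 255), b = c(9, 255, 0))

colors$hex <- apply(colors[, c("r", "g", "b")], 1, function(row) {
  # Build the three hex pairs for this row, then join them into one string
  pairs <- paste0(hex[row %/% 16 + 1], hex[row %% 16 + 1])
  paste0("#", paste0(pairs, collapse = ""))
})

unname(colors$hex)  # "#E31009" "#0080FF" "#FFFF00"
```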

**In a vectorized function**
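A vectorized sketch, again with assumed names and sample data, works on whole columns at once because `%/%`, `%%`, indexing, and `paste0` all operate on entire vectors:

```r
hex <- c(0:9, LETTERS[1:6])

# Hypothetical sample colors
colors <- data.frame(r = c(227, 0, 255), g = c(16, 128, 255), b = c(9, 255, 0))

# Works on a single value or on a whole column
to_hex_pair <- function(value) {
  paste0(hex[value %/% 16 + 1], hex[value %% 16 + 1])
}

colors$hex <- paste0(
  "#",
  to_hex_pair(colors$r),
  to_hex_pair(colors$g),
  to_hex_pair(colors$b)
)

colors$hex  # "#E31009" "#0080FF" "#FFFF00"
```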

**The results**

## Running the benchmark

I’ve simplified the `for` loop and `apply` implementations a little bit to better match the vectorized function. This way we have a better comparison between the three. Your benchmark results may be a little different because timings depend on your computer.
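For readers who want to try a comparison themselves, here is a rough sketch using only base R's `system.time`. It is not the original benchmark code: the function names, data size, and repetition count are all assumptions, and it computes a `relative` column by hand.

```r
hex <- c(0:9, LETTERS[1:6])
set.seed(42)
n <- 10000  # hypothetical number of rows
colors <- data.frame(
  r = sample(0:255, n, replace = TRUE),
  g = sample(0:255, n, replace = TRUE),
  b = sample(0:255, n, replace = TRUE)
)

to_hex_pair <- function(value) {
  paste0(hex[value %/% 16 + 1], hex[value %% 16 + 1])
}

with_for <- function(df) {
  out <- character(nrow(df))
  for (i in seq_len(nrow(df))) {
    out[i] <- paste0("#", to_hex_pair(df$r[i]), to_hex_pair(df$g[i]), to_hex_pair(df$b[i]))
  }
  out
}

with_apply <- function(df) {
  unname(apply(df, 1, function(row) {
    paste0("#", paste0(to_hex_pair(row), collapse = ""))
  }))
}

with_vector <- function(df) {
  paste0("#", to_hex_pair(df$r), to_hex_pair(df$g), to_hex_pair(df$b))
}

# Total elapsed seconds over `reps` runs of f on the sample data
time_it <- function(f, reps = 10) {
  system.time(for (k in seq_len(reps)) f(colors))[["elapsed"]]
}

elapsed <- c(
  for_loop = time_it(with_for),
  apply = time_it(with_apply),
  vectorized = time_it(with_vector)
)

# `relative` is each time divided by the fastest time
results <- data.frame(
  test = names(elapsed),
  elapsed = unname(elapsed),
  relative = unname(elapsed / min(elapsed))
)
print(results)
```

Timings from a sketch like this are noisy, so treat the ratios as rough indications and not as precise measurements.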

The important column is `relative`, as that shows a comparison between the three with the quickest function given a value of 1. Using an `apply` function took roughly 20x longer and a `for` loop roughly 60x longer than using a vectorized function.

All the code for this article is available here. If you want to see more from me, check out my GitHub or guslipkin.github.io. If you want to hear from me, I’m also on Twitter at guslipkin.

*Gus Lipkin is a Data Scientist, Business Analyst, and occasional bike mechanic*