Pivoting Data With {tidyverse} and {data.table} in R

Gus Lipkin
8 min readMar 31, 2022

--

Pivoting data can be a little scary sometimes, even if you know exactly what you want to do. But it doesn’t need to be. I’m hoping that by the end of this post, you’ll feel like the pivot pro you always knew you could be.

A giraffe (long) and a manta ray (wide)

Getting Started

As usual, the first thing we want to do is load our libraries. We’ll be using pivot_wider and pivot_longer from tidyverse as well as the pipe operator, %>%, from magrittr which is loaded through tidyverse. We’re also using dcast and melt from data.table as they are twins to pivot_wider and pivot_longer, respectively. While they are not identical, they perform the same functions and operate somewhat similarly.

Next, we want to create our dummy data. I’ve been working with theme park data a lot recently, so I modeled the dummy data after that. The data description is as follows:

Once created, the data looks like this:

What Are…

While I will show both pivot_wider/dcast and pivot_longer/melt in this post, I will default to the pivot_* term as it generally makes more sense on the surface.

`pivot_wider` and `dcast`

pivot_wider and dcast take data and reshape it so that there are more columns and fewer rows than the input data. It allows you to specify a column as a unique identifier and use the values in one or more columns as new column names. The last column is the values that you want to be placed in the proper intersection between the identifier column value and each new column value.

Original data on the left and wider data on the right (Data.csv and PivotWider.csv)

In the example above, I’ve pivoted the actualWait using the time and shortName columns. We can see that the first five rows of values in actualWait have become the first row of values in the wider data. The last row with The Amulet from Below at 8:05 becomes the value for amulet in the second row of the wider data where the time is 8:05.

While pivot_wider and dcast perform the same function, they behave a little bit differently. pivot_wider uses three main arguments id_cols, names_from, and values_from. dcast, on the other hand, uses a formula argument in place of both id_cols and names_from and value.vars in place of values_from. The second big difference is that column names pivot_wider don’t have to be in quotes. They can be, but they don’t have to be. On the other hand, only the formula dcast can go without quotes, and in that case, they must not have quotes while value.var must have quotes.

`pivot_longer` and `melt`

pivot_longer and melt take data and reshape it so that there are more rows and fewer columns than the input data. It allows you to specify columns that you want to be aggregated and the new column names for what were previously separate values columns.

Original data on the left and longer data on the right (Data.csv and PivotLonger.csv)

In the example above, the estimatedWait and actualWait columns are consolidated into the waitType the column where each row specifies if the corresponding waitTime the column is an estimated or actual wait time.

Like the previous two, pivot_longer and melt have some differences, although these are smaller. pivot_longer uses three primary arguments: cols, names_to, and values_to which correspond to measure.vars, variable.name, and value.name, respectively, from melt. Again, the data.table version, melt, requires quotes around each argument value while pivot_longer only requires quotes around names_to and values_to because cols is acting as a select statement.

Checking for Duplicates

One of the most important things to keep in mind before using pivot_wider or dcast is knowing if you should have duplicate data or not. If you shouldn’t have any duplicates, you’ll want to double-check to make sure. To show you how to do this, I’ve added the first five rows of dt to the end of dt. If you were to try and pivot the resulting data.table, each value cell would become a list rather than a single value and would look something like this:

That’s okay if you’re expecting multiple values, you’ll just need to remember use use values_fn for pivot_wider or fun.aggregate for dcast to make sure they are being handled appropriately. A common use-case would be something similar to pivot tables in Excel where you want to count each item. In that case, your function argument would be count.

The basic format to check for duplicates is x[duplicated(x)]. However, duplicated() only returns TRUE for the second or more occurrence of an item in x. In order to check for all duplicates, we need to also check fromLast like this: x[duplicated(x) | duplicated(x, fromLast = TRUE)]. Once you have identified your duplicate rows, it is up to you what to do. In some cases, you might be okay using unique(x) to get only unique items, or you might want to discard all duplicated items altogether with x[!duplicated(x) & !duplicated(x, fromLast = TRUE)].

Pivot Wider

Wider with One Value Column

In this example, we want to know how long the actualWait is for each ride throughout the day with each recording time as a row, and each ride as a column.

In the pivot_wider, we want our new column names_from to be the shortName and our values_from actualWait. The tricky bit here is the id_cols argument. id_cols are columns whose values should uniquely identify each row in the data. We need to use it because the rideName and shortName columns are perfectly correlated, that is, for each row the value of both columns will always be the same. Any column that is not in the names_from or values_from the argument will be part of the default id_cols argument. We can use the argument in two ways; we can use a numeric vector to select the columns that we want to be used as identifiers, or we can specify columns that we do not want to be used as a character or column name vector. In the example below, I’ve included both methods. For the rest of the post, I’ll be using the second method.

As mentioned previously, the biggest difference for dcast is that it uses a formula instead of the id_cols and names_from arguments. The left-hand side of the formula is the columns that you want to stay as columns to be used as the unique identifier for each row. The right-hand side of the formula is the column whose values will be used as the new column names. Lastly, value.var defines the column with the desired values.

Wider with Two or More Value Columns

Wider data with more than one value column is very similar to one value column. The only changes you need to make in pivot_wider are changing your id_cols argument if using the column name method and adding the new column to the values_from argument. The only change needed in dcast is adding to the value.var argument. One important thing to note is that the column names will now be a combination of the names_from/right hand side argument and the values_from/value.var argument. The column name order in the value* argument determines the column order in the resulting data.

Wider with Two or More ID Columns

Again, this is pretty similar to a “normal” wider operation. With pivot_wider you expand your id_cols to include the new columns, or, in my case, drop the attendance column so that it is now included. With dcast, you simply add the new column to the left hand side of the formula.

Pivot Longer

pivot_longer and melt are very similar. The largest differences are the argument names and where they require argument values to be in quotes. cols/measure.vars are the columns that you want to be aggregated into a single new column. This argument must be in quotes for melt, but does not have to be for pivot_longer and should be the names of existing columns. names_to/values_to is the name of the column that will say which column was used for that row and values_to/value.name are the corresponding values. The name for each of these new columns is up to you. I have decided on waitType to describe if it is an estimatedWait or actualWait and waitTime to hold the value.

I’ve saved a copy of the longer data as dtLonger so that we can explore pivot_wider and dcast a bit more.

Pivot Wider (from Longer)

Wider with Two or More Name Columns

We want to take our new longer data and change it to wider data where we know how long the actualWait and estimatedWait are for each ride throughout the day with each recording time as a row, using the waitType and each ride’s shortName.

For pivot_wider, we want our column names to use both the waitType and shortName and our values from waitTime. id_cols is then the remaining columns that we don’t want. With dcast, we want to add waitType and shortName to the right hand side of the formula so they are used as the column names, and time to the left hand side for the row identifiers. For both, values_from and value.var are our waitTime column.

Like when using multiple values in value_from/value.var, the new column names are all combinations of values in names_from or the right hand side. The combinations will be given in the order the column names are specified.

Wider with Mutiple ID and Name Columns

If you’ve been following along so far, the code below should make sense. We want to use time and attendance as the row identifiers and the estimatedWait and actualWait as the values for each ride.

Wrapping Things Up

A Quick Benchmark

I decided to benchmark the methods inPivotWider.R and PivotLonger.R to see how they stack up. data.table methods are faster than tidyverse and wider to longer is faster than longer to wider. Of course, the actual speeds can vary a lot depending on data size. You might also want to consider how readable your code is and the standard set of libraries you’re using if you’re pivoting data at work.

Conclusion

I’m hoping that by now you’re more comfortable with pivot_wider/dcast and pivot_longer/melt than you were before. Generally, before you begin reshaping data, you want to know what your resulting data should look like and if you’ll have any duplicate rows. From there, decide if you want to use tidyverse or data.table and then write your pivots!

All the code for this article is available here. If you want to see more from me, check out my GitHub or guslipkin.github.io. If you want to hear from me, I’m also on Twitter at guslipkin.

Gus Lipkin is a Data Scientist, Business Analyst, and occasional bike mechanic

--

--

Gus Lipkin
Gus Lipkin

Written by Gus Lipkin

My roommate said you can always rely on me to burn my food. At least I’m reliable | data scientist and occasional 🚲 mechanic | he/him | guslipkin.me

No responses yet