Moving Local Data Pipelines to Spark: The Easy Way with R and Python

Gus Lipkin
Aug 27, 2022

Intro

As data people, we know that “the cloud” is usually a server somewhere: maybe hosted on Azure, maybe a server in a closet, or maybe a ten-year-old laptop underneath someone’s desk. When we hear people ask about moving data and processes to the cloud, it’s hard not to think of the Little Green Men from Toy Story worshiping “The Claw”. Within a few short breaths, you’ve been asked to move all your current processes to the cloud. It’s a daunting task, but with a few handy tricks, we can make the SQL conversion relatively painless.

Getting Started

I’ve generally tried to keep the R and Python code as similar as possible, but that’s not always the wisest move. If you have any questions, don’t hesitate to reach out.

Our first step is to load our packages and connect to Spark. We’ll create a local Spark instance to work with rather than link to a server. On the R side of things, our core packages are {sparklyr} and {glue}, while Python uses pyspark and Python 3’s f-string functionality.

R

Python
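
On the Python side, a minimal sketch of that setup might look like this (the app name is just a placeholder):

```python
# Start a local Spark session; no cluster required.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")       # run Spark locally, using all available cores
    .appName("olympics")      # placeholder app name
    .getOrCreate()
)
```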

Prepping the Data

At this point “in the real world” you’ll have loaded your data into the table format of your choosing. Where I work, all of our tables are in Delta or Parquet format. For this example, rather than create dummy data like I usually do for a post, I’ve downloaded the Olympic Historical Dataset From Olympedia.org from Kaggle and moved everything to my data folder.

R

Python
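
In Python, reading the downloaded files might look something like this; the CSV file names are assumptions, so use whatever the Kaggle download actually gives you:

```python
# Read the Kaggle CSVs from the local data folder.
# Header and schema options depend on the files you actually downloaded.
athlete_bio = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("data/Olympic_Athlete_Bio.csv")              # assumed file name
)

athlete_results = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("data/Olympic_Athlete_Event_Results.csv")    # assumed file name
)
```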

By assigning the tables both to an R or Python variable and to a name in Spark, we’re able to access the data from an R or Python context as well as from a SQL context.

R

Python
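
A PySpark sketch of that registration step:

```python
# Register the DataFrames under Spark names so they are visible to SQL as well.
athlete_bio.createOrReplaceTempView("athlete_bio")
athlete_results.createOrReplaceTempView("athlete_results")

# The same data is now reachable from the Python variable...
print(athlete_bio.count())

# ...and from a SQL context via the Spark name.
spark.sql("SELECT COUNT(*) AS n FROM athlete_bio").show()
```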

Working With Views

Let’s say your database administrators (if they aren’t also you) are kind and have done a small amount of data cleaning, and you usually access the data through a mount point that provides a view. You pull the schema for the athlete_bio_vw view and get the following SQL, which references athlete_bio.
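
Something along these lines, say; the columns and the cleaning rule below are hypothetical, and yours will depend on what your DBAs actually did:

```python
# Hypothetical view definition pulled back from the mount point, kept as a string.
# The real one will differ depending on the cleaning your DBAs applied.
athlete_bio_vw_sql = """
CREATE OR REPLACE TEMP VIEW athlete_bio_vw AS
SELECT athlete_id, name, sex, born, country_noc
FROM athlete_bio
WHERE born IS NOT NULL
"""
```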

We can re-create the view in three ways. The first is to use R or Python to recreate the view from scratch, referencing the table variable name rather than the internal Spark name. The second is to wrap a SQL statement in our language of choice and use the internal Spark name. In either case, we assign the new data frame to tbl_athlete_bio_vw. However, because the result is now held in a variable, we no longer have direct and easy access to the view inside further SQL queries and would have to use R or Python to do any more analysis.

R

Python
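
A rough PySpark sketch of the first two ways, assuming the simple hypothetical view definition above (the column names and the born IS NOT NULL rule are assumptions):

```python
from pyspark.sql import functions as F

# Way 1: rebuild the view from scratch with the DataFrame API,
# referencing the Python variable rather than the Spark name.
tbl_athlete_bio_vw = (
    athlete_bio
    .filter(F.col("born").isNotNull())    # assumed cleaning rule
    .select("athlete_id", "name", "sex", "born", "country_noc")
)

# Way 2: wrap the SQL instead, using the internal Spark name.
tbl_athlete_bio_vw = spark.sql("""
    SELECT athlete_id, name, sex, born, country_noc
    FROM athlete_bio
    WHERE born IS NOT NULL
""")
```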

The third way is a little chaotic, but it’s actually my preferred method because I can save the queries I want to reuse as subqueries later on. First we save the query as a string, then use the R {glue} package or Python f-strings to run the SQL stored in the tbl_athlete_bio_qry variable. Because this method lends itself so well to subqueries, I don’t usually use it unless I intend to use the query later on.

If you’re not familiar, both glue and f-strings take a regular string and insert the value of the variable inside curly braces {}. For example, if you have a variable named name that holds someone’s name, you can use glue or f-strings to write "hello {name}"; when this is evaluated, if the name is Gus, it will print “hello Gus.”

R

Python
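
A Python sketch of the third way, reusing the same hypothetical query:

```python
# Way 3: keep the SQL itself as a string so we can reuse it later.
tbl_athlete_bio_qry = """
    SELECT athlete_id, name, sex, born, country_noc
    FROM athlete_bio
    WHERE born IS NOT NULL
"""

# The f-string expands the saved query inside the larger statement before it runs.
tbl_athlete_bio_vw = spark.sql(f"SELECT * FROM ({tbl_athlete_bio_qry}) AS athlete_bio_vw")
```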

Subqueries

You probably know that the idea behind subqueries is that they let you nest operations. The idea behind my third method (I really need a better name for it) is the same, except that instead of nesting queries directly, we’re nesting strings which are then evaluated as queries. In this example, we’re going to make another subquery called tbl_athlete_results_qry that contains all columns from the athlete_results Spark table for medal winners. We then want to return all rows from tbl_athlete_bio_vw for people who have won medals.

R

Python
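
In Python, that could look something like this, assuming athlete_results has medal and athlete_id columns:

```python
# All columns from athlete_results, but only for medal winners.
tbl_athlete_results_qry = """
    SELECT *
    FROM athlete_results
    WHERE medal IS NOT NULL
"""

# Nest both saved strings: bio rows for athletes who show up in the medal rows.
medal_athletes = spark.sql(f"""
    SELECT *
    FROM ({tbl_athlete_bio_qry}) AS bio
    WHERE bio.athlete_id IN (
        SELECT athlete_id FROM ({tbl_athlete_results_qry}) AS results
    )
""")
medal_athletes.show(5)
```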

If, for whatever reason, we wanted to go even deeper with the subqueries, we could. All we have to do is make sure we create our smallest-level queries first, then build them into the larger queries. Let’s save our query for athletes who have won medals as medal_athlete_qry. It’s important that when we save it, we use glue/f-strings so that the string is expanded properly. We can then use medal_athlete_qry as a subquery to count the number of medals won by each country.

R

Python
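
One way that could look in Python; note that the f-string expands the inner strings at the moment medal_athlete_qry is saved, and country_noc is an assumed column name:

```python
# Save the medal-winners query itself. The f-string expands the inner
# strings now, so what we store is plain SQL.
medal_athlete_qry = f"""
    SELECT *
    FROM ({tbl_athlete_results_qry}) AS results
    WHERE results.athlete_id IN (
        SELECT athlete_id FROM ({tbl_athlete_bio_qry}) AS bio
    )
"""

# Use the saved query as a subquery to count medals by country.
medals_by_country = spark.sql(f"""
    SELECT country_noc, COUNT(*) AS medals
    FROM ({medal_athlete_qry}) AS medal_athletes
    GROUP BY country_noc
    ORDER BY medals DESC
""")
medals_by_country.show(10)
```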

Assigning Variable Tables to Spark Tables

At this point, you might decide you want to assign your new tables back to Spark, rather than keeping them only in your R or Python environment. This turns out to be a relatively simple operation; here we register tbl_athlete_bio_vw under the Spark name athlete_bio_vw. We can then check the tables available to us to make sure the operation succeeded, and run a short SQL query to be extra sure it worked.

R

Python
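
A PySpark sketch of that round trip:

```python
# Register the DataFrame under a Spark name so SQL can see it directly.
tbl_athlete_bio_vw.createOrReplaceTempView("athlete_bio_vw")

# Check the catalog to confirm the view is registered...
print([t.name for t in spark.catalog.listTables()])

# ...then query it by name, no glue or f-strings required.
spark.sql("SELECT COUNT(*) AS n FROM athlete_bio_vw").show()
```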

Dynamic Queries

The last piece that ties all of this together is using variables as different components of your data operations. A relatively common task could be changing the dates on a monthly query to filter for the last thirty days. However, I don’t think any Olympians have been born in the last thirty days, probably because the last Olympics was more than thirty days ago. Nevertheless, we can filter for athletes who were born in the last thirty years. In R, the {lubridate} package makes the year subtraction a breeze, while python-dateutil does all the heavy lifting in Python.

R

Python
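
A Python sketch, assuming born is stored in (or can be parsed into) a date format in athlete_bio_vw:

```python
from datetime import date
from dateutil.relativedelta import relativedelta

# Thirty years back from today; dateutil handles the calendar arithmetic.
cutoff = date.today() - relativedelta(years=30)

# Drop the cutoff into the query with an f-string.
# `born` is assumed to be parseable by to_date(); adjust the format if not.
recent_athletes = spark.sql(f"""
    SELECT *
    FROM athlete_bio_vw
    WHERE to_date(born) >= to_date('{cutoff}')
""")
recent_athletes.show(5)
```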

Disconnecting From Spark

Before we wrap up, let’s disconnect from Spark real quick.

R

Python
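
In Python, that’s just stopping the session:

```python
# Shut down the local Spark session when you're done.
spark.stop()
```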

Wrapping Up

At the beginning of this post, I made a promise that moving your processes to the cloud doesn’t have to be super painful. It probably will still be painful, but I’m hoping that it’ll now be at most regular painful. We learned that we can translate our queries to R or Python, run a SQL query directly, or save the query as a string and insert it into later queries where it will be run directly. We can then take our new tables and transfer them back over to Spark so we can reference them directly without any shenanigans involving glue or f-strings. Lastly, we can use those same glue and f-string tricks to dynamically change our queries based on R or Python variables.

Resources

Both R and Python have some really great resources out there to help you get started. I highly recommend starting with the documentation for sparklyr and PySpark, as each has everything you need to know (besides the info in my post!) to get started.

- R: spark.rstudio.com

- Python: PySpark

All code in this article is available here. If you want to see more from me, check out my GitHub or guslipkin.github.io. If you want to hear from me, I’m also on Twitter @guslipkin.

Gus Lipkin is a Data Scientist, Business Analyst, and occasional bike mechanic
