Switching Between Tidyverse and Pandas for Tabular Data Wrangling: Tidying Data

hacking skills
Author

zenggyu

Published

2019-01-02

Abstract
Tidyverse and pandas alternatives for common data tidying operations.

Introduction

The third post of the “Switching Between Tidyverse and Pandas for Tabular Data Wrangling” series focus on another important aspect of data wrangling: tidying data (see Figure 1 , which is taken from (Wickham and Grolemund 2016)).

Figure 1: Data science process

As with the other posts in the series, this one does not intend to give a comprehensive introduction to related functions/methods and their usages. Instead, it only focuses on the most essential ones based on my personal experience. Where multiple solutions exist for the same purpose, the ones that minimizes the difference between tidyverse and pandas are chosen. Hopefully, these efforts will make the switch between the two packages easier.

Note: currently, this post only covers data-reshaping operations; more content may be added in the future.

Update: A new pair of functions (i.e., pivot_longer() and pivot_wider()) are now available in the tidyr R package, which are meant to replace some existing functions (i.e., gather() and spread(), which are described in this post). These successors are much more powerful and easier to use; check the vignette for a detailed introduction.

Prerequisites

The code provided in the next few sections requires the following packages:

  • tidyverse for R;
  • pandas for Python.

See my other posts (this and this) for instructions on how to install these packages.

Before running the code, make sure that the required packages are properly loaded:

# R
library(tidyverse)
# Python
import pandas as pd

Reshaping data

In this section, I will use Table 1 , which is taken from (Wickham 2014) with some modifications, as an example for demonstration. The table shows the frequencies of interviewee across different religions and income groups in the US.

Table 1: Religions and income groups
Religion 0K~20K 20K~40K 40K~75K
Agnostic 61 141 213
Atheist 39 89 105
Buddhist 48 64 91
Catholic 1035 1402 1754
Others 2040 2660 3053

According to Wickham (2014), this dataset can not be considered “tidy”. The three columns *K~*K are not names of variables, but values of one variable (Income). As a result, each row stores three observations (Frequency) instead of one.

To tidy this dataset, we need to stack the income columns and the corresponding frequencies into a new pair of variables, i.e., Income and Frequency. In tidyverse jargon, Income is referred to as the key, while Frequency is referred to as the value. For this purpose, tidyverse and pandas respectively offers the gather() function and the melt() method. The following code shows how to perform this kind of reshaping, assuming df_unstacked is the name of the original data frame:

# R
df_stacked <- df_unstacked %>% gather(key = "Income", value = "Frequency",
                     `0K~20K`, `20K~40K`, `40K~75K`)
# Or equivalently:
df_stacked <- df_unstacked %>% gather(key = "Income", value = "Frequency", -Religion)

Note that since 0K~20K, etc. are not syntactic names for columns, we have to surround them in backticks.

# Python
df_stacked = df_unstacked.melt(var_name="Income", value_name = "Frequency",
                               id_vars=["Religion"],
                               value_vars=["0K~20K", "20K~40K", "40K~75K"])

In some cases, you may also want to reverse the operation, i.e., unstack some columns. For this purpose, tidyverse provides the spread() function while pandas provides the pivot() method:

# R
df_unstacked <- df_stacked %>% spread(key = Income, value = Frequency)
# Python
df_unstacked = df_stacked.pivot(index = "Religion", columns = "Income", values = "Frequency")

Note that the pivot() method returns a data frame with the variable Religion as an index, not a column. To convert the index to a column, use reset_index(). E.g., df_unstacked = df_stacked.pivot(index = "Religion", columns = "Income", values = "Frequency").reset_index(). Additionally, you may also want to remove the name of the column label (Income); to do this, use df_unstacked.rename_axis("", axis = 1).

Figure 2 provides a visualization of the process of stacking/un-stacking the data set. Note how each component is colored differently: the cells associated with the key (i.e., Income) are green; the cells associated with the value (i.e., Frequency) are orange; the remaining cells (associated with Religion) are blue.

Figure 2: Stacking and un-stacking data

Learning more

If you want to learn more about transforming data using tidyverse or pandas, here are some useful resources:

  • R for Data Science by Wickham and Grolemund (2016);
  • Python for Data Analysis by McKinney (2018);
  • The documentation of tidyr (a module of tidyverse for data tidying) (tidyverse.org n.d.);
  • The documentation of pandas (pydata.org n.d.).

References

References

McKinney, Wes. 2018. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. Second edition. Sebastopol, California: O’Reilly Media, Inc.
pydata.org. n.d. “Reshaping and Pivot Tables — Pandas 0.23.4 Documentation.” Accessed January 6, 2019. http://pandas.pydata.org/pandas-docs/stable/reshaping.html.
tidyverse.org. n.d. “Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions • Tidyr.” Accessed January 2, 2019. https://tidyr.tidyverse.org/.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly.