The third post of the “Switching Between Tidyverse and Pandas for Tabular Data Wrangling” series focus on another important aspect of data wrangling: tidying data (see Fig. 1, which is taken from (Wickham and Grolemund 2016)).
As with the other posts in the series, this one does not intend to give a comprehensive introduction to related functions/methods and their usages. Instead, it only focuses on the most essential ones based on my personal experience. Where multiple solutions exist for the same purpose, the ones that minimizes the difference between tidyverse and pandas are chosen. Hopefully, these efforts will make the switch between the two packages easier.
Note: currently, this post only covers data-reshaping operations; more content may be added in the future.
Update: A new pair of functions (i.e.,
pivot_wider()) are now available in the
tidyr R package, which are meant to replace some existing functions (i.e.,
spread(), which are described in this post). These successors are much more powerful and easier to use; check the vignette for a detailed introduction.
The code provided in the next few sections requires the following packages:
Before running the code, make sure that the required packages are properly loaded:
# R library(tidyverse)
# Python import pandas as pd
In this section, I will use Table 1, which is taken from (Wickham 2014) with some modifications, as an example for demonstration. The table shows the frequencies of interviewee across different religions and income groups in the US.
According to Wickham (2014), this dataset can not be considered “tidy”. The three columns
*K~*K are not names of variables, but values of one variable (
Income). As a result, each row stores three observations (
Frequency) instead of one.
To tidy this dataset, we need to stack the income columns and the corresponding frequencies into a new pair of variables, i.e.,
Frequency. In tidyverse jargon,
Income is referred to as the key, while
Frequency is referred to as the value. For this purpose, tidyverse and pandas respectively offers the
gather() function and the
melt() method. The following code shows how to perform this kind of reshaping, assuming
df_unstacked is the name of the original data frame:
# R df_stacked <- df_unstacked %>% gather(key = "Income", value = "Frequency", `0K~20K`, `20K~40K`, `40K~75K`) # Or equivalently: df_stacked <- df_unstacked %>% gather(key = "Income", value = "Frequency", -Religion)
Note that since
0K~20K, etc. are not syntactic names for columns, we have to surround them in backticks.
# Python df_stacked = df_unstacked.melt(var_name="Income", value_name = "Frequency", id_vars=["Religion"], value_vars=["0K~20K", "20K~40K", "40K~75K"])
In some cases, you may also want to reverse the operation, i.e., unstack some columns. For this purpose, tidyverse provides the
spread() function while pandas provides the
# R df_unstacked <- df_stacked %>% spread(key = Income, value = Frequency)
# Python df_unstacked = df_stacked.pivot(index = "Religion", columns = "Income", values = "Frequency")
Note that the
pivot() method returns a data frame with the variable
Religion as an index, not a column. To convert the index to a column, use
df_unstacked = df_stacked.pivot(index = "Religion", columns = "Income", values = "Frequency").reset_index(). Additionally, you may also want to remove the name of the column label (
Income); to do this, use
df_unstacked.rename_axis("", axis = 1).
Fig. 2 provides a visualization of the process of stacking/unstacking the dataset. Note how each component is colored differently: the cells associated with the key (i.e.,
Income) are green; the cells associated with the value (i.e.,
Frequency) are orange; the remaining cells (associated with
Religion) are blue.
If you want to learn more about transforming data using tidyverse or pandas, here are some useful resources:
- R for Data Science by Wickham and Grolemund (2016);
- Python for Data Analysis by McKinney (2018);
- The documentation of tidyr (a module of tidyverse for data tidying) (tidyverse.org 2019);
- The documentation of pandas (pydata.org 2019).
McKinney, Wes. 2018. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. Second edition. Sebastopol, California: O’Reilly Media, Inc.
pydata.org. 2019. “Reshaping and Pivot Tables — Pandas 0.23.4 Documentation.” Accessed January 6. http://pandas.pydata.org/pandas-docs/stable/reshaping.html.
tidyverse.org. 2019. “Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions • Tidyr.” Accessed January 2. https://tidyr.tidyverse.org/.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). doi:10.18637/jss.v059.i10.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly.