The Grammar of Graphics and Tidy Data

Grammar of Graphics

If you have used R, you have probably used ggplot2, the go-to plotting package created by Hadley Wickham. He describes the ideas behind ggplot2 in depth in his paper "A Layered Grammar of Graphics":

"A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics." Wickham's paper builds on the paper by Wilkinson, Anand, and Grossman (2005) and puts its ideas into practice in ggplot2. I highly recommend reading both.

Thinking in terms of a grammar of graphics helps you work out how to visualize what you want. It saves you the time and effort of memorizing separate syntax for a bar chart versus a stacked bar chart with a dual axis. One downside, though, is that you will start seeing all the limitations of your current 'all in one' graphing tool, which locks you into its general-purpose solutions. :)
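To make the "components, not named charts" idea concrete, here is a minimal sketch in plain Python. It is not ggplot2's API: `PlotSpec`, `Layer`, and `differences` are hypothetical names made up for illustration, and the dataset and column names are invented. Only the vocabulary (aesthetic mappings, geoms, and the "dodge"/"stack" position adjustments) is borrowed from ggplot2. Nothing is drawn; the point is only that a stacked bar chart is an ordinary bar chart with one component swapped.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Layer:
    """One layer of a plot (hypothetical toy class, not a real library type)."""
    geom: str                    # geometric object, e.g. "point" or "bar"
    position: str = "identity"   # how overlapping elements are arranged


@dataclass(frozen=True)
class PlotSpec:
    """A plot described by its parts rather than by a chart name."""
    data: str          # name of the dataset (invented for this example)
    mapping: dict      # aesthetic -> variable, e.g. {"x": "quarter"}
    layers: tuple = ()


# A grouped (side-by-side) bar chart.
bar = PlotSpec(
    data="sales",
    mapping={"x": "quarter", "y": "revenue", "fill": "region"},
    layers=(Layer(geom="bar", position="dodge"),),
)

# A stacked bar chart: same data, same mapping, one component changed.
stacked = replace(bar, layers=(Layer(geom="bar", position="stack"),))


def differences(a: PlotSpec, b: PlotSpec) -> list:
    """Return the names of the top-level components that differ."""
    return [name for name in ("data", "mapping", "layers")
            if getattr(a, name) != getattr(b, name)]


print(differences(bar, stacked))  # → ['layers']
```

Once a chart is written down this way, "what kind of chart is this?" becomes "which components does it have?", which is the shift in thinking the grammar encourages.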


Tidy Data

After becoming familiar with the concepts in the paper above, you may find yourself frustrated with fitting your data into the structure ggplot2 requires. If you have done database work, you may have run into a similar situation: a business requirement changes, and you find yourself adding a new column with many null values or recreating an entire table first set up by your predecessor.

The R package tidyverse is a collection of packages meant to help you keep your data formatting consistent. As with ggplot2, you have to learn the syntax to use them, but I find the underlying concepts much more useful because they apply to a wide range of applications. You can read the paper on tidy data here. You may also go here for a less academic, more code-heavy version of the full tidy data paper.

The basic elements of tidy data are:

  1. Each variable you measure should be in one column.
  2. Each different observation of that variable should be in a different row.
  3. There should be one table for each "kind" of variable.
  4. If you have multiple tables, they should include a column in the table that allows them to be linked.
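As a rough illustration of the first two rules, the sketch below reshapes a "wide" table into a tidy "long" one using only the Python standard library. The data and the helper name `to_long` are invented for this example; in R you would normally reach for a tidyverse reshaping function instead.

```python
# Wide layout: one column per year, so the variable "year" is hidden
# in the column headers instead of having its own column.
wide = [
    {"country": "A", "1999": 10, "2000": 12},
    {"country": "B", "1999": 7, "2000": 9},
]


def to_long(rows, id_key, var_name, value_name):
    """Turn every non-id column into its own row: one observation per row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue  # the id column is repeated on each output row
            long_rows.append(
                {id_key: row[id_key], var_name: column, value_name: value}
            )
    return long_rows


tidy = to_long(wide, id_key="country", var_name="year", value_name="cases")
for row in tidy:
    print(row)
```

After the reshape, each variable (country, year, cases) has its own column and each country-year observation has its own row, which is the shape a grammar-of-graphics tool can map directly onto axes and colors.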


If this sounds familiar, it is basically Codd’s 3rd normal form, restated in statistical terms.
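The database parallel is easiest to see with rules 3 and 4. The sketch below, again plain Python with invented data and a hypothetical `join` helper, keeps one table per kind of thing and links them through a shared `site_id` column, much as a foreign key would in a relational schema. It assumes `site_id` is unique in the `sites` table.

```python
# One table per "kind" of thing, linked by a shared key column.
sites = [
    {"site_id": 1, "site_name": "North", "elevation_m": 120},
    {"site_id": 2, "site_name": "South", "elevation_m": 45},
]
readings = [
    {"site_id": 1, "day": "Mon", "temp_c": 14.0},
    {"site_id": 1, "day": "Tue", "temp_c": 15.5},
    {"site_id": 2, "day": "Mon", "temp_c": 18.0},
]


def join(left, right, key):
    """Inner-join two lists of dicts on a shared key column.

    Assumes each key value appears at most once in `right`.
    """
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


combined = join(readings, sites, key="site_id")
for row in combined:
    print(row)
```

Storing site details once, rather than repeating them on every reading, avoids the update problems normalization is meant to prevent; the join rebuilds the combined view only when an analysis or plot needs it.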

Every data science team member should be familiar with these concepts even if R is not widely used in your company.


