3. Intermediate plotting
[status: mostly-written]
Motivation
Plotting is maybe the main tool scientists use to develop their own insight into what they study, and is also the main tool they use to communicate their results and insights to others.
We continue to study plotting, introducing histograms and then learning how to plot directly from Python with matplotlib.
Prerequisites
The 10-hour “serious programming” course.
The “Data files and first plots” mini-course in Section 2.
Plan
We will:
Work through an example which motivates the introduction of histograms and of plotting from a program.
Introduce histograms.
Learn how to make plots directly in a python program with matplotlib.
3.1. A worked example
Histograms, bins, distributions… What are these? What kind of insight do they give?
Let’s start by looking at a data set on human height and seeing what we can do with it. We will use the Howell census data for the !Kung people of the Kalahari desert. I will guide you through trying to pull information out of this file, paying attention to where we introduce new ideas and techniques.
Download the file with
$ wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
$ wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell2.csv
Looking at these files we see that (a) they have the .csv
extension, meaning that they are “comma separated values”, but (b)
looking at the contents with head Howell1.csv
or less
Howell1.csv
shows that the data fields are separated by semicolons
instead of commas. We need to know this to give the right plotting
instructions.
Weight vs. height
Taking the “what do you do with witches” approach, let us immediately plot the data in this file:
$ gnuplot
gnuplot> set datafile separator ";"
gnuplot> plot 'Howell1.csv' using 1:2
What have we just done? we plotted the first two columns, and the file header tells us that those columns are height and weight, so let us make the plot more clear like this:
gnuplot> set datafile separator ";"
gnuplot> set xlabel "height (cm)"
gnuplot> set ylabel "weight (kg)"
gnuplot> plot 'Howell1.csv' using 1:2
What do we call this kind of plot? It’s a “scatter plot”: the height is not in any particular order, and there can be variability in weight for a given height, so there is no point in a line plot. That’s why we plot points for scatter plots.
Insight: taller people weigh more, but there is some variability (people can be overweight or underweight).
Terminology: scatter plot.
Terminology: we have plotted “weight versus height” or “weight as a function of height”.
Height vs. age
What more can we do? The first line tells us we also have age in column 3, so let’s look at height as a function of age:
gnuplot> set datafile separator ";"
gnuplot> set xlabel "age (yr)"
gnuplot> set ylabel "height (cm)"
gnuplot> plot 'Howell1.csv' using 3:1
Let’s add a grid to this one:
gnuplot> set grid
gnuplot> plot 'Howell1.csv' using 3:1
Type of plot: scatter plot of height vs. age.
Insight: you grow until age 20, then you stop growing. You might lose some height when you are really old.
Weight vs. age
We can guess that weight vs. age will look a bit like height vs. age, but check it out:
gnuplot> set datafile separator ";"
gnuplot> set xlabel "age (yr)"
gnuplot> set ylabel "weight (kg)"
gnuplot> set grid
gnuplot> plot 'Howell1.csv' using 3:2
Insight: as the !Kung people get older they weigh more until their 20s, then it stabilizes.
Distinguish female and male
The header in Howell1.csv
tells us that column 4 is 1 for male and
0 for female. What do we expect to see if we somehow distinguish the
male and female parts of the plot?
Let’s first make some guesses in our mind about what we should see. Would weight vs. height be significantly different for females and males? How about height vs. age and weight vs. age?
Now let’s plot it:
gnuplot> set datafile separator ";"
gnuplot> set ylabel "height (cm)"
gnuplot> set xlabel "sex (0=female, 1=male)"
gnuplot> set grid
gnuplot> plot 'Howell1.csv' using 4:1
If you peer at this plot closely you will see that the points are concentrated in a higher band for males than for females, but there are problems: (a) it’s deeply inelegant, (b) this visual insight could easily disappear with much more data, and (c) you cannot deduce any specific quantities from this plot.
We will look at a couple of ways in which this can be plotted better using the tools we have, but in the end we will realize that we are forcing our tools a bit, which is a sign that we need new tools.
First we try to split the file into two separate pieces. For this
file we can do it with the ever-amazing tool grep
. Note that the
lines for females have the snippet ;0
at line’s end and those for
males have ;1
at the end of a line, and now type:
$ grep ';0$' Howell1.csv > females.csv
$ grep ';1$' Howell1.csv > males.csv
$ gnuplot
gnuplot> set datafile separator ";"
gnuplot> set xlabel "age (yr)"
gnuplot> set ylabel "height (cm)"
gnuplot> set grid
gnuplot> plot 'females.csv' using 3:1
gnuplot> replot 'males.csv' using 3:1
The $
in the grep command matches the end of a line.
Insight: by age 20 the !Kung people have a height difference between men and women.
This took a bit of work. Another approach is to use a feature of gnuplot which colors the points according to the value of another column. This saves you from creating the two separat files. You can do it like this:
gnuplot> set datafile separator ";"
gnuplot> set xlabel "age (yr)"
gnuplot> set ylabel "height (cm)"
gnuplot> set grid
gnuplot> plot 'Howell1.csv' using 3:1:4 with points linecolor variable
This will show different colors for male and female.
Needing more tools
Both the approaches we saw to separate out female and male data are clumsy: the first one makes you create extra files, while the second one feels contrived and doesn’t give you very good control over the plot.
There is also another problem with scatter plots. They give good rapid insight, like “between the heights of of 140cm and 175cm there is agreat variability in weight. But they do not allow you quantify it. For example: the middle of the jumble of points does not allow you to say how many points are in there and to distinguish different parts. This is especially true of larger data sets.
Examples of questions you could not answer too well with the plots we have:
Which is the most common height among the !Kung people?
Are most adults close to that average height or does it vary a lot?
Can we see how those quantities vary for just grown-ups? Or just children of certain ages?
This leads us to introduce the new conceptual tool of the histogram and to discuss how to plot from within a python program.
3.2. Histograms
Looking at the question “hich is the most common height among the !Kung people?”, let us try to answer it this way: for each height from 135cm to 180cm, how many people are that tall?
Of course we can’t say “for each height”: if you measure precisely enough there will be just one person for each height!
So we break up that range into bins, for example 135 to 137, 137 to 139, …, 173 to 175, 175 to 179, 179 to 181. There should be some 22 bins, each of which is 2cm wide.
Then for each one of these bins we add up how many people have that height.
Terminology: the bins are those 2cm spaces, the bin edges are the minimum and maximum values for the bin. The bin width is the 2cm between the high edge and the low edge.
To do this, enter the program in Listing 3.2.1:
#! /usr/bin/env python3
"""Takes file where the first column has heights and makes a histogram
out of it, putting them into 2cm bins between 130cm and 180cm.
"""
import sys
import matplotlib.pyplot as plt
import numpy as np
def main():
fname = sys.argv[1]
heights = []
## define the "bin edges"
bin_edges = [i for i in range(135, 182, 2)]
n_bins = len(bin_edges) - 1
## initially all bins are empty
histogram = [0] * n_bins
## now open the file and read in all the values in the first
## column; put them in bins as we read them
with open(fname, 'r') as f:
lines = f.readlines()
for line in lines:
if line[0] == '"':
continue # skip the first line (it starts with a quote)
## break the line up into words, splitting on semicolons
words = line.split(';')
## the first word is the height; convert it to float
height = float(words[0])
## now put it in the right bin
for bin, low_edge in enumerate(bin_edges[:-1]):
if height >= low_edge and height < bin_edges[bin+1]:
## found it!
histogram[bin] += 1
## now write out a file with the midpoints of the bins on the x
## axis and the number of heights in that bin on the y axis
hist_out_fname = sys.argv[1] + '.hist'
with open(hist_out_fname, 'w') as f:
for bin in range(n_bins - 1):
midpoint = (bin_edges[bin] + bin_edges[bin+1]) / 2.0
f.write('%g %d\n' % (midpoint, histogram[bin]))
print('%g %d' % (midpoint, histogram[bin]))
print('wrote histogram to %s' % hist_out_fname)
print('you could plot in gnuplot with the command')
print('plot %s using 1:2 with boxes' % hist_out_fname)
if __name__ == '__main__':
main()
Run the program with:
$ ./make-height-histogram.py Howell1.csv
Then plot the resulting histogram with the following gnuplot instructions:
##REQUIRED_FILE: Howell1.csv
##REQUIRED_FILE: Howell1.csv.hist
##REQUIRED_FILE: females.hist
##REQUIRED_FILE: males.hist
##PRE_EXEC: wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
##PRE_EXEC: ./make-height-histogram.py Howell1.csv
##PRE_EXEC: grep ';0$' Howell1.csv > females.csv
##PRE_EXEC: grep ';1$' Howell1.csv > males.csv
set grid
set title 'height distribution of the !Kung people'
set xlabel 'height (cm)'
set ylabel 'number of people with that height
set style data histogram
set style fill solid 0.8 border -1
plot 'Howell1.csv.hist' using 1:2 with boxes
This plot seems to show two humps, one around 152cm and one around 160cm. But wait: haven’t we been told that height distribution should look like a bell shaped curve? This one does not.
So let us use the separate male and female data we obtained earlier
and that is in the files females.csv
and males.csv
:
First we make height histograms for females and males:
$ ./make-height-histogram.py females.csv
$ ./make-height-histogram.py males.csv
Then plot with:
##REQUIRED_FILE: Howell1.csv
##REQUIRED_FILE: Howell1.csv.hist
##REQUIRED_FILE: females.hist
##REQUIRED_FILE: males.hist
##PRE_EXEC: wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
##PRE_EXEC: ./make-height-histogram.py Howell1.csv
##PRE_EXEC: grep ';0$' Howell1.csv > females.csv
##PRE_EXEC: grep ';1$' Howell1.csv > males.csv
set grid
set title 'height distribution of the !Kung people'
set xlabel 'height (cm)'
set ylabel 'number of people with that height
set style data histogram
set style fill solid 0.8 border -1
plot 'Howell1.csv.hist' using 1:2 with boxes, \
'females.csv.hist' using 1:2 with boxes, \
'males.csv.hist' using 1:2 with boxes
We finally have what we were hoping for: a clear gaussian (bell-shaped) distribution of data around an average height. We had to separate male and female heights to get that.
Insight: these histograms of frequency of occurence of certain heights give us insight into the nature of human height.
To conclude: we have written a program which takes data and makes histograms out of it. It is useful to know how to do this, but we will see that this approach gets cumbersome (we have to run several different programs), so we will learn how to programs that use libraries to make histograms and to make plots.
We only looked at heights greater than 135cm because we were
interested in fully grown men and women. Do you see the problem
here? Some children might pass through height 135cm before they
reach full height, so we might get some data in there that is not
appropriate. Adjust make-height-histogram.py
to exclude people
under the age of 20 from the histograms.
Examine the CSV files at https://vincentarelbundock.github.io/Rdatasets/datasets.html using the approaches we have been using in this chapter.
3.3. Matplotlib
Matplotlib is a library for making plots within a python program. Using this we can manipulate data in a program and draw it right away, rather than always having to write out intermediate data files.
As with gnuplot, matplotlib allows us to make interactive plots or to write them out to graphics files (png, pdf, svg, …)
My opinion is that there are times when you want to use matplotlib and times when you want to write out text files and invoke gnuplot on it. Developing a feeling for what’s appropriate is a part of developing your personality as a scientist. A starting point is:
Use gnuplot (or another command line plotting program) for a quick exploration into a dataset, or for a reproducible command pipeline.
Use python+matplotlib when you are doing lots of data manipulation before generating the plot, or when you are generating plots at various stages of processing.
Before anything else we must install matplotlib for Python3:
$ sudo apt install python3-matplotlib
In class we follow the matplotlib tutorial offered by the developers of matplotlib:
https://matplotlib.org/users/pyplot_tutorial.html
(but we should also take the example with the cute multicolored bubbles from the newer tutorial, even though it does not work on matplotlib 2.0.0 which comes with ubuntu 16.04. That example is at: https://matplotlib.org/tutorials/introductory/pyplot.html)
Then we can dip in to some of the further material at:
https://matplotlib.org/tutorials/index.html
in particular we can take a tour of the “Sample Plots in Matplotlib”.
After doing this we can try some exercises:
Redo all the work from earlier sections in this chapter using matplotlib instead of gnuplot.
Make all the programs in the previous exercise take an optional command line option which is an output file for the plot. If there is no command line option then make the plot interactive.
3.4. A histogram snippet to conclude
Finally: I will reproduce here a useful snippet from the tour: how to make and plot histograms using numpy and matplotlib.
import numpy as np
import matplotlib.pyplot as plt
## [...] collect quantity to be histogrammed in an array x
n, bins, patches = plt.hist(x, 50, normed=1, facecolor='g', alpha=0.75)
plt.xlabel('base quantity')
plt.ylabel('Probability')
plt.title('Histogram of base quantity')
plt.grid(True)
plt.show()