7. Fitting functions to data

[status: content_complete_needs_review]

7.1. Motivation, Prerequisites, Plan

In Section 4 we looked at how to take a function and represent it as points and lines in a plot. Here we do the opposite: if we are given a collection of \((x, y)\) points we try to find what kind of function might have generated those points.

There are so many types of functions that there is some artistry involved in picking which kind of function to fit to a set of data.

In many cases we will want to fit a straight line to our data. Sometimes it will be a polynomial. Our intuition on what functions to try to fit can come from two sources: (a) looking at the data visually (“hey! that looks like a parabola!”), and (b) having some intuition about the process that underlies that data (“should be an exponential decay; let’s try fitting an exponential function!”)

Terminology: the techniques we use here are sometimes referred to as regression, linear regression, fitting, curve fitting, line fitting, least squares fitting, …

7.2. Examples to get started

Let us start by visualizing some data sets. We’ll start with the dataset we had for age, height and weight of the !Kung people of the Kalahari desert. We downloaded and analyzed this data set back in Section 3.1, but there we were looking at height and weight historgrams. Here we look at something simpler: height versus age.

In this example we will grab the data set,

$ wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
$ gnuplot
gnuplot> set datafile separator ";"
gnuplot> plot 'Howell1.csv' using 3:1 with points

You can carry it all out with the following instructions:

Listing 7.2.1 Instructions to plot the height vs. age for the !Kung.

##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot 'Howell1.csv' using 3:1 with points

and you should get the figure in Figure 7.2.1.

../_images/plot-height.svg — Figure 7.2.1 Height vs. age for the !Kung people of the Kalahari desert. Note that we have two separate behaviors of the height data: one for before the age of 18 (rapid growth), the other for after the age of 20 (mostly norizontal).

We see this drastic difference between lower and higher age. One thing that comes to mind is that the mechanisms that cause this are obvious biological ones: we grow fast when we are young, then we stop growing taller.

The rising height part of the graph (up to age 18) seems to be reasonably close to a straight line, as is the top part (from 20 years on). The entire plot does not at all look like a straight line.

Let us look at those separate areas in separate plots:

Listing 7.2.2 Zooming in on the straight line between ages 2 and 18.

##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot [2:20] 'Howell1.csv' using 3:1 with points

and you should get the figure in Figure 7.2.2.

../_images/plot-height-2-18.svg — Figure 7.2.2 Zooming in on the age 2-18 region of the height plot.

7.3. Straight line fits

7.3.1. Our goal

Let us state a bold goal: can we find a straight line that seems to fit the points in the rising area of the height plot of the !Kung people (Figure 7.2.2)?

Remember from Section 4.2 that a straight line looks like \(y = m x + b\) is described by two parameters: the slope \(m\) and the y intercept \(b\), so another way to phrase this goal is:

Given a collection of points \((x_i, y_i)\), find the slope and intercept for the line \(y = m x + b\) which most closely fits the points.

You could do this visually: print it, take a ruler, place it so that it runs through the points in the plot, and draw a line. The result is not really optimal, so we look for better techniques.

7.3.2. Stepping back: just two points

Let us start with just two points, one for a 3-year-old and the other for a 17-year-old. From the data file we can pick the (age, height) pairs: (3, 96.52) and (17, 142.875).

In middle school math we learn how to find the line that goes through these points. We write out the equation \(y = m x + b\) using the specific \((x, y)\):

(7.3.1)\[\begin{split}96.52 & = m \times 3 + b \\ 142.875 & = m \times 17 + b\end{split}\]

By subtracting the equations we get:

\[\begin{split}142.875 & - 96.52 = m \times 17 - m \times 3 = m \times 14 \\ & \implies m = (142.875 - 96.52) / 14 = 3.311\end{split}\]

and substituting back into the first equation we get:

\[\begin{split}96.52 & = 3.311 \times 3 + b \\ & \implies b = 96.52 - 3 \times 3.3111 = 86.5867\end{split}\]

So the equation for our line is:

\[\begin{split}m & = 3.311 \\ b & = 86.5867 \\ y & = 3.311 \times x + 86.5867\end{split}\]

Note

This is just a first stab at it! We do not know if we chose those two points well – in fact it seems from this picture that the 17-year-old we picked was shorter than average, which would skew the results. The 3-year-old is also taller than average. This example was just to get going so that we can talk about line fits. In Section 7.4 we will do proper line fitting.

7.3.3. Let’s plot that line with our data

Now that we have a line fit we should feel really excited about plotting it together with our data, so that we can get some visual satisfaction. I always encourage people to do this right away.

Let us take the plotting instructions in Listing 7.2.1 and add to it the plotting of our fitted line:

Listing 7.3.1 Height vs. age and a line fit for ages 2-18.

##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget --continue https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set key right bottom
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot [0:26] 'Howell1.csv' using 3:1 with points, \
     [2:18] 3.311 * x + 86.5867 lw 6 title "two point fit"

and you should get Figure 7.3.1.

../_images/plot-height-with-fit.svg — Figure 7.3.1 Height vs. age for the !Kung people of the Kalahari desert, with a line fit for ages 2-18.

7.3.4. Physical interpretation of the line fit

One realy cool thing we learn about fitting functions to data is that those functions have a physical interpretation! This gets scientists really excited, so I will mention it before we move on to a better procedure for fitting curves.

If you look closely at the units of measure in our scatter plot you see that we are plotting height (a measure of length) versus age (a measure of time). This means that the slope \(m\) is measured in units of length divided by time, in this case in centimeters per year. So the slope we found of 3.311 should really be reported as 3.311 centimeters/year, and it tells us how much children grow per year between the ages of 2 and 18.

Now let us look at the intercept \(b\) (in our case 86.5867). This number tels us what the value of y (height) is at time zero (birth). This would seem to imply that babies are born some 87cm tall.

Important

But wait! Babies are not born 87cm tall. They are born around 50cm tall. Then why did our straight line fit report that height at birth?? This is explained by remembering the original plot: in Figure 7.2.1 we see that the height forms a straight line between the ages of 2 and 18, but not for children less than 2. Those children grow much faster, so the straight line is not a valid approximation for infants. This means that, sadly, the intercept \(b\) does not give a gratifying physical interpretation in this case.

7.4. Proper line fitting

We’ve seen how to pick two points from our data set and make a line that goes through them, but this approach has some real problems: if I had chosen a very tall 3-year-old and a very short 17-year-old, then the slope would have been much smaller. I could have also skewed it the other way by picking a very short 3-year-old and a very tall 17-year-old, which would have given a much steeper slope. [FIXME: write exercise to do all of this]

So what is a good objective way of finding the “best fit”? This is much debated, and one get get quite subtle and make a real profession of the line fitting business, but for our purposes we will use the most comon approach. It is called “least squares fitting”.

Look at Figure 7.4.1:

../_images/Linear_least_squares_example2.svg — Figure 7.4.1 From wikipedia, https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics) A plot of the data points (in red), the least squares line of best fit (in blue), and the residuals (in green).

You see four points and an example of a line that tries to go through those four points. How well did it succeed?

Look at those green lines in the figure: they show how far the point is from the line, and they give a measure of the error in your fit. If you had chosen a radically different blue line, then the sum of all those green lines might be quite big, and you would have a poor fit.

The idea behind “least squares” is that you want to minimize the sum of the squares of those errors.

There is a lot going on here. You might immediately ask at least two questions: (a) why the sum of the squares of the errors? (b) how do you minimize?

For now I will give a very brief answer to the first question: squares are often considered “nicer” functions than just taking the distance itself. You can see this with this plot:

$ gnuplot
gnuplot> plot [-3:3] x*x
gnuplot> replot abs(x)

both the lines tend to grow bigger when your error is bigger, but the smooth one is more suited to various mathematical techniques.

As for the second question: we will minimize the sum of the squares of the errors using techniques from calculus. I don’t describe them here, but I will show a cool pair of equations. First wrap your head around this expression:

(7.4.1)\[E = \sum_{i=1}^N (m x_i + b - y_i)^2\]

I like formulae like this one because that capital greek sigma letter is visually appealing, but what does it all mean?

First note that in math we use the symbol \(\sum_{i=1}^N\) to mean “take a sum of all these terms, where the letter i will be replaced in turn by 1, 2, 3, … up to N.

Now let’s say you have N data points \((x_1, y_1), (x_2, y_2), ..., (x_N, y_N)\). Then the E in equation (7.4.1) is the sum of a bunch of terms, each of which looks like \((m x_i + b - y_i)^2\). That is difference between “what the fitted line would have given you” and “the real data point that you have”. Squared. These differences are often called the “residuals”.

So E is a pretty good measure of how poor your line fit is, so if you find the values of m and b that make E as small as possible then they might give you a nice straight line fit through your data.

To show you some more cool math typesetting, and to entice you to study calculus, I will show you the equations that are written to solve this problem:

(7.4.2)\[\begin{split}E & = \sum_{i=1}^N (m x_i + b - y_i)^2 \\ \frac{\partial E}{\partial m} & = 0 \\ \frac{\partial E}{\partial b} & = 0\end{split}\]

This gives you two equations which you can solve to get m and b. We will not go through the details of it since it is calculus material (“finding minima using derivatives”), but we will now learn to use a Python library that does this calculation for us.

Note

There are many pleasing aspects to learning about least squares fitting. One of them is that this is one of the earliest places where you run in to … FIXME

7.5. Using Python’s scientific libraries to fit lines

Python comes with an extensive scientific library called scipy. Scipy has a statistics subpackage, which in turn has a function called linregress() which does all that work for us.

Enter the program in

Listing 7.5.1 fit-height.py – fit a line through the height data and print out the slope and intercept.

#! /usr/bin/env python3

import sys
from scipy.stats import linregress

def main():
    if len(sys.argv) != 4:
        print('error: wrong number of arguments')
        print('usage: %s filename min_age max_age' % sys.argv[0])
        print('example: %s Howell1.csv 2 18' % sys.argv[0])
        sys.exit(1)
    fname = sys.argv[1]
    lowest_age = int(sys.argv[2])
    highest_age = int(sys.argv[3])
    
    xdata, ydata = load_file(fname, lowest_age, highest_age)
    slope, intercept, r_value, p_value, std_error = linregress(xdata, ydata)
    print('the least squares fit returns slope, intercept:')
    print('m =', slope, ', b =', intercept)


## load columns 2 and 0 from the file, return two data vectors.  only
## pick out ages betwen min_age and max_age (inclusive)
def load_file(fname, min_age, max_age):
    xdata = []
    ydata = []
    f = open(fname, 'r')
    for line in f.readlines()[1:]:
        words = line.split(';')
        x, y = float(words[2]), float(words[0])
        if x >= min_age and x <= max_age:
            xdata.append(x)
            ydata.append(y)
    f.close()
    return xdata, ydata

main()

Run the program with:

$ ./fit-height.py  Howell1.csv 2 18
the least squares fit returns slope, intercept:
m = 3.99026836995 , b = 79.6805438775

Now compare these values to those found in in Section 7.3.2 where we just picked two points to work with. It turns out that 3.99 is not too far from 4.12, and 79.68 is not far from 79.06, but they are different. Let us update Figure 7.3.1 to also show the line found by least squares linear regression:

Listing 7.5.2 Height vs. age, a naive line fit and a least squares fit for ages 2-18.

##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget --continue https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set key right bottom
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot [0:26] 'Howell1.csv' using 3:1 with points, \
    [2:18] 3.311 * x + 86.5867 lw 4 title "two point fit", \
    [2:18] 3.99026836995 * x + 79.6805438775 lw 4 title "least squares fit"

and you should get Figure 7.5.1.

../_images/plot-height-with-two-fits.svg — Figure 7.5.1 Height vs. age for the !Kung people of the Kalahari desert, with a naive line fit for ages 2-18, and a least squares fit as well. Note the difference between the two lines - the steeper line from the least squares fit is a better fit for the age 2-18 range.

This line fit from the least squares method was a much better fit to the 2-18 year old data, so we can now say that the !Kung youth grow approximately 3.99 centimeters/year.

7.6. When to not try a linear fit

We should only try linear fits if we have reason to believe that the data is linear (either visually or from the scientific mechanism that underlies the data).

In the case of our height plot, what would happen if we were to try to fit the entire plot, from age 0 to the maximum age?

Let us try and see what it looks like. We run our program with ages 0 and 90:

$ ./fit-height.py  Howell1.csv 0 90
the least squares fit returns slope, intercept:
m = 0.909605221912 , b = 111.571782869

What would this line look like? Plotting like this:

Listing 7.6.1 Gnuplot instructions to plot an inappropriate line fit.

##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget --continue https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set key right bottom
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot 'Howell1.csv' using 3:1 with points, \
     0.909605221912 * x + 111.571782869 lw 4 title "least squares line fit"

and you should get the plot in Figure 7.6.1.

../_images/plot-height-with-inappropriate-fit.svg — Figure 7.6.1 Height vs. age for the !Kung people of the Kalahari desert. We also plot a fit based on the entire plot, and we see that it is a poor fit to the data, since the data is not linear.

We see that the mechanisms that determine height are quite different for infants, youths and older people, so there is no single line fit.

7.7. Fitting curves

7.7.1. Polynomial fits

Let us look at a falling body. In fact let us look at the first careful measurement that was ever made of a falling body. In the mid-17th-century Italian astronomer Riccioli [Gra12] carried out careful experiments from a tower in Bologna and got the data in this table:

Table 7.7.1 Riccioli’s acceleration data
time	distance
0.833	10
1.66	40
2.50	90
3.33	160
4.17	250
1	15
2	60
3	135
4	240
4.33	280
1.08	18
2.17	72
3.25	162
4.33	280

You can download a CSV format file for this data: riccioli-table.csv

Newton’s law with a constant acceleration of gravity \(-g\) tells us that the position reached by a falling body from height \(h_0\) (280 meters), with initial velocity \(v_0 = 0\) is:

(7.7.1)\[y(t) = h_0 + v_0 t - \frac{1}{2} g t^2 = h_0 - \frac{1}{2} g t^2\]

When we calculate the coefficients for (7.7.1) we expect the first degree coefficient (initial velocity \(v_0\)) to be close to zero since we are letting it drop rather than tossing it. We also expect the constant term (initial height \(h_0\)) to be close to 280 roman feet, the height from which we drop it.

So we are interested in using the data from Riccioli’s mid-17th-century experiment that we put in file riccioli-table.csv:

Listing 7.7.1 fit-falling-body.py - fit a 2^nd degree polynomial to the data, then plot the data and the fit.

#! /usr/bin/env python3

import sys
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 10))

def main():
    if len(sys.argv) not in (2, 3):
        print(f'**error** usage: {sys.argv[0]} datafilename.csv [poly_deg]')
        sys.exit(1)
    input_fname = 'riccioli-table.csv'
    poly_degree = 2             # default is to fit a parabola
    if len(sys.argv) == 3:
        poly_degree = int(sys.argv[2])
    
    # use numpy's loadtxt() function to read the data columns
    t, y = np.loadtxt(input_fname, delimiter=',', 
                      skiprows=1, unpack=True)
    # in our file the y values were listed as growing from 0, but we want
    # them to fall from the top, so we subtract them from 280 roman feet
    # get the polynomial fit
    y = 280 - y
    ## now do the polynomial fit
    print(f'fitting a {poly_degree} degree polynomial to the points')
    params = np.polyfit(t, y, poly_degree)
    print('polynomial parameters:', params)
    print('acceleration of gravity in roman feet/sec^2:', -2*params[0])
    ## note: in Riccioli's calculation one roman foot is probably 0.301
    ## meters
    print('acceleration of gravity in meters/sec^2:', -2*params[0]*0.301)
    plt.scatter(t, y, label='Data')
    ## now show the polynomial
    pfunc = np.poly1d(params)
    t_regular = np.arange(np.min(t), np.max(t+0.01), 0.01)
    y_from_poly = pfunc(t_regular)
    plt.title("Riccioli's falling body data")
    plt.plot(t_regular, y_from_poly, label=f'degree {poly_degree} polynomial fit')
    ## show the plots
    plt.xlabel('time')
    plt.ylabel('height (roman feet)')
    plt.legend(loc='best')
    ## you can call plt.show() to view it interactively, or plt.savefig()
    ## to save image files
    #plt.show()
    assert(input_fname[-4:] == '.csv') # it should be a .csv file
    outfname_base = input_fname[:-4] + f'_deg{poly_degree}'
    save_plot_to_file(outfname_base, 'pdf')
    save_plot_to_file(outfname_base, 'svg')

def save_plot_to_file(outfname_base, fmt_suffix):
    out_fname = f'{outfname_base}.{fmt_suffix}'
    plt.savefig(out_fname)
    print(f'saved plot to file {out_fname}')

if __name__ == '__main__':
    main()

You can run the program with:

chmod +x fit-falling-body.py
./fit-falling-body.py riccioli-table.csv

It will find that the polynomial coefficients \((a, b, c)\) are -14.6918224, 0.62669318, 280.40412405 and the polynomial fit looks like:

(7.7.2)\[y(t) = y = -14.69182224 \times t^2 - 0.62669318 t + 280.40412405\]

The program saves the output in files riccioli-table_deg2.pdf and riccioli-table_deg2.svg, and we show it in Figure 7.7.1.

../_images/riccioli-table_deg2.svg — Figure 7.7.1 Position as a function of time for a falling body. The data is taken from Riccioli’s classic experiment in the mid 17th century. The line is a 2^nd degree polynomial fit of the data.

The coefficients for the polynomial in equation (7.7.2) and Figure 7.7.1 teach us that:

The 2^nd degree polynomial is a good fit to the data.
The coefficients in the falling body equation (7.7.1) have physical meanings. The constant term \(h_0\) is the height from which we drop the body, \(v_0\) is the initial velocity, and in the \(\frac{1}{2} g t^2\) term, g is the acceleration of gravity.
In our fit we have a very small value for \(v_0\) (close to 0), which matches the physical situation.
In our fit we have \(h_0\) very close to 280 roman feet, which matches the physical situation.
In our fit we have \(g\) close to \(29.38 \; {\rm roman \; feet}/s^2\). In modern units that gives us \(g = 8.84 \; m/s^2\), which is about 10% off of the current measured value of \(9.81 \; m/s^2\).

https://physicstoday.scitation.org/doi/pdf/10.1063/PT.3.1716

https://physicstoday.scitation.org/doi/full/10.1063/PT.3.1716

7.7.2. Overfitting

We all know that if you have two (distinct) points you can fit a unique straight line through them. If you write the equation for that straight line as \(y = m x + b\) you might find yourself thinking “hmm, there were two points that constrained that line, and there are two parameters \((m, b)\) (slope and intercept).

So two points gave unique values to two parameters. Is that “2” a coincidence?

It turns out that if you have three points in a plane, and they are not on a line together, then you can fit a unique parabola (2^nd degree polynomial) through them. If that parabola looks like \(y = a x^2 + b ^x + c\), then you will notice that there are three parameters \((a, b, c)\) being fixed by the three points that the curve must visit.

You might ask if you have 7 points, is there exactly one 6th degree polynomial that fits through all 7 points? The answer is yes, unless those 7 points are somewhat contrived. For example, if they all lie perfectly on a straight line, or on a parabola, or other lower order polynomial, then you will have multiple possible 6th degree polynomials that go through those pointes.

So you might ask the question:

Note

Do I get better fidelity to the data if I fit a very high degree polynomial to the data? For example, if I take the Riccioli data set, and fit a 9^th degree polynomial, will that capture some deep nuance in the data?

The answer is a clear no, and we will demonstrate that by trying it out and visualizing the results. After visualizing, we will try to gain some insight.

Start by modifying the program fit-falling-body.py to remove some of the data. It has 14 data points, so we could fit up to a 13^th degree polynomial. We will do a few experiments by calling the program with arguments different from two. Start with:

./fit-falling-body.py riccioli-table.csv 13

This should give you a plot like the one shown here:

where you can clearly see that the 2nd degree polynomial makes more sense than the 13th degree polynomial.

So what just happened? The 13th degree polynomial is a perfect fit: it passes exactly through every point in the data set, so why is it so clearly bogus?

This is a good topic to discuss in class, and it leads to discussions of “overfitting”.

7.7.3. Fitting arbitrary functions

Curve fitting is described here:

https://www.scipy-lectures.org/intro/scipy/auto_examples/plot_curve_fit.html

The equation we fit comes from Zhang, Jiang, Zhang, Liu, Wang and Loh “A Generalized SOC-OCV Model for Lithium-Ion Batteries and the SOC estimation for LNMCO Battery”. [ZJZ+16]

\[V_{\rm OCV}(s) = a + b (-\log_e s)^m + c s + d e^{n (s-1)}\]

where \(s\) is the “state of chage” (SOC) - how much charge the battery still has.

I found that with the data sets I had I needed to adapt it slightly and used:

\[V_{\rm OCV} = a + b s^{-m} + c s + d e^{n (s-1)}\]

Each term in this equation corresponds to a different chemical effect in the battery’s journey.

Listing 7.7.2 Battery discharge data: closed and open circuit voltage (volts) versus time (hours) for the discharge of two consumer-grade AA batteries in series. Source: Abby Wilson, private communication.

#time closed open
   2.48   2.74
   2.33   2.59
   2.29   2.55
   2.28   2.54
5   2.27   2.52
5   2.25   2.50
5   2.23   2.48
   2.21   2.46
   2.01   2.28
5   1.84   2.10
75  0.80   1.90

Run this program with:

$ ./fit-battery.py battery-discharge.dat

to get interactive graphical output, or to get the plot in a file you can use something like:

$ ./fit-battery.py battery-discharge.dat battery-discharge.pdf

The resulting plot should look like that in Figure 7.7.2.

../_images/battery-discharge.svg — Figure 7.7.2 A curve fit for the battery discharge data.

The parameters found were:

a = 2.53651
b = 0.0512756
c = -0.182952
d = -0.499311
m = 0.151882
n = 15.5123

So the fitted function looks like:

\[V_{\rm OCV}(s) = 2.54 + 0.051 s^{-0.152} - 0.183 s - 0.499 e^{15.5 (s-1)}\]

There are many assumptions I made here. For example, I assumed that (1 - SOC) is linearly proportional to time as you discharge. That’s not quite correct. But this is still an interesting example of how you can take a rather complicated function with various parameters (6 in this case) and fit it to a set of data.

7.8. Topics for further study

7.8.1. Interpolation and extrapolation

What do interpolation and extrapolation mean? Show an example where it works well, like with falling bodies.

Discuss what is a predictive model.

7.8.2. How high should the degree of the polynomial be?

Discuss overfitting and crazy polynomial behaviors.

Discuss “fitting unerlying truth” versus “fitting noise”.

Discuss a physical reason for having a certain degree of polynomial (or a certain functional form), and what happens when you don’t have that physical reason.

Discuss objective criteria for limiting the number of dimensions to avoid crazy polynomial behaviors.

Mention approaches that use functions other than polynomials, like Fourier analysis.