# 7. Fitting functions to data

[status: content_complete_needs_review]

## 7.1. Motivation, Prerequisites, Plan

In Section 4 we looked at how to take a
function and represent it as points and lines in a plot. Here we do
the opposite: if we are given a collection of \((x, y)\) points we
try to find *what kind of function might have generated those points*.

There are so many types of functions that there is some artistry involved in picking which kind of function to fit to a set of data.

In many cases we will want to fit a straight line to our data.
Sometimes it will be a polynomial. Our intuition on what functions to
try to fit can come from two sources: (a) looking at the data visually
(“hey! that looks like a parabola!”), and (b) having some intuition
about the *process* that underlies that data (“should be an
exponential decay; let’s try fitting an exponential function!”)

Terminology: the techniques we use here are sometimes referred to as regression, linear regression, fitting, curve fitting, line fitting, least squares fitting, …

## 7.2. Examples to get started

Let us start by visualizing some data sets. We begin with the dataset we had for age, height and weight of the !Kung people of the Kalahari desert. We downloaded and analyzed this data set back in Section 3.1, but there we were looking at height and weight histograms. Here we look at something simpler: height versus age.

In this example we will grab the data set,

```
$ wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
$ gnuplot
gnuplot> set datafile separator ";"
gnuplot> plot 'Howell1.csv' using 3:1 with points
```

You can carry it all out with the following instructions:

```
##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot 'Howell1.csv' using 3:1 with points
```

and you should get the figure in Figure 7.2.1.

We see a drastic difference between the lower and higher ages. The mechanisms that cause this are obvious biological ones: we grow fast when we are young, then we stop growing taller.

The rising height part of the graph (up to age 18) seems to be reasonably close to a straight line, as is the top part (from 20 years on). The entire plot does not at all look like a straight line.

Let us look at those separate areas in separate plots:

```
##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot [2:20] 'Howell1.csv' using 3:1 with points
```

and you should get the figure in Figure 7.2.2.

## 7.3. Straight line fits

### 7.3.1. Our goal

Let us state a bold goal: can we find a straight line that seems to fit the points in the rising area of the height plot of the !Kung people (Figure 7.2.2)?

Remember from Section 4.2 that a straight line
\(y = m x + b\) is described by two parameters: the
*slope* \(m\) and the *y intercept* \(b\), so another way to
phrase this goal is:

Given a collection of points \((x_i, y_i)\), find the slope and intercept for the line \(y = m x + b\) which most closely fits the points.

You could do this visually: print the plot, take a ruler, place it so that it runs through the middle of the points, and draw a line. But the result depends on your eye and hand, so we look for better techniques.

### 7.3.2. Stepping back: just two points

Let us start with just two points, one for a 3-year-old and the other for a 17-year-old. From the data file we can pick the (age, height) pairs: (3, 96.52) and (17, 142.875).

In middle school math we learn how to find the line that goes through these points. We write out the equation \(y = m x + b\) using each specific \((x, y)\) pair:

\[96.52 = 3 m + b\]

\[142.875 = 17 m + b\]

By subtracting the equations we get:

\[142.875 - 96.52 = (17 - 3) m \quad\Rightarrow\quad m = \frac{46.355}{14} \approx 3.311\]

and substituting back into the first equation we get:

\[b = 96.52 - 3 m \approx 86.5867\]

So the equation for our line is:

\[y = 3.311 x + 86.5867\]
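This two-point arithmetic is easy to check in a few lines of Python:

```python
# the two (age, height) points picked from the Howell1.csv data
x1, y1 = 3, 96.52
x2, y2 = 17, 142.875

m = (y2 - y1) / (x2 - x1)  # slope, in cm/year
b = y1 - m * x1            # y intercept, in cm
print(f'm = {m:.4f}, b = {b:.4f}')  # m = 3.3111, b = 86.5868
```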

Note

This is just a first stab at it! We do not know if we chose those two points well – in fact it seems from this picture that the 17-year-old we picked was shorter than average, which would skew the results. The 3-year-old is also taller than average. This example was just to get going so that we can talk about line fits. In Section 7.4 we will do proper line fitting.

### 7.3.3. Let’s plot that line with our data

Now that we have a line fit we should feel really excited about plotting it together with our data, so that we can get some visual satisfaction. I always encourage people to do this right away.

Let us take the plotting instructions in Listing 7.2.1 and add to it the plotting of our fitted line:

```
##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget --continue https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set key right bottom
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot [0:26] 'Howell1.csv' using 3:1 with points, \
[2:18] 3.311 * x + 86.5867 lw 6 title "two point fit"
```

and you should get Figure 7.3.1.

### 7.3.4. Physical interpretation of the line fit

One really cool thing we learn about fitting functions to data is that
those functions can have a *physical interpretation*! This gets
scientists really excited, so I will mention it before we move on to a
better procedure for fitting curves.

If you look closely at the *units of measure* in our scatter plot you
see that we are plotting *height* (a measure of length) versus *age*
(a measure of time). This means that the slope \(m\) is measured
in units of length divided by time, in this case in centimeters per
year. So the slope we found of 3.311 should really be reported as
3.311 centimeters/year, and it tells us how much children grow per
year between the ages of 2 and 18.

Now let us look at the intercept \(b\) (in our case 86.5867). This number tells us the value of \(y\) (height) at time zero (birth). This would seem to imply that babies are born some 87cm tall.

Important

But wait! Babies are not born 87cm tall. They are born around
50cm tall. Then why did our straight line fit report that height
at birth?? This is explained by remembering the original plot: in
Figure 7.2.1 we see that the height forms a straight line
between the ages of 2 and 18, but *not* for children less than 2.
Those children grow much faster, so the straight line is not a
valid approximation for infants. This means that, sadly, the
intercept \(b\) does not give a gratifying physical
interpretation in this case.

## 7.4. Proper line fitting

We’ve seen how to pick two points from our data set and make a line that goes through them, but this approach has some real problems: if I had chosen a very tall 3-year-old and a very short 17-year-old, then the slope would have been much smaller. I could have also skewed it the other way by picking a very short 3-year-old and a very tall 17-year-old, which would have given a much steeper slope. [FIXME: write exercise to do all of this]

So what is a good objective way of finding the “best fit”? This is much debated, and one can get quite subtle and make a real profession of the line fitting business, but for our purposes we will use the most common approach. It is called “least squares fitting”.

Look at Figure 7.4.1:

You see four points and an example of a line that *tries* to go
through those four points. How well did it succeed?

Look at those green lines in the figure: they show how far each point is from the line, and their lengths give a measure of the error in your fit. If you had chosen a radically different blue line, then the sum of the lengths of those green lines might be quite big, and you would have a poor fit.

The idea behind “least squares” is that you want to *minimize* the sum
of the *squares* of those errors.

There is a lot going on here. You might immediately ask at least two
questions: (a) why the sum of the *squares* of the errors? (b) how do
you minimize?

For now I will give a very brief answer to the first question: squares are often considered “nicer” functions than just taking the distance itself. You can see this with this plot:

```
$ gnuplot
gnuplot> plot [-3:3] x*x
gnuplot> replot abs(x)
```

Both curves grow bigger when your error is bigger, but the smooth one (\(x^2\)) is better suited to various mathematical techniques.

As for the second question: we will minimize the sum of the squares of the errors using techniques from calculus. I don’t describe them here, but I will show a cool pair of equations. First wrap your head around this expression:

\[E = \sum_{i=1}^N (m x_i + b - y_i)^2 \tag{7.4.1}\]

I like formulae like this one because that capital Greek sigma is visually appealing, but what does it all mean?

First note that in math we use the symbol \(\sum_{i=1}^N\) to mean “take a sum of all these terms, where the letter \(i\) will be replaced in turn by 1, 2, 3, … up to \(N\)”.

Now let’s say you have \(N\) data points \((x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\). Then the \(E\) in equation (7.4.1) is the sum of a bunch of terms, each of which looks like \((m x_i + b - y_i)^2\). That is the difference between “what the fitted line would have given you” and “the real data point that you have”. Squared. These differences are often called the “residuals”.

So E is a pretty good measure of how poor your line fit is, so if you
find the values of m and b that *make E as small as possible* then
they might give you a nice straight line fit through your data.
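To make the definition of \(E\) concrete, here is a direct translation into Python, applied to a few made-up points:

```python
def sum_of_squared_errors(m, b, xs, ys):
    """E(m, b): the sum over all points of (m*x_i + b - y_i)^2."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))

# made-up points that lie roughly along the line y = 2x + 1
xs = [0, 1, 2, 3]
ys = [1.1, 2.9, 5.2, 6.8]

print(sum_of_squared_errors(2, 1, xs, ys))   # small E: a good candidate line
print(sum_of_squared_errors(5, -2, xs, ys))  # large E: a poor candidate line
```

The line with the smaller \(E\) is the better fit; least squares asks for the \((m, b)\) that makes \(E\) smallest of all.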

To show you some more cool math typesetting, and to entice you to study calculus, I will show you the equations that are written down to solve this problem:

\[\frac{\partial E}{\partial m} = 0 \qquad \frac{\partial E}{\partial b} = 0\]

This gives you two equations which you can solve to get \(m\) and \(b\). We will not go through the details of it since it is calculus material (“finding minima using derivatives”), but we will now learn to use a Python library that does this calculation for us.
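Carrying out that calculus by hand leads to well-known closed-form formulas for \(m\) and \(b\). Here is a sketch of those formulas in Python, applied to made-up toy data:

```python
def least_squares_line(xs, ys):
    """Closed-form least squares fit, from solving dE/dm = 0 and dE/db = 0."""
    n = len(xs)
    sx = sum(xs)                             # sum of x_i
    sy = sum(ys)                             # sum of y_i
    sxx = sum(x * x for x in xs)             # sum of x_i^2
    sxy = sum(x * y for x, y in zip(xs, ys)) # sum of x_i * y_i
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

# made-up toy data, roughly along y = 2x + 1
xs = [0, 1, 2, 3]
ys = [1.1, 2.9, 5.2, 6.8]
m, b = least_squares_line(xs, ys)
print(m, b)   # close to slope 2 and intercept 1
```

This is exactly the calculation that the library function in the next section performs for us.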

Note

There are many pleasing aspects to learning about least squares fitting. One of them is that this is one of the earliest places where you run in to … FIXME

## 7.5. Using Python’s scientific libraries to fit lines

Python comes with an extensive scientific library called *scipy*.
Scipy has a statistics subpackage, which in turn has a function called
`linregress()`, which does all that work for us.

Enter the following program, which we will call `fit-height.py`:

```
#! /usr/bin/env python3
import sys
from scipy.stats import linregress

def main():
    if len(sys.argv) != 4:
        print('error: wrong number of arguments')
        print('usage: %s filename min_age max_age' % sys.argv[0])
        print('example: %s Howell1.csv 2 18' % sys.argv[0])
        sys.exit(1)
    fname = sys.argv[1]
    lowest_age = int(sys.argv[2])
    highest_age = int(sys.argv[3])
    xdata, ydata = load_file(fname, lowest_age, highest_age)
    slope, intercept, r_value, p_value, std_error = linregress(xdata, ydata)
    print('the least squares fit returns slope, intercept:')
    print('m =', slope, ', b =', intercept)

## load columns 2 and 0 from the file, return two data vectors. only
## pick out ages between min_age and max_age (inclusive)
def load_file(fname, min_age, max_age):
    xdata = []
    ydata = []
    f = open(fname, 'r')
    for line in f.readlines()[1:]:
        words = line.split(';')
        x, y = float(words[2]), float(words[0])
        if x >= min_age and x <= max_age:
            xdata.append(x)
            ydata.append(y)
    f.close()
    return xdata, ydata

main()
```

Run the program with:

```
$ ./fit-height.py Howell1.csv 2 18
the least squares fit returns slope, intercept:
m = 3.99026836995 , b = 79.6805438775
```

Now compare these values to those found in Section 7.3.2 where we just picked two points to work with. It turns out that 3.99 is not too far from 3.311, and 79.68 is not too far from 86.59, but they are clearly different. Let us update Figure 7.3.1 to also show the line found by least squares linear regression:

```
##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget --continue https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set key right bottom
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot [0:26] 'Howell1.csv' using 3:1 with points, \
[2:18] 3.311 * x + 86.5867 lw 4 title "two point fit", \
[2:18] 3.99026836995 * x + 79.6805438775 lw 4 title "least squares fit"
```

and you should get Figure 7.5.1.

This line fit from the least squares method was a much better fit to the 2-18 year old data, so we can now say that the !Kung youth grow approximately 3.99 centimeters/year.

## 7.6. When to not try a linear fit

We should only try linear fits if we have reason to believe that the data is linear (either visually or from the scientific mechanism that underlies the data).

In the case of our height plot, what would happen if we were to try to
fit the *entire* plot, from age 0 to the maximum age?

Let us try and see what it looks like. We run our program with ages 0 and 90:

```
$ ./fit-height.py Howell1.csv 0 90
the least squares fit returns slope, intercept:
m = 0.909605221912 , b = 111.571782869
```

What would this line look like? Plotting like this:

```
##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget --continue https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set key right bottom
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot 'Howell1.csv' using 3:1 with points, \
0.909605221912 * x + 111.571782869 lw 4 title "least squares line fit"
```

and you should get the plot in Figure 7.6.1.

We see that the mechanisms that determine height are quite different for infants, youths and older people, so there is no single line fit.

## 7.7. Fitting curves

### 7.7.1. Polynomial fits

Let us look at a falling body. In fact let us look at the first careful measurement that was ever made of a falling body. In the mid-17th-century Italian astronomer Riccioli [Gra12] carried out careful experiments from a tower in Bologna and got the data in this table:

| time (s) | distance (roman feet) |
|---|---|
| 0.833 | 10 |
| 1.66 | 40 |
| 2.50 | 90 |
| 3.33 | 160 |
| 4.17 | 250 |
| 1 | 15 |
| 2 | 60 |
| 3 | 135 |
| 4 | 240 |
| 4.33 | 280 |
| 1.08 | 18 |
| 2.17 | 72 |
| 3.25 | 162 |
| 4.33 | 280 |

You can download a CSV format file for this data:
`riccioli-table.csv`

Newton’s law with a constant acceleration of gravity \(-g\) tells us that the position reached by a falling body from height \(h_0\) (280 roman feet), with initial velocity \(v_0 = 0\), is:

\[y(t) = h_0 + v_0 t - \frac{1}{2} g t^2 \tag{7.7.1}\]
When we calculate the coefficients for (7.7.1) we expect the first degree coefficient (initial velocity \(v_0\)) to be close to zero since we are letting it drop rather than tossing it. We also expect the constant term (initial height \(h_0\)) to be close to 280 roman feet, the height from which we drop it.

So we are interested in using the data from Riccioli’s
mid-17th-century experiment that we put in the file
`riccioli-table.csv`:

```
#! /usr/bin/env python3
import sys
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 10))

def main():
    if len(sys.argv) not in (2, 3):
        print(f'**error** usage: {sys.argv[0]} datafilename.csv [poly_deg]')
        sys.exit(1)
    input_fname = sys.argv[1]
    poly_degree = 2             # default is to fit a parabola
    if len(sys.argv) == 3:
        poly_degree = int(sys.argv[2])
    # use numpy's loadtxt() function to read the data columns
    t, y = np.loadtxt(input_fname, delimiter=',',
                      skiprows=1, unpack=True)
    # in our file the y values were listed as growing from 0, but we want
    # them to fall from the top, so we subtract them from 280 roman feet
    y = 280 - y
    ## now do the polynomial fit
    print(f'fitting a {poly_degree} degree polynomial to the points')
    params = np.polyfit(t, y, poly_degree)
    print('polynomial parameters:', params)
    print('acceleration of gravity in roman feet/sec^2:', -2*params[0])
    ## note: in Riccioli's calculation one roman foot is probably 0.301
    ## meters
    print('acceleration of gravity in meters/sec^2:', -2*params[0]*0.301)
    plt.scatter(t, y, label='Data')
    ## now show the polynomial
    pfunc = np.poly1d(params)
    t_regular = np.arange(np.min(t), np.max(t) + 0.01, 0.01)
    y_from_poly = pfunc(t_regular)
    plt.title("Riccioli's falling body data")
    plt.plot(t_regular, y_from_poly, label=f'degree {poly_degree} polynomial fit')
    ## label the plot
    plt.xlabel('time')
    plt.ylabel('height (roman feet)')
    plt.legend(loc='best')
    ## you can call plt.show() to view it interactively, or plt.savefig()
    ## to save image files
    # plt.show()
    assert input_fname[-4:] == '.csv'   # it should be a .csv file
    outfname_base = input_fname[:-4] + f'_deg{poly_degree}'
    save_plot_to_file(outfname_base, 'pdf')
    save_plot_to_file(outfname_base, 'svg')

def save_plot_to_file(outfname_base, fmt_suffix):
    out_fname = f'{outfname_base}.{fmt_suffix}'
    plt.savefig(out_fname)
    print(f'saved plot to file {out_fname}')

if __name__ == '__main__':
    main()
```

You can run the program with:

```
chmod +x fit-falling-body.py
./fit-falling-body.py riccioli-table.csv
```

It will find that the polynomial coefficients \((a, b, c)\) are
`-14.6918224, 0.62669318, 280.40412405`, so the fitted polynomial is:

\[y(t) = -14.69 t^2 + 0.627 t + 280.4 \tag{7.7.2}\]

The program saves the plot in the files `riccioli-table_deg2.pdf`
and `riccioli-table_deg2.svg`, and we show it in Figure 7.7.1.

The coefficients for the polynomial in equation (7.7.2) and Figure 7.7.1 teach us that:

- The 2^{nd} degree polynomial is a good fit to the data.
- The coefficients in the falling body equation (7.7.1) have physical meanings. The constant term \(h_0\) is the height from which we drop the body, \(v_0\) is the initial velocity, and in the \(\frac{1}{2} g t^2\) term, \(g\) is the acceleration of gravity.
- In our fit we have a very small value for \(v_0\) (close to 0), which matches the physical situation.
- In our fit we have \(h_0\) very close to 280 roman feet, which matches the physical situation.
- In our fit we have \(g\) close to \(29.38 \; {\rm roman \; feet}/s^2\). In modern units that gives us \(g = 8.84 \; m/s^2\), which is about 10% off of the currently measured value of \(9.81 \; m/s^2\).
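The unit conversion is simple to check: the fitted \(t^2\) coefficient is \(-\frac{1}{2} g\), so \(g = -2a\):

```python
a = -14.6918224         # fitted coefficient of the t^2 term, in roman feet/s^2
g_roman = -2 * a        # the t^2 term in the model is -(1/2) g t^2
g_si = g_roman * 0.301  # one roman foot is probably 0.301 meters
print(round(g_roman, 2))  # 29.38
print(round(g_si, 2))     # 8.84
```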

https://physicstoday.scitation.org/doi/full/10.1063/PT.3.1716

### 7.7.2. Overfitting

We all know that if you have two (distinct) points you can fit a
*unique* straight line through them. If you write the equation for
that straight line as \(y = m x + b\) you might find yourself
thinking “hmm, there were two points that constrained that line, and
there are two parameters \((m, b)\) (slope and intercept)”.

So two points gave unique values to two parameters. Is that “2” a coincidence?

It turns out that if you have three points in a plane, and they are
not on a line together, then you can fit a *unique* parabola (2^{nd} degree polynomial) through them. If that parabola looks
like \(y = a x^2 + b x + c\), then you will notice that there are
*three* parameters \((a, b, c)\) being fixed by the *three* points
that the curve must visit.
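You can verify the parabola claim numerically with `numpy.polyfit` (the three points here are made up for illustration):

```python
import numpy as np

# three hypothetical points, not all on one line
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 0.0, 3.0])

a, b, c = np.polyfit(x, y, 2)  # the unique parabola y = a x^2 + b x + c
print(a, b, c)  # 2.0 -3.0 1.0 (up to rounding)
```

With three points and three parameters, the “fit” passes exactly through every point.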

You might ask: if you have 7 points, is there exactly one 6th degree
polynomial that fits through all 7 points? The answer is yes, as long
as the 7 points have distinct \(x\) values: there is exactly one
polynomial of degree *at most* 6 through them. If the points happen
to lie perfectly on a straight line, or on a parabola, or some other
lower order polynomial, then that unique interpolating polynomial is
simply the lower order one, and no polynomial of degree exactly 6
passes through those points.

So you might ask the question:

Note

Do I get better fidelity to the data if I fit a very high degree
polynomial to the data? For example, if I take the Riccioli data
set, and fit a 9^{th} degree polynomial, will that capture
some deep nuance in the data?

The answer is a clear *no*, and we will demonstrate that by trying it
out and visualizing the results. After visualizing, we will try to
gain some insight.

Go back to the program `fit-falling-body.py`. The data set has 14
points, so we could fit up to a 13^{th} degree polynomial. We will do
a few experiments by calling the program with a polynomial degree
different from two. Start with:

```
./fit-falling-body.py riccioli-table.csv 13
```

This should give you a plot like the one shown here:

where you can clearly see that the 2nd degree polynomial makes more sense than the 13th degree polynomial.

So what just happened? The 13th degree polynomial is a *perfect* fit:
it passes exactly through every point in the data set, so why is it so
clearly bogus?

This is a good topic to discuss in class, and it leads to discussions of “overfitting”.
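Here is a small numerical illustration of the effect, using synthetic data made up for this demonstration (a noisy parabola, not the actual Riccioli file): we fit both the physically motivated degree 2 and the maximal interpolating degree.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.8, 4.3, 8)
# synthetic "falling body" heights: a parabola plus measurement noise
y = 280 - 14.7 * t**2 + rng.normal(0, 3.0, t.size)

quad = np.poly1d(np.polyfit(t, y, 2))             # physically motivated fit
interp = np.poly1d(np.polyfit(t, y, t.size - 1))  # degree 7: interpolates all 8 points

# the interpolating polynomial has (nearly) zero residuals at the data points,
# but that only means it is chasing the noise, not the underlying law
r_quad = np.sum((quad(t) - y)**2)
r_interp = np.sum((interp(t) - y)**2)
print('degree 2 residual sum:', r_quad)
print('degree 7 residual sum:', r_interp)
```

The high-degree fit “wins” on the residuals at the data points, yet between and beyond the data points it oscillates wildly, as the plot above shows.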

### 7.7.3. Fitting arbitrary functions

Curve fitting is described here:

https://www.scipy-lectures.org/intro/scipy/auto_examples/plot_curve_fit.html

The equation we fit comes from Zhang, Jiang, Zhang, Liu, Wang and Loh “A Generalized SOC-OCV Model for Lithium-Ion Batteries and the SOC estimation for LNMCO Battery”. [ZJZ+16]

where \(s\) is the “state of charge” (SOC): how much charge the battery still has.

I found that with the data sets I had I needed to adapt it slightly and used:

Each term in this equation corresponds to a different chemical effect in the battery’s journey.

The data file `battery-discharge.dat` looks like this:

```
#time closed open
0 2.48 2.74
1 2.33 2.59
2 2.29 2.55
3 2.28 2.54
3.5 2.27 2.52
4.5 2.25 2.50
5.5 2.23 2.48
6 2.21 2.46
8 2.01 2.28
8.5 1.84 2.10
8.75 0.80 1.90
```
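The program `fit-battery.py` is not reproduced here. A minimal sketch of how such a fit can be done with `scipy.optimize.curve_fit`, assuming the standard Zhang et al. functional form \(V(s) = a + b(-\ln s)^m + c s + d\, e^{n(s-1)}\) (an assumption on my part; the adapted form used above may differ), and run on synthetic demonstration data, might look like:

```python
#!/usr/bin/env python3
import numpy as np
from scipy.optimize import curve_fit

def ocv_model(s, a, b, c, d, m, n):
    # assumed Zhang et al. SOC-OCV form; the adapted form in the text may differ
    return a + b * (-np.log(s))**m + c * s + d * np.exp(n * (s - 1.0))

# synthetic demonstration data: state of charge strictly inside (0, 1)
s = np.linspace(0.02, 0.98, 50)
true_params = (2.5, 0.05, -0.18, -0.5, 0.15, 15.0)
v = ocv_model(s, *true_params)

# a reasonable initial guess matters for a 6-parameter nonlinear fit
p0 = (2.0, 0.1, -0.1, -0.1, 0.5, 10.0)
params, _cov = curve_fit(ocv_model, s, v, p0=p0, maxfev=20000)
print('fitted parameters:', params)
```

With real measurements you would load the `(time, voltage)` columns from the data file and map time to \(s\) before fitting.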

Run this program with:

```
$ ./fit-battery.py battery-discharge.dat
```

to get interactive graphical output, or to get the plot in a file you can use something like:

```
$ ./fit-battery.py battery-discharge.dat battery-discharge.pdf
```

The resulting plot should look like that in Figure 7.7.2.

The parameters found were:

```
a = 2.53651
b = 0.0512756
c = -0.182952
d = -0.499311
m = 0.151882
n = 15.5123
```

So the fitted function looks like:

There are many assumptions I made here. For example, I assumed that (1 - SOC) is linearly proportional to time as you discharge. That’s not quite correct. But this is still an interesting example of how you can take a rather complicated function with various parameters (6 in this case) and fit it to a set of data.

## 7.8. Topics for further study

### 7.8.1. Interpolation and extrapolation

What do interpolation and extrapolation mean? Show an example where it works well, like with falling bodies.

Discuss what is a *predictive* model.

### 7.8.2. How high should the degree of the polynomial be?

Discuss overfitting and crazy polynomial behaviors.

Discuss “fitting underlying truth” versus “fitting noise”.

Discuss a physical reason for having a certain degree of polynomial (or a certain functional form), and what happens when you don’t have that physical reason.

Discuss objective criteria for limiting the number of dimensions to avoid crazy polynomial behaviors.

Mention approaches that use functions other than polynomials, like Fourier analysis.