# 2. Starting out: data files and first plots

[status: mostly-complete-needs-polishing-and-proofreading]

## 2.1. Motivation, prerequisites, plan

Data plotting and data visualization is the key

## 2.2. Very first data plots with gnuplot

Our first goal is to become comfortable with data files and with plotting. We first get the students to renew their acquaintance with creating files with an editor and make a file with some hand-crafted data.

Use your favorite editor (possibly `emacs`

for those who have taken
my previous course, but `vi`

or `gedit`

should also work) to open
a file called `simpledata.dat`

Enter two columns of simple data into this file. For example:

```
1970 16.7
1980 17.2
1990 17.5
2000 18.2
2010 19.8
2020 20.1
2030 22.7
2060 28
```

Then save it, and enter `gnuplot`

to plot this data:

```
$ sudo apt-get install gnuplot-x11
$ gnuplot
gnuplot> plot 'simpledata.dat'
```

then have the students plot the data with slightly different options in gnuplot:

```
$ gnuplot
gnuplot> plot 'simpledata.dat' with lines
gnuplot> set xlabel 'this is the "x" axis'
gnuplot> set ylabel 'this is the "y" axis'
gnuplot> plot 'simpledata.dat' with linespoints
```

## 2.3. Plotting *functions* with gnuplot

More examples of using gnuplot. We don’t assume knowledge of trigonometry from younger students, so we tell the story as “this is the sin function, which you will learn about some day; it plots these waves.”

On the other hand the polynomial \(y = x^3 - 3x\) should be within their reach: I might call on the class to tell me “what’s \(-10^3\) and \(-2^3\), \(2^3\), and \(10^3\)” - this establishes that the plot goes down on the left hand side, and up on the right hand side. The interesting play in the middle can be narrated by showing that \((1/2)^3\) is smaller than \(3 * (1/2)\), so that the negative term dominates briefly (see Figure 2.3.1)..

```
gnuplot> plot sin(x)
gnuplot> plot x*x*x - 3*x
gnuplot> plot x**3 - 3*x
## note that this last one did show a dip in the middle,
## but zooming in on the range from -3 to 3 shows the
## interesting features in the middle of the plot. This
## plot has two flat points (one local maximum and one
## local minimum), rather than one saddle point.
gnuplot> plot [-3:3] x**3 - 3*x
```

If I am getting a good mathematical vibe from the classroom I will briefly step into calculus territory and ask students to predict the local max and min for \(y = x^3 - 3x\). I will then quickly show them that

and using the quadratic formula

we get \(\pm 6 / 6 = \pm 1\), which is exactly where the local max and min are located (see Figure 2.3.1).

You can now set the grid in the plot and see if this calculation matches the plot:

```
##CAPTION: Simple line plot.
set grid
## while we're at it also show the derivative:
plot [-3:3] x**3 - 3*x, 3*x**2 - 3
```

The two plots, superimposed in Figure 2.3.1, show that where the derivative function (\(3x^2-3\)) is zero, the original function (\(x^3-3x\)) has its flat point. This can be presented, especially to older kids, in a rapid way that conveys “you don’t have to understand this, but if you have heard about slopes then note that the second curve shows the slope of the first one…” For the younger kids we can emphasize that the figure shows two functions and looks intriguing.

Visualizing functions like this is cool and it can be useful during the exploratory phase of a research project, but it seldom comes up in the bulk of the work and in the final scientific write-up. We will now move on to tasks which come up quite frequently.

## 2.4. Reading and writing files, in brief

First let us make sure we know a couple of shell commands to look at a file. Here I usually will take a portion of the board and write a boxed inset “cheat sheet” with some useful shell commands. 4

Since we already have a file called `simpledata.dat`

which we
created earlier, let us look at three shell commands that give us a
quick glance at what’s in that file: `head`

, `tail`

and `less`

.

```
$ head simpledata.dat
$ tail simpledata.dat
$ less simpledata.dat
## (when using less be sure to quit with 'q')
```

These are simple ways to peek at a file, and will work with any text file. You should always remember these commands.

Next we will look at how to read a file in a Python program. This is a
crucial pattern and we will use it a lot. Type in the program
`reader.py`

shown in Listing 2.4.1 and run it to see
what happens.

```
#! /usr/bin/env python3
"""show a simple paradigm for reading a file with two columns of
data"""
def main():
fname = 'simpledata.dat' # the file we wrote out by hand
dataset = read_file(fname)
print('I just read file %s with %d lines' % (fname, len(dataset)))
print('I will now print the first 10 lines')
N = min(10, len(dataset))
for i in range(N):
print(dataset[i])
def read_file(fname):
dataset = []
f = open(fname, 'r')
for line in f.readlines():
words = line.split()
x = float(words[0])
y = float(words[1])
dataset.append((x, y))
f.close()
return dataset
main()
```

Remember that after writing and saving the program you do the following to make it executable and then run it:

```
$ chmod +x reader.py
$ ./reader.py
```

Finally let us see how to write files to disk. We will extend to do an
easy manipulation of the file `simpledata.dat`

and then write it
back out to a new file `simpledata.dat.sums`

. This new program will
be called `simple-writer.py`

, so we need to copy it first:

```
$ cp reader.py simple-writer.py
```

and edit `simple-writer.py`

to look like
Listing 2.4.2.

```
#! /usr/bin/env python3
"""show a simple paradigm for writing a file after reading it and
adding some content to it"""
def main():
fname = 'simpledata.dat' # the file we wrote out by hand
dataset = read_file(fname)
print('I just read file %s with %d lines' % (fname, len(dataset)))
print('I will now print the first 10 lines')
for i in range(10):
print(dataset[i])
print('I will now modify the data')
summed_data = append_sums(dataset)
write_file(fname + '.sums', summed_data)
print('I wrote the modified data to %s' % (fname + '.sums'))
def read_file(fname):
dataset = []
f = open(fname, 'r')
for line in f.readlines():
words = line.split()
x = float(words[0])
y = float(words[1])
dataset.append((x, y))
return dataset
def append_sums(dataset):
sum_y = 0
summed_data = []
for pair in dataset:
sum_y = sum_y + pair[1]
triplet = (pair[0], pair[1], sum_y)
summed_data.append(triplet)
return summed_data
def write_file(fname, dataset):
f = open(fname, 'w')
for triplet in dataset:
f.write('%g %g %g\n' % triplet)
f.close()
main()
```

At this point we are comfortable with basic programming patterns for reading and writing data. This is very important: this kind of file manipulation is one of the big steps in becoming a confident programmer. This is a good time to make the kids familiar with the terminology “input/output” (I/O).

## 2.5. Generating our own data to plot

Now we look at an example of writing a python program to generate some data which we will then plot. Initially this just feels like silly data: it calculates the \(sin\) function for many opints and prints out the values. There is a reason to start with this very simple model: it will allow us (in the more advanced course on Fourier analysis) to give clear cut examples of deeper data anlysis. We will not be lazy: in the more advanced courses we will look at real signals instead of the toy ones.

Start with the `python3`

command line and let us see the few lines
of code that generate \(sin\) wave data:

```
#! /usr/bin/env python3
"""Generate a simpole sine wave"""
import math
for i in range(20):
x = i/10.0
signal = math.sin(x)
print('%g %g' % (x, signal))
```

```
0 0
0.1 0.0998334
0.2 0.198669
0.3 0.29552
0.4 0.389418
0.5 0.479426
0.6 0.564642
0.7 0.644218
0.8 0.717356
0.9 0.783327
1 0.841471
1.1 0.891207
1.2 0.932039
1.3 0.963558
1.4 0.98545
1.5 0.997495
1.6 0.999574
1.7 0.991665
1.8 0.973848
1.9 0.9463
```

Comments on this:

This just prints values to the terminal, so we limited it to some 20 values of \(x\). When we write it to a file for plotting we will do much more.

The classic

`for`

loop generates a sequence of integers, which does not do well at all for plotting: we want to space our \(x\) values much more tightly to get a nice plot, so we divide the integer`i`

by 10.0 to get finely spaced values of \(x\).

Let us now put this into a file called `simplewave-write.py`

```
#! /usr/bin/env python3
import math
def main():
out_fname = 'simplewave-write.dat'
f = open(out_fname, 'w')
for i in range(200):
x = i/10.0
signal = math.sin(x)
f.write('%g %g\n' % (x, signal))
f.close()
print('finished writing file %s' % out_fname)
main()
```

We run the program and do a quick scan of its output with:

```
$ python3 simplewave-write.py
$ ls -l simplewave-write.dat
$ head simplewave-write.dat
$ tail simplewave-write.dat
```

and when we have seen the first and last few lines of the output file we realize we can plot it (after going back to our gnuplot window) with:

```
##CAPTION: Plot the simple sin wave from a file.
set grid
plot 'simplewave-write.dat' using 1:2 with linespoints
```

Describing what this program does is rather straightforward, and it can
be compared to the gnuplot instruction `plot sin(x)`

.

At this point, at the blackboard, I will suggest to the students a couple of edits they “should have already tried on their own”:

See what happens if we don’t divide \(i/100.0\): try both

`x = i`

(for this use`range(7)`

) and some intermediate value`x = i/4.0`

(for this use`range(28)`

). Note the loss of resolution.Use python’s

`sys.argv`

to take the file name as a command line argument.

We then show a cleaner version of this program which adds some comments, makes clear what the \(sin\) period is, and uses some robust proramming paradigms such as using command line arguments.

```
#! /usr/bin/env python3
"""
Generate samples of a simple sin() wave and save them to a
file. The file has two columns of data separated by white
space. To run the program and plot its output try:
$ python3 generate-sin-cleaner.py 'sin_output.dat'
$ gnuplot
gnuplot> plot 'sin_output.dat' using 1:2 with lines
"""
import math
import sys
def main():
out_fname = 'simplewave-write.dat' # default output filename
if len(sys.argv) == 2:
out_fname = sys.argv[1]
elif len(sys.argv) > 2:
print('error: program takes at most one argument')
sys.exit(1)
f = open(out_fname, 'w')
for i in range(700):
x = i/100.0
signal = math.sin(x)
f.write('%g %g\n' % (x, signal))
f.close()
print('finished writing file %s' % out_fname)
main()
```

Note that the comments also tell you how to run a program and visualize the output. This use of comments to explain behavior is an important detail, even for small programs.

In the more advanced course on Fourier analysis we will revisit this simple program and make it generate more complex waveforms.

## 2.6. The broad landscape of plotting software

Now that we have seen some examples, let us talk broadly about plotting software, since you will soon be bombarded with people telling you about their favorites.

By now we know well that in the free software world there is often a dizzying variety of tools to do any task, with fans advocating each approach. Gnuplot is full-featured, stable, actively maintained and ubiquitous so I have chosen it, there are several other valid choices.

There are at least three main categories of plotting tools:

The “just a plotting program” kind,

The “plotting program with some data analysis that grew into a full programming language”,

The “plotting library for a well-established programming language”.

Gnuplot is clearly one of the first, R and Octave the second, Python with Matplotlib and Cern’s Root are examples of the third.

Often it comes down to where a particular scientist did her early
research work: a boss will tell you to “use this tool because it’s what
I use”. I recommend forming a broad knowlege of scientific tools so that
you can use *the most appropriate tool* instead of *the tool that makes
your boss comfortable*. You will often find that astrophysicists often
use Python with matplotlib, particle physicists use Cern’s Root,
biologists use R or Python, social scientists who do much statistical
work use R. You shoud always know the tool your community uses (if it’s
a free software tool), as well as some others which might be more
appropriate.

There are many proprietary plotting packages. I advocate against the use of proprietary software, and it is certainly unacceptable to do science with proprietary tools, but I will mention a couple of packages so that when you come across users you will be able to categorize and compare them and offer an effective free software approach to the same problem.

CricketGraph, often used in the 1990s, was a light-weight plotting program, later supplanted by KaleidaGraph. I would recommend gnuplot as a good way of doing what those programs did.

Matlab and IDL are simple plotting and data analysis programs that grew out of control and added an ad-hoc programming language to the package. The languages were never meant for large software applications, but are often used to write very large programs. It is interesting to note that these programs always start with the stated intention of not requiring a scientist to know how to program, but they end up channeling scientists into using an ill-designed language for large programs.

Matlab and IDL programs can be written in a much cleaner way using Python for the programming parts, and the matplotlib plotting library for graphics.

There is a final outlier in the proprietary data analysis world, which is the “using a spreadsheet to do data analysis” approach, often with the proprietary Excel spreadsheet. There is no saving grace to this approach: apart from technical concerns with the validity of the numerical subroutines, there is also the complete lack of reproducibility of a person moving a mouse around over cells. One of the most embarrassing cases of incorrect analysis was in a much-cited Economics paper about debt rations in European countries [RR10]. The analysis was done with an Excel spreadsheet, and some readers concluded that the authors selected the wrong range of data with a mouse movement. There is no reproducibility when mouse movements are part of the data analysis. The economics article was disgraced because of the faulty analysis as well as other problems with their methodology.

## 2.7. Data formats

In Section 2.2 and Section 2.4 we saw the simplest examples of data files: columns of numbers separated by white space. These are the simplest to work with, and if your files are smaller than about 10 megabytes you should always treat yourself to that simplicity. This format is often called “whitespace separated ascii” or names similar to that.

Often you will find that the columns of data are separated by commas.
This format is called “comma separated values” (csv) and the files
often end with the `.csv`

extension. The format has been around
almost half a century. It has some advantages over the whitespace
separated columns and is used by almost all spreadsheet programs as an
import/export format. In gnuplot you handle this with the instruction
`set datafile separator comma`

as we see in
Section 8.1.

Sometimes files are in a variety of *binary formats*, which you cannot
read directly. We will not deal with these at this time, since we are
not yet working with very big files, but later on we will show how to
convert mp3 files to an ascii format which is easily read by our
programs and by gnuplot.

## 2.8. Simple surface plots

So far we have looked at *line plots*. Let us now look at another type
of plot: the *surface plot*. A surface plot comes from a function of
two variables: \(z = f(x, y)\). In these plots the value of the
function is plotted as the *height* over the x, y plane. Here is an
example:

```
##CAPTION: Simple surface plot.
set grid
set pm3d
set xlabel 'x'
set ylabel 'y'
set samples 50
set isosamples 50
splot exp((-x*x-1.7*y*y)/10.0) with pm3d
```

The plot in Figure 2.8.1 is visually gratifying, and even more so because when you generate it interactively with gnuplot you can grab it with the left mouse button and rotate it to get more insight into what the surface looks like from different angles.

Another way of showing the same information is a *heat
map*. Figure 2.8.2 shows the same function, but this
time the value of the function is represented with *color* instead of
height.

```
##CAPTION: Heat map plot.
set size ratio -1
set view map
set samples 50
set isosamples 50
set xlabel 'x'
set ylabel 'y'
splot exp((-x*x-1.7*y*y)/50.0) with pm3d
```

## 2.9. Topics we have covered

data files

plots

gnuplot

reading features from simple plots

simple surface plots

- 4
In the introductory course I have insets on the board with shell commands,

`emacs`

keybindings, and some Python commands. The`emacs`

keybindings are especially important since the students have not necessarily done the full tutorial.