******************************* Fourier analysis on real data ******************************* In the previous chapter we learned about Fourier Analysis and showed how to take the Fourier transform of some known functions. Here we will show how to apply the Fourier transform to "data", for example to data coming from experiments. We will start with some artificial data, and then look at the output from an experiment. Discrete data versus functions ============================== Until now we have discussed taking the Fourier transform of *a function*: a signal :math:`s(t)` that is either periodic or is defined on a finite interval so that you can just duplicate it along the :math:`t` axis and make it become periodic. But most of our applications will be with *data*, typically *time series data*. At this point the instructor should discuss what discrete data looks like. This discussion of discrete data should not try to address every possible issue --- it is enough to show that you have the correspondence of: .. math:: \textrm{continuous:} \; & s(t) & & \\ \textrm{discrete:} \; & \{s_k\} & & \\ \textrm{continuous:} \; & a_k & = \; & \frac{1}{T} \int_0^T s(t) \cos(\frac{2\pi}{T} t) dt \\ \textrm{discrete:} \; & a_k & = \; & \frac{1}{T} \sum_{n=0}^{N-1} s_k \cos(2\pi \frac{k}{N} n) dt What's hidden in noisy data? ============================ First a discussion of what noise is. The hiss of a tape, or imperfections in an image (like blur) are examples. It is useful to bring up an image of noise, and the Wikipedia article on white noise has one: https://en.wikipedia.org/wiki/White_noise At this point I zoom in on the waveform of Gaussian the white noise signal to demonstrate what it looks like. If you add this noise to a meaningful signal then the signal might get lost in the noise. But Fourier analysis can come to the rescue: we can think of the spectrum (how strong each frequency is in a signal) as a sort of "character" of the signal, a deep description which might not be obvious on the surface, but it is there. Here we will do some experiments to demonstrate this idea, and then we will use this idea to devise a marvel of data analysis: *signal filtering*. We start by building up some contrived signals and nibble around at these contrived signals to get an intuition for how high frequency :math:`sin()` waves are suggestive of noise. Contrived signals with 2 or 3 periods ------------------------------------- Start with this plot: .. rubric:: plotting in desmos :: sin(x) + 0.3 sin(365 x) .. rubric:: plotting in gnuplot :: $ gnuplot # at the gnuplot> prompt type: set samples 1000 plot sin(x) + 0.3*sin(365*x) and zoom in a lot. What initially looked like noise is now revealed to be a *high frequency* sin wave added to the original. Your zooming might suggest to you that the high frequency :math:`\sin(365 x)` term added *spikes* to the original signal. The spikes were all the same height, so it is clearly not noise, but it is *suggestive* of the fact that noise might come from high frequency portions of the signal. Now plot a simple superposition of three sin waves with period :math:`T = 5`: .. math:: s(t) = \; &\sin(\frac{2 \pi}{T} t) \\ s(t) = \; &\sin(\frac{2 \pi}{5} x) - 0.4 \sin(\frac{2 \pi}{5} 2 x) - 0.3 \sin(\frac{2 \pi}{5} 3 x) :: sin((2 \pi / 5) x) - 0.4 sin((2 \pi / 5) 2 x) - 0.3 sin((2 \pi / 5) 3 x) Spend some time looking at this plot and try to spot the 3 periods embedded in there. Filtering those contrived signals with 2 or 3 periods ----------------------------------------------------- Then we will download the program :download:`code/filter_toy.py` and look at how it works. Then we will run it with: :: python3 filter_toy.py to see an output like that in :numref:`fig-filter_toy`. .. _fig-filter_toy: .. figure:: filter_toy.png :width: 60% An example of filtering for a very simple signal: the sum of 3 sin waves. The first filter zeros out all the *high frequency* fourier coefficients (low pass filter). The second one zeroes out all the *low frequency* fourier coefficients (high pass filter). An artificial noisy signal -------------------------- Now we will download the program :download:`code/filter_random.py` and look at how it works. Once the program is downloaded we change the noise strength Then we will run it with: :: python3 filter_random.py to see an output like that in :numref:`fig-filter_random`. .. _fig-filter_random: .. figure:: filter_random.png :width: 60% An example of filtering for a noisy signal: a sin() wave with random noise added to it. The first filter zeros out all the *high frequency* fourier coefficients (low pass filter). The second one zeroes out all the *low frequency* fourier coefficients (high pass filter). The instructor can now experiment changing the ``0.4*np.random.randn(...)`` to be ``3.4*`` -- this will dwarf the signal with noise, but we will still find an approximate sin wave. Another interest change is to apply the lowpass filter to the highpass filter result (instead of to the original signal). This will give a *bandpass* filter: a narrow band around the fundamental frequency. All these experiments can be done live in class. Experimental double-period noisy data ------------------------------------- Let us explore the data from a weather station. We will pick the Las Cruces weather station from the climate research network coordinated by NOAA (National Oceanic and Atmospheric Administration). We will find that this data has *both* the features we saw above: it has an underlying doubly periodic behavior, and it also has noise. It also has plenty more information, related to weather (and, over the course of many years) climate change, but we will focus on finding the two periodic signals, and on the noise. How do you explore the NOAA data sets? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If you just go to https://www.ncei.noaa.gov/ you will find a lot of visually pleasing stuff, but not immediate access to raw data. Instead try to start at: https://www.ncei.noaa.gov/pub/ Then click on data, then on uscrn (US Climate Research Network), and on products, and you will end up here: https://www.ncei.noaa.gov/pub/data/uscrn/products/ Now we want the "subhourly01" subdirectory (data recorded every 5 minutes): https://www.ncei.noaa.gov/pub/data/uscrn/products/subhourly01/ We spend a bit of time looking at the files "headers.txt" and "readme.txt" to get a feeling for what the columns of data mean. Once we know that column 4 (starting at 1) is local date, column 5 is local time, and colum 9 is air temperature, we are ready to look at, let's say, the 2014 data for Las Cruces. Getting the Las Cruces data ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Start with: :: wget https://www.ncei.noaa.gov/pub/data/uscrn/products/subhourly01/2014/CRNS0101-05-2014-NM_Las_Cruces_20_N.txt Then plot the 9th column in gnuplot with: :: $ gnuplot # then at the gnuplot> promt type: plot 'CRNS0101-05-2014-NM_Las_Cruces_20_N.txt' using 9 with lines We immediately see two undesirable spikes that come down to almost -10000. We spend some time looking at the file, we puzzle over it, and then realize that: .. caution:: Data often needs cleaning. In this case you see that an old (and poor) programmer habit crept into the pipeline of programs that generated this data: when samples were missing they replaced them with ``-9999.0``. The files are also a jumble of numbers, so to understand what is in those files we visit https://www.ncei.noaa.gov/pub/data/uscrn/products/subhourly01/headers.txt and https://www.ncei.noaa.gov/pub/data/uscrn/products/subhourly01/readme.txt to see what columns mean what. We conclude that we have data spaced out by 5 minutes, and the 9th column is a temperature reading, and that we need to eliminate all lines that have -9999.0 in them. Remembering that ``grep -v`` *excludes* all lines that match an expression, we can do: .. code-block:: console $ grep -v 9999 CRNS0101-05-2014-NM_Las_Cruces_20_N.txt > Las_Cruces_filtered.txt $ gnuplot # then at the gnuplot> prompt type: plot 'Las_Cruces_filtered.txt' using 9 with lines Note that you can also make a .csv file with: .. code-block:: console $ cat Las_Cruces_filtered.txt | sed 's/ */, /g' > Las_Cruces_filtered.csv We might want to look at more than one year of data. We can do this with: :: for year in 2014 2015 2016 do wget https://www.ncei.noaa.gov/pub/data/uscrn/products/subhourly01/${year}/CRNS0101-05-${year}-NM_Las_Cruces_20_N.txt done cat CRNS*Las_Cruces*.txt | grep -v -e -9999.0 > temperatures.dat and now we have several years to plot: .. code-block:: console $ gnuplot $ # then at the gnuplot> prompt type: plot 'temperatures.dat' using 9 with lines Now let us download the program :download:`code/temperature_fft.py` that analyzes this data with the fourier transform, and run it: .. code-block:: console $ python3 temperature_fft.py .. caution:: These are big data files: 5 minute intervals for a year make for more than 100 thousand samples per year. Three years of data will require the Fourier transform of a signal with more than 300000 samples, which might strain the CPU and/or memory of your computer. This might be even more of a problem if you use an web-based python interpreter. A breather and some terminology =============================== We have just gone through three examples of using the Fourier transform on real data. This was a lot of new material, so let us take a moment and discuss what we have seen. First some terminology that accompanies these new ideas: Continuous Fourier Transform The coefficients of the sin and cos functions at various frequencies which allow us to reproduce a periodic function of *continuous* time. We discussed this in :numref:`chap-fourier-series-the-bones-of-a-function` Discrete Fourier Transform (DFT) A form of Fourier analysis that is applicable to a sequence of values. Instead of a function :math:`f(t)` with :math:`t` being a real number, we have a function given by pairs :math:`{t_k, f_k}`. The :math:`t_k` values are discrete time samples, and :math:`f_k` are the samples of the function at those time points. The DFT is clearly what we will be using to examine experimental data. Fast Fourier Transform (FFT) An algorithm to calculate the DFT in :math:`O(N\log N)` time instead of :math:`O(N^2)`. This is also known as the Cooley Tuckey algorithm. So is it DFT or FFT? The FFT is used almost universally to calculate the DFT, so FFT is sometimes used to mean the specific algorithm used, and sometimes just as a placeholder for DFT. Signal processing A field of electrical engineering focused on analyzing, modifying, and synthesizing signals. Signals can refer to sound, images, and scientific measurements. Noise Unwanted distortions that a signal can suffer during propagation, transmission, and processing. Often the specific origin of the noise cannot be picked out, but it can be modeled in various ways that help with finding a signal that might otherwise be obscured. White noise A form of noise that gets added to a signal and that has equal intensity at different frequencies. In acoustics it sounds like a hissing sound.