:tocdepth: 2

.. _power-laws-zipf-benford:

================================
Power laws, Zipf, Benford, ...
================================

Areas: pure math, curiosity, economics, complex systems

[status: partly written]

Motivation, prerequisites, plan
===============================

We can start by realizing that certain relationships between numbers
are "curious".  Our curiosity might be enough to get us going if we
see a couple of examples.  Here are two examples which might make us
curious:

* Longer words in a language occur less often, and there is a clear
  mathematical relationship between the frequency of a word and its
  rank in a frequency table (Zipf's law).

* In numbers that come up in the real world, the most significant
  digit is more often 1 or 2 than 8 or 9 (Benford's law).

There are many areas of nature and of human endeavor in which similar
relationships come up.  These relationships are called *power laws*
(where *power* refers to the exponent applied to a variable).  Many
of them give insight into the underlying mechanisms of the areas of
study in which they appear.

We will start by studying Zipf's law and write some programs that
illustrate it.  This will let us study other data sets that have
similar relationships, some from astronomy, some from geography, and
some from economics.  We will look at those examples, discussing the
insights that come from the power law relationships.

Finally we will dedicate some time to Benford's law and some curious
and amusing historical applications.

A brief refresher on log-log plots
==================================

Let us try plotting a power law function in gnuplot:

.. code-block:: text

   gnuplot> set samples 1000
   gnuplot> plot x**(-1)
   gnuplot> plot [0.01:100] x**(-1)

.. code-block:: text

   gnuplot> set samples 1000
   gnuplot> a = 1
   gnuplot> k = -1.5
   gnuplot> plot [0.01:100] a * x**k
   gnuplot> k = -3.2
   gnuplot> replot a * x**k

Note how little of the plot you can actually make out.  That's
because the scales of the information for x near 0 and for x far out
on the x axis are very different.  You can zoom in with the gnuplot
GUI (right-click and drag a rectangle near the origin), but that
doesn't tell you much more.

We usually take the logarithms of both x and y, and then we can see
the nuances of what's happening in between:

.. code-block:: text

   gnuplot> set logscale xy
   gnuplot> set samples 1000
   gnuplot> a = 1
   gnuplot> k = -1.5
   gnuplot> plot [0.01:100] a * x**k
   gnuplot> k = -3.2
   gnuplot> replot a * x**k

Note that the slope of these "log-log" plots is the exponent
:math:`k`: taking logarithms of :math:`y = a x^k` gives
:math:`\log y = \log a + k \log x`, which is the equation of a
straight line with slope :math:`k`.
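If you prefer to experiment in Python, here is a minimal sketch of
the same demonstration (it assumes you have numpy and matplotlib
installed; the variable and label names are just illustrative):

.. code-block:: python

   #! /usr/bin/env python3
   """Plot power laws a * x**k on linear and on log-log axes."""
   import numpy as np
   import matplotlib.pyplot as plt

   a = 1.0
   x = np.linspace(0.01, 100, 1000)   # 1000 samples, like "set samples 1000"

   fig, (lin_ax, log_ax) = plt.subplots(1, 2, figsize=(10, 4))
   for k in (-1.5, -3.2):
       y = a * x**k
       lin_ax.plot(x, y, label=f"k = {k}")     # linear axes
       log_ax.loglog(x, y, label=f"k = {k}")   # logarithmic x and y axes

   lin_ax.set_title("linear axes")
   log_ax.set_title("log-log axes: straight lines with slope k")
   for ax in (lin_ax, log_ax):
       ax.legend()
   plt.show()

On the right-hand panel the two curves appear as straight lines, and
the steeper one is the one with the more negative exponent.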
Zipf's law
==========

We start with Zipf's law.  In the first half of the 20th century
George Zipf noticed that if you take a large sample of words from a
real-world English text, there is an inverse proportionality between
the frequency of a word and its position (its rank) in the frequency
table (see :numref:`sec-inverse-functions`, where we introduced
inverse proportionality).

This means that a plot of frequency versus rank will look like a
:math:`1/x` plot; or, if you plot both the x and y axes on a
logarithmic scale, you will get a straight line with a slope of -1.

How do we explore this law?  As usual, we write a program!  Start by
typing in the program in :numref:`listing-word-freq-rank-py`.

.. _listing-word-freq-rank-py:

.. literalinclude:: word-freq-rank.py
   :language: python
   :caption: word-freq-rank.py - Analyze frequency versus rank for
             words in a text file.

Example from a paragraph of your own
------------------------------------

Type a simple paragraph of English text into a file; call it
``simple-para.txt``.  Then run the program on that file with:

.. code-block:: console

   $ python3 word-freq-rank.py simple-para.txt

You'll see output showing the histogram rendered in ASCII.  You could
also have run the program with no arguments and typed text directly
into the terminal.

Example from Project Gutenberg
------------------------------

To download a full book from Project Gutenberg and plot its word
frequency distribution with gnuplot, you can use these instructions:

.. literalinclude:: plot-word-freq.gp
   :language: gnuplot
   :caption: Instructions to plot the word frequency distribution of
             Swann's Way, by Marcel Proust.

.. _fig-plot-word-freq:

.. figure:: plot-word-freq.*
   :width: 80%

   Histogram of how often each word appears in the text of Swann's
   Way (volume 1 of Marcel Proust's "In Search of Lost Time",
   formerly translated into English as "Remembrance of Things
   Past"), plotted against the word's rank in the frequency table.

The top plot in :numref:`fig-plot-word-freq` uses the usual linear
scale.  This can be hard to read because you have some big numbers,
and the small ones are hard to tell apart.  That's why the other two
plots use a *logarithmic* scale for the y axis (log scales are
discussed in :numref:`sec-population-data-from-the-web`).

Out of curiosity, let us also look at the first few and last few
lines of the output file.

.. literalinclude:: swanns-way-freq-english.out
   :language: none
   :caption: First few lines (most common words) in Swann's Way.
             The first column is the rank (most frequent to least
             frequent), the second is the word, the third is the
             number of times it occurs.
   :lines: 1-10

.. literalinclude:: swanns-way-freq-english.out
   :language: none
   :caption: Last few lines (least common words) in Swann's Way.
             The first column is the rank (most frequent to least
             frequent), the second is the word, the third is the
             number of times it occurs.
   :lines: 4618-

Now let's compare the English translation of this book with the
original French text:

.. literalinclude:: plot-word-freq-french-english.gp
   :language: gnuplot
   :caption: Instructions to plot the word frequency distribution of
             Swann's Way, by Marcel Proust.  This version compares
             the original French text with the English translation.

.. _fig-plot-word-freq-french-english:

.. figure:: plot-word-freq-french-english.*
   :width: 80%

   Histogram of how often each word appears in the text of Swann's
   Way, comparing the original French text with the English
   translation.  Note that the slope of the main part of the curve
   is the same in both languages.

Explanations of Zipf's law
--------------------------

TODO

What are "power laws"?
======================

If you have a relationship where the number of items with a given
measure is proportional to that measure raised to a negative power,
then we say that we have *power law* behavior.

In the case of Zipf's law, if :math:`r` is the rank of a word in the
frequency table and :math:`f(r)` is the number of times that word
occurs in our text, we have:

.. math::

   f(r) = {\rm const} \times r^{-1}

More generally we will have:

.. math::

   f(x) = {\rm const} \times x^{-\alpha}

where :math:`\alpha` is the "power" in "power law".  In Zipf's law
the power is :math:`\alpha = 1`.

Calculating the power from the data
-----------------------------------

How do you calculate the power :math:`\alpha` if you are given a
collection of data points?  In
:numref:`chap-fitting-functions-to-data` we learned some techniques
to fit functions to data.  How would we carry out a fit to the data
that brings out the power in the power law?  One approach, shown in
the sketch below, is to take logarithms of both sides, so that
:math:`\log f = \log({\rm const}) - \alpha \log x`, and then fit a
straight line: the slope of the fitted line is :math:`-\alpha`.
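Here is a minimal sketch of that approach, using only numpy.  The
data in it is synthetic, generated from a power law with a known
exponent so that we can check that the fit recovers it; with real
data you would instead load your (rank, frequency) pairs from a file
such as ``swanns-way-freq-english.out``:

.. code-block:: python

   #! /usr/bin/env python3
   """Estimate the exponent alpha of a power law f(x) = c * x**(-alpha)
   by fitting a straight line to the points (log x, log f)."""
   import numpy as np

   rng = np.random.default_rng(42)
   true_alpha = 1.7
   x = np.arange(1, 200, dtype=float)
   # synthetic power-law data with a bit of multiplicative noise
   f = 1000.0 * x**(-true_alpha) * rng.lognormal(0.0, 0.05, len(x))

   # log f = log(c) - alpha * log(x), so a degree-1 fit gives us alpha
   slope, intercept = np.polyfit(np.log(x), np.log(f), 1)
   print(f"estimated alpha: {-slope:.3f}   (true value: {true_alpha})")
   print(f"estimated const: {np.exp(intercept):.1f}")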
Zipf's law was just one rather notable example of a power law, but
there are many others.  We will discuss a few more and write programs
that download the relevant data and analyze it.

* population-rank distribution for nations::

     wget https://raw.githubusercontent.com/datasets/population/master/data/population.csv

  or, easier::

     wget https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2119.txt

* rank-size distribution for nations

* city size

* deadly conflicts

* luminosity in astronomy

Deadly conflicts
================

https://github.com/zacharykitt/analyses/blob/master/deadly_quarrels.py

Benford's law
=============

Physical constants
------------------

.. code-block:: console

   $ wget https://physics.nist.gov/cuu/Constants/Table/allascii.txt
   $ cat allascii.txt | cut -c 61 | grep '[0-9]' > first-digits.dat

.. code-block:: text

   gnuplot> plot 'first-digits.dat'

This downloads an ASCII table of physical constants (we have talked
about ASCII before: remember that it is basically plain text) and
extracts the first significant digit of each one.  Now we want to
look at the distribution of those first digits.  Which digit appears
most often?

In :numref:`sec-intermediate-plotting-histograms` we learned how to
make and plot histograms.  The most effective way to make and plot
histograms might be the snippet of code that uses numpy and
matplotlib in :numref:`sec-a-histogram-snippet`.  Let's use that to
write a small program that analyzes the frequency of these first
digits:

.. literalinclude:: first-digit-hist.py
   :language: python
   :caption: first-digit-hist.py - Plot the frequency with which
             digits appear in the first position in a list of
             numbers.

Go ahead and enter the program ``first-digit-hist.py``, and then run
it with:

.. code-block:: console

   ## to get an interactive plot:
   $ ./first-digit-hist.py first-digits.dat
   ## to put the output into a file:
   $ ./first-digit-hist.py first-digits.dat first-digit-freq.png
   ## or you can replace .png with .pdf or .svg

The result of these runs should be a plot a bit like the one in
:numref:`fig-first-digit-freq`: you will see that 1 occurs much more
frequently, followed by 2 and then 3.  Once you get beyond 4 you
would have to collect a lot of data to get a clear result, but you
would then find that 4 occurs more often than 5, which occurs more
often than 6, and so on.

.. _fig-first-digit-freq:

.. figure:: first-digit-freq.*
   :width: 60%

   Histogram of how many times the various digits appear as the
   *first* digit in a number.  The numbers we use are the physical
   constants in the National Institute of Standards and Technology
   list.

As another example of how one can use shell pipelines to look at a
data set, we can extract the first digits of the physical constants
defined in the GNU Scientific Library sources, which can be found on
github:

.. code-block:: console

   $ wget https://raw.githubusercontent.com/ampl/gsl/master/const/gsl_const_mks.h
   $ wget http://git.savannah.gnu.org/cgit/gsl.git/plain/const/gsl_const_cgs.h
   $ cat gsl_const_mks.h | grep 'define GSL_' | awk '{print $3}' | grep -v '(1e' | cut -c 2 > first-digits-gsl-mks.dat
   $ cat gsl_const_cgs.h | grep 'define GSL_' | awk '{print $3}' | grep -v '(1e' | cut -c 2 > first-digits-gsl-cgs.dat
   $ ./first-digit-hist.py first-digits-gsl-cgs.dat &
   $ ./first-digit-hist.py first-digits-gsl-mks.dat &
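How strong should the effect be?  Benford's law predicts that the
leading digit :math:`d` occurs with probability

.. math::

   P(d) = \log_{10} \left( 1 + \frac{1}{d} \right)

so about 30.1% of the numbers should start with 1, but only about
4.6% should start with 9.  Here is a minimal sketch that compares
the observed frequencies with this prediction (it assumes a file
with one digit per line, like the ``first-digits*.dat`` files
produced above):

.. code-block:: python

   #! /usr/bin/env python3
   """Compare observed first-digit frequencies with Benford's law,
   which predicts that digit d leads with probability log10(1 + 1/d)."""
   import sys
   from collections import Counter
   from math import log10

   fname = sys.argv[1] if len(sys.argv) > 1 else "first-digits.dat"
   digits = [int(line.strip()) for line in open(fname) if line.strip()]
   counts = Counter(digits)
   total = len(digits)

   print("digit   observed   Benford prediction")
   for d in range(1, 10):
       print(f"{d}       {counts[d] / total:.3f}      {log10(1 + 1/d):.3f}")

Try running it on ``first-digits.dat`` and on the two GSL files.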
Stock quotes
------------

Another source of numbers to which Benford's law might apply is
stock quotes.  Let us download the list of companies quoted on the
NASDAQ exchange and look at the first digits of their share prices
and market capitalizations:

.. code-block:: console

   $ wget --output-document nasdaq-list.csv 'http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download'

.. literalinclude:: first-digit-nasdaq.py
   :language: python
   :caption: first-digit-nasdaq.py - Plot the frequency with which
             digits appear in the first position of NASDAQ stock
             market listings.

.. _fig-first-digit-nasdaq:

.. figure:: first-digit-nasdaq_sale.*
   :width: 60%

.. figure:: first-digit-nasdaq_cap.*
   :width: 60%

   Histogram of how many times the various digits appear as the
   *first* digit in a number.  The numbers we use here are the stock
   quotes and market capitalizations in the NASDAQ stock exchange.

Further reading
---------------

* https://physics.nist.gov/cuu/Constants/Table/allascii.txt

* https://en.wikipedia.org/wiki/Benford%27s_law

* https://en.wikipedia.org/wiki/Zipf%27s_law

* stats on word length:

  * /usr/share/dict/words
  * https://en.wikipedia.org/wiki/Brown_Corpus
  * http://www.nltk.org/nltk_data/
  * a random book from Project Gutenberg

* stats on individual income or wealth

* War and conflict datasets: https://ucdp.uu.se/downloads/

* Correlates of War study: https://en.wikipedia.org/wiki/Correlates_of_War

* Correlates of War data: http://correlatesofwar.org/data-sets/COW-war

* Richardson's original data set:
  https://vincentarelbundock.github.io/Rdatasets/doc/HistData/Quarrels.html

* Detailed paper with many war diagrams:
  https://arxiv.org/pdf/1901.05086.pdf and
  https://pdfs.semanticscholar.org/27cd/75962a447881768698ac1402100938d5f790.pdf

Pareto's principle
==================

Further reading
---------------

* https://www.theguardian.com/commentisfree/2011/nov/11/occupy-movement-wealth-power-law-distribution

Olbers' paradox
===============