:tocdepth: 2

.. _power-laws-zipf-benford:

================================
Power laws, Zipf, Benford, ...
================================

Areas: pure math, curiosity, economics, complex systems

[status: partly written]

Motivation, prerequisites, plan
===============================

We can start by realizing that certain relationships between numbers
are "curious".  Our curiosity might be enough to get us going if we
see a couple of examples.  Here are two examples which might make us
curious:

* Longer words in a language occur less often, and there is a clear
  mathematical relationship between the frequency of a word and its
  rank in a frequency table (Zipf's law).

* In numbers that come up in the real world, the most significant
  digit is more often 1 or 2 than 8 or 9 (Benford's law).

There are many areas of nature and of human endeavor in which similar
relationships come up.  These relationships are called *power laws*
(where *power* refers to the exponent applied to a variable).  Many
of them give insight into the underlying mechanisms of the areas of
study in which they appear.

We will start by studying Zipf's law and write some programs that
illustrate it.  This will let us study other data sets that have
similar relationships, some from astronomy, some from geography, and
some from economics.  We will look at those examples, discussing the
insights that come from the power law relationships.

Finally we will dedicate some time to Benford's law and some curious
and amusing historical applications.

A brief refresher on log-log plots
==================================

Let us try plotting a power law function in gnuplot:

.. code-block:: text

   gnuplot> set samples 1000
   gnuplot> plot x**(-1)
   gnuplot> plot [0.01:100] x**(-1)

.. code-block:: text

   gnuplot> set samples 1000
   gnuplot> a = 1
   gnuplot> k = -1.5
   gnuplot> plot [0.01:100] a * x**k
   gnuplot> k = -3.2
   gnuplot> replot a * x**k

Note how little of the plot you can actually make out.  That's
because the scales of the information for x near 0 and for x far out
on the x axis are very different.  You can zoom in with the gnuplot
GUI (right-click and drag a rectangle near the origin), but that
doesn't tell you much more.

We usually take the logarithms of both x and y, and then we can see
the nuances of what's happening in between:

.. code-block:: text

   gnuplot> set logscale xy
   gnuplot> set samples 1000
   gnuplot> a = 1
   gnuplot> k = -1.5
   gnuplot> plot [0.01:100] a * x**k
   gnuplot> k = -3.2
   gnuplot> replot a * x**k

Note that the slope of these "log-log" plots is the exponent
:math:`k`: taking logarithms of :math:`y = a x^k` gives
:math:`\log y = \log a + k \log x`, which is the equation of a
straight line with slope :math:`k`.
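If you prefer to experiment in Python, here is a minimal sketch of
the same demonstration (it assumes you have numpy and matplotlib
installed; the variable and label names are just illustrative):

.. code-block:: python

   #! /usr/bin/env python3
   """Plot power laws a * x**k on linear and on log-log axes."""
   import numpy as np
   import matplotlib.pyplot as plt

   a = 1.0
   x = np.linspace(0.01, 100, 1000)   # 1000 samples, like "set samples 1000"

   fig, (lin_ax, log_ax) = plt.subplots(1, 2, figsize=(10, 4))
   for k in (-1.5, -3.2):
       y = a * x**k
       lin_ax.plot(x, y, label=f"k = {k}")     # linear axes
       log_ax.loglog(x, y, label=f"k = {k}")   # logarithmic x and y axes

   lin_ax.set_title("linear axes")
   log_ax.set_title("log-log axes: straight lines with slope k")
   for ax in (lin_ax, log_ax):
       ax.legend()
   plt.show()

On the right-hand panel the two curves appear as straight lines, and
the steeper one is the one with the more negative exponent.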
Zipf's law
==========

We start with Zipf's law.  In the first half of the 20th century
George Zipf noticed that if you take a large sample of words from a
real-world English text, there is an inverse proportionality between
the frequency of a word and its position (its rank) in the frequency
table (see :numref:`sec-inverse-functions`, where we introduced
inverse proportionality).

This means that a plot of frequency versus rank will look like a
:math:`1/x` plot; or, if you plot both the x and y axes on a
logarithmic scale, you will get a straight line with a slope of -1.

How do we explore this law?  As usual, we write a program!  Start by
typing in the program in :numref:`listing-word-freq-rank-py`.

.. _listing-word-freq-rank-py:

.. literalinclude:: word-freq-rank.py
   :language: python
   :caption: word-freq-rank.py - Analyze frequency versus rank for
             words in a text file.

Example from a paragraph of your own
------------------------------------

Type a simple paragraph of English text into a file; call it
``simple-para.txt``.  Then run the program on that file with:

.. code-block:: console

   $ python3 word-freq-rank.py simple-para.txt

You'll see output showing the histogram rendered in ASCII.  You could
also have run the program with no arguments and typed text directly
into the terminal.

Example from Project Gutenberg
------------------------------

To download a full book from Project Gutenberg and plot its word
frequency distribution with gnuplot, you can use these instructions:

.. literalinclude:: plot-word-freq.gp
   :language: gnuplot
   :caption: Instructions to plot the word frequency distribution of
             Swann's Way, by Marcel Proust.

.. _fig-plot-word-freq:

.. figure:: plot-word-freq.*
   :width: 80%

   Histogram of how often each word appears in the text of Swann's
   Way (volume 1 of Marcel Proust's "In Search of Lost Time",
   formerly translated into English as "Remembrance of Things
   Past"), plotted against the word's rank in the frequency table.

The top plot in :numref:`fig-plot-word-freq` uses the usual linear
scale.  This can be hard to read because you have some big numbers,
and the small ones are hard to tell apart.  That's why the other two
plots use a *logarithmic* scale for the y axis (log scales are
discussed in :numref:`sec-population-data-from-the-web`).

Out of curiosity, let us also look at the first few and last few
lines of the output file.

.. literalinclude:: swanns-way-freq-english.out
   :language: none
   :caption: First few lines (most common words) in Swann's Way.
             The first column is the rank (most frequent to least
             frequent), the second is the word, the third is the
             number of times it occurs.
   :lines: 1-10

.. literalinclude:: swanns-way-freq-english.out
   :language: none
   :caption: Last few lines (least common words) in Swann's Way.
             The first column is the rank (most frequent to least
             frequent), the second is the word, the third is the
             number of times it occurs.
   :lines: 4618-

Now let's compare the English translation of this book with the
original French text:

.. literalinclude:: plot-word-freq-french-english.gp
   :language: gnuplot
   :caption: Instructions to plot the word frequency distribution of
             Swann's Way, by Marcel Proust.  This version compares
             the original French text with the English translation.

.. _fig-plot-word-freq-french-english:

.. figure:: plot-word-freq-french-english.*
   :width: 80%

   Histogram of how often each word appears in the text of Swann's
   Way, comparing the original French text with the English
   translation.  Note that the slope of the main part of the curve
   is the same in both languages.

Explanations of Zipf's law
--------------------------

TODO

What are "power laws"?
======================

If you have a relationship where the number of items with a given
measure is proportional to that measure raised to a negative power,
then we say that we have *power law* behavior.

In the case of Zipf's law, if :math:`r` is the rank of a word in the
frequency table and :math:`f(r)` is the number of times that word
occurs in our text, we have:

.. math::

   f(r) = {\rm const} \times r^{-1}

More generally we will have:

.. math::

   f(x) = {\rm const} \times x^{-\alpha}

where :math:`\alpha` is the "power" in "power law".  In Zipf's law
the power is :math:`\alpha = 1`.

Calculating the power from the data
-----------------------------------

How do you calculate the power :math:`\alpha` if you are given a
collection of data points?  In
:numref:`chap-fitting-functions-to-data` we learned some techniques
to fit functions to data.  How would we carry out a fit to the data
that brings out the power in the power law?  One approach, shown in
the sketch below, is to take logarithms of both sides, so that
:math:`\log f = \log({\rm const}) - \alpha \log x`, and then fit a
straight line: the slope of the fitted line is :math:`-\alpha`.
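Here is a minimal sketch of that approach, using only numpy.  The
data in it is synthetic, generated from a power law with a known
exponent so that we can check that the fit recovers it; with real
data you would instead load your (rank, frequency) pairs from a file
such as ``swanns-way-freq-english.out``:

.. code-block:: python

   #! /usr/bin/env python3
   """Estimate the exponent alpha of a power law f(x) = c * x**(-alpha)
   by fitting a straight line to the points (log x, log f)."""
   import numpy as np

   rng = np.random.default_rng(42)
   true_alpha = 1.7
   x = np.arange(1, 200, dtype=float)
   # synthetic power-law data with a bit of multiplicative noise
   f = 1000.0 * x**(-true_alpha) * rng.lognormal(0.0, 0.05, len(x))

   # log f = log(c) - alpha * log(x), so a degree-1 fit gives us alpha
   slope, intercept = np.polyfit(np.log(x), np.log(f), 1)
   print(f"estimated alpha: {-slope:.3f}   (true value: {true_alpha})")
   print(f"estimated const: {np.exp(intercept):.1f}")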
Zipf's law was just one rather notable example of a power law, but
there are many others.  We will discuss a few more and write programs
that download the relevant data and analyze it.

* population-rank distribution for nations::

     wget https://raw.githubusercontent.com/datasets/population/master/data/population.csv

  or, easier::

     wget https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2119.txt

* rank-size distribution for nations

* city size

* deadly conflicts

* luminosity in astronomy

Deadly conflicts
================

https://github.com/zacharykitt/analyses/blob/master/deadly_quarrels.py

Benford's law
=============

Physical constants
------------------

.. code-block:: console

   $ wget https://physics.nist.gov/cuu/Constants/Table/allascii.txt
   $ cat allascii.txt | cut -c 61 | grep '[0-9]' > first-digits.dat

.. code-block:: text

   gnuplot> plot 'first-digits.dat'

This downloads an ASCII table of physical constants (we have talked
about ASCII before: remember that it is basically plain text) and
extracts the first significant digit of each one.  Now we want to
look at the distribution of those first digits.  Which digit appears
most often?

In :numref:`sec-intermediate-plotting-histograms` we learned how to
make and plot histograms.  The most effective way to make and plot
histograms might be the snippet of code that uses numpy and
matplotlib in :numref:`sec-a-histogram-snippet`.  Let's use that to
write a small program that analyzes the frequency of these first
digits:

.. literalinclude:: first-digit-hist.py
   :language: python
   :caption: first-digit-hist.py - Plot the frequency with which
             digits appear in the first position in a list of
             numbers.

Go ahead and enter the program ``first-digit-hist.py``, and then run
it with:

.. code-block:: console

   ## to get an interactive plot:
   $ ./first-digit-hist.py first-digits.dat
   ## to put the output into a file:
   $ ./first-digit-hist.py first-digits.dat first-digit-freq.png
   ## or you can replace .png with .pdf or .svg

The result of these runs should be a plot a bit like the one in
:numref:`fig-first-digit-freq`: you will see that 1 occurs much more
frequently, followed by 2 and then 3.  Once you get beyond 4 you
would have to collect a lot of data to get a clear result, but you
would then find that 4 occurs more often than 5, which occurs more
often than 6, and so on.

.. _fig-first-digit-freq:

.. figure:: first-digit-freq.*
   :width: 60%

   Histogram of how many times the various digits appear as the
   *first* digit in a number.  The numbers we use are the physical
   constants in the National Institute of Standards and Technology
   list.

As another example of how one can use shell pipelines to look at a
data set, we can extract the first digits of the physical constants
defined in the GNU Scientific Library sources, which can be found on
github:

.. code-block:: console

   $ wget https://raw.githubusercontent.com/ampl/gsl/master/const/gsl_const_mks.h
   $ wget http://git.savannah.gnu.org/cgit/gsl.git/plain/const/gsl_const_cgs.h
   $ cat gsl_const_mks.h | grep 'define GSL_' | awk '{print $3}' | grep -v '(1e' | cut -c 2 > first-digits-gsl-mks.dat
   $ cat gsl_const_cgs.h | grep 'define GSL_' | awk '{print $3}' | grep -v '(1e' | cut -c 2 > first-digits-gsl-cgs.dat
   $ ./first-digit-hist.py first-digits-gsl-cgs.dat &
   $ ./first-digit-hist.py first-digits-gsl-mks.dat &
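How strong should the effect be?  Benford's law predicts that the
leading digit :math:`d` occurs with probability

.. math::

   P(d) = \log_{10} \left( 1 + \frac{1}{d} \right)

so about 30.1% of the numbers should start with 1, but only about
4.6% should start with 9.  Here is a minimal sketch that compares
the observed frequencies with this prediction (it assumes a file
with one digit per line, like the ``first-digits*.dat`` files
produced above):

.. code-block:: python

   #! /usr/bin/env python3
   """Compare observed first-digit frequencies with Benford's law,
   which predicts that digit d leads with probability log10(1 + 1/d)."""
   import sys
   from collections import Counter
   from math import log10

   fname = sys.argv[1] if len(sys.argv) > 1 else "first-digits.dat"
   digits = [int(line.strip()) for line in open(fname) if line.strip()]
   counts = Counter(digits)
   total = len(digits)

   print("digit   observed   Benford prediction")
   for d in range(1, 10):
       print(f"{d}       {counts[d] / total:.3f}      {log10(1 + 1/d):.3f}")

Try running it on ``first-digits.dat`` and on the two GSL files.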
Stock quotes
------------

Another source of numbers to which Benford's law might apply is
stock quotes.  Let us download the list of companies quoted on the
NASDAQ exchange and look at the first digits of their share prices
and market capitalizations:

.. code-block:: console

   $ wget --output-document nasdaq-list.csv 'http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download'

.. literalinclude:: first-digit-nasdaq.py
   :language: python
   :caption: first-digit-nasdaq.py - Plot the frequency with which
             digits appear in the first position of NASDAQ stock
             market listings.

.. _fig-first-digit-nasdaq:

.. figure:: first-digit-nasdaq_sale.*
   :width: 60%

.. figure:: first-digit-nasdaq_cap.*
   :width: 60%

   Histogram of how many times the various digits appear as the
   *first* digit in a number.  The numbers we use here are the stock
   quotes and market capitalizations in the NASDAQ stock exchange.

Further reading
---------------

* https://physics.nist.gov/cuu/Constants/Table/allascii.txt

* https://en.wikipedia.org/wiki/Benford%27s_law

* https://en.wikipedia.org/wiki/Zipf%27s_law

* stats on word length:

  * /usr/share/dict/words
  * https://en.wikipedia.org/wiki/Brown_Corpus
  * http://www.nltk.org/nltk_data/
  * a random book from Project Gutenberg

* stats on individual income or wealth

* War and conflict datasets: https://ucdp.uu.se/downloads/

* Correlates of War study: https://en.wikipedia.org/wiki/Correlates_of_War

* Correlates of War data: http://correlatesofwar.org/data-sets/COW-war

* Richardson's original data set:
  https://vincentarelbundock.github.io/Rdatasets/doc/HistData/Quarrels.html

* Detailed paper with many war diagrams:
  https://arxiv.org/pdf/1901.05086.pdf and
  https://pdfs.semanticscholar.org/27cd/75962a447881768698ac1402100938d5f790.pdf

Pareto's principle
==================

Further reading
---------------

* https://www.theguardian.com/commentisfree/2011/nov/11/occupy-movement-wealth-power-law-distribution

Olbers' paradox
===============