14. Power laws, Zipf, Benford, …

Areas: pure math, curiosity, economics, complex systems

[status: partly written]

14.1. Motivation, prerequisites, plan

We can start by noticing that certain relationships between numbers are “curious”, and that curiosity might be enough to get us going once we see a couple of examples. Here are two examples that might make us curious:

  • Common words in a language occur much more often than rare ones, and there is a clear mathematical relationship between the frequency of a word and its rank in a frequency table (Zipf’s law).

  • In numbers that come up in the real world, the most significant digit is more often 1 or 2 than 8 or 9 (Benford’s law).

There are many areas of nature and of human endeavor in which similar relationships come up. These relationships are called power laws (where “power” refers to the exponent applied to a variable). Many of them give insight into the underlying mechanisms of those areas of study.

We will start by studying Zipf’s law and write some programs that illustrate it. This will let us study other data sets that have similar relationships, some from astronomy, some from geography, some from economics. We will look at those examples, discussing the insights that come from the power law relationships. Finally we will dedicate some time to Benford’s law and some curious and amusing historical applications.

14.2. A brief refresher on log-log plots

Let us try plotting a power law function in gnuplot:

gnuplot> set samples 1000
gnuplot> plot x**(-1)
gnuplot> plot [0.01:100] x**(-1)
gnuplot> a = 1
gnuplot> k = -1.5
gnuplot> plot [0.01:100] a * x**k
gnuplot> k = -3.2
gnuplot> replot

Note how little of the function you can actually see in this plot. That’s because the scale of the function for x near 0 and for x far out on the x axis is very different. You can zoom in with the gnuplot GUI (right-click and drag a rectangle near the origin), but that doesn’t tell you much more.

We usually take the logarithms of both x and y and then we can see the nuances of what’s happening in between:

gnuplot> set logscale xy
gnuplot> set samples 1000
gnuplot> a = 1
gnuplot> k = -1.5
gnuplot> plot [0.01:100] a * x**k
gnuplot> k = -3.2
gnuplot> replot

Note that the slope of these “log-log” plots is the exponent \(k\).
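
To see why, take the logarithm of both sides of \(y = a x^k\):

\[\log y = \log a + k \log x\]

so on axes that show \(\log y\) versus \(\log x\) the graph is a straight line with slope \(k\) and intercept \(\log a\).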

14.3. Zipf’s law

We start with Zipf’s law. In the first half of the 20th century George Zipf noticed that if you have a large sample of words in a real-world English language text, there is an inverse proportionality between the frequency of a word and its position in the frequency table (see Section 4.5 where we introduced inverse proportionality).

This means that a plot of frequency versus rank will look like a \(1/x\) plot; if you plot both the x and y axes on a logarithmic scale, you will get a straight line with a slope of -1.

How do we explore this law? As usual, we write a program! Start by typing in the program in Listing 14.3.1.

Listing 14.3.1 word-freq-rank.py - Analyze frequency versus rank for words in a text file.
#! /usr/bin/env python3

"""
Reads all the words in a file and prints information about the
rank and frequency of occurrence of words in the file.

The file should be a rather long file with a typical sampling of
words.  The ideal file would be a book downloaded from Project
Gutenberg in ascii text format.
"""

import sys
import re

def main():
    if len(sys.argv) == 1:      # handle command-line arguments
        fname = '(stdin)'
        f = sys.stdin
    elif len(sys.argv) == 2:
        fname = sys.argv[1]
        f = open(fname, 'r')
    else:
        sys.stderr.write('error - usage: %s [filename]\n' % sys.argv[0])
        sys.exit(1)

    sorted_words, word_freq_map = read_words_from_file(f)
    f.close()
    print('##  file:', fname)
    print('##  rank  word            frequency')
    for i, word in enumerate(sorted_words):
        print('%8d  %-13s  %8d' % (i+1, word, word_freq_map[word]))

def read_words_from_file(f):
    """read the words from a file and return two things: the list of
    words sorted by how often they occur, and the frequency dictionary
    which maps each word to the number of times it occurs."""
    word_set = set()
    word_freq_map = {}
    for line in f.readlines():
        word_list = re.split(r'--|\s', line)
        ## now that we have the words, let's strip away all the
        ## punctuation marks and discard anything that is left empty
        cleaned_up_words = []
        for word in word_list:
            word = word.strip(r""",.;:_"'&%^$#@!?/\|+-()*""").lower()
            if len(word) > 0:
                cleaned_up_words.append(word)
        ## now that we have found the words in this line, we count
        ## them in the frequency dictionary before we lose the count
        ## information by adding them to the set
        for word in cleaned_up_words:
            if word in word_freq_map:
                word_freq_map[word] += 1
            else:
                word_freq_map[word] = 1
        ## finally, add them to a set, which discards repeated
        ## occurrences
        word_set.update(cleaned_up_words)

    ## use the frequency dictionary to sort the words by how often
    ## they occur, most frequent first
    sorted_word_list = sorted(word_set, key=lambda w: -word_freq_map[w])
    return sorted_word_list, word_freq_map


main()

14.3.1. Example from a paragraph of your own

Type a simple paragraph of English text into a file called simple-para.txt. Then run the program on that file with:

$ python3 word-freq-rank.py simple-para.txt

You’ll see a table listing each word with its rank and the number of times it occurs, most frequent first.

You could also have run the program with no arguments and typed your paragraph directly into the terminal, ending the input with Ctrl-D.
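
For example, assuming word-freq-rank.py is in the current directory, you can pipe a short phrase straight into it:

$ echo "the cat and the dog and the cat" | python3 word-freq-rank.py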

14.3.2. Example from Project Gutenberg

To download a full book from Project Gutenberg and plot its word frequency distribution with gnuplot you can use these instructions:

Listing 14.3.2 Instructions to plot the word frequency distribution of Swann’s Way, by Marcel Proust.
##REQUIRED_FILE: swanns-way-english.txt
##PRE_EXEC: wget --output-document swanns-way-english.txt http://www.gutenberg.org/cache/epub/1128/pg1128.txt
##PRE_EXEC: ./word-freq-rank.py swanns-way-english.txt > swanns-way-freq-english.out
set multiplot layout 2,1 title "Zipf's law"
set grid
plot 'swanns-way-freq-english.out' using 1:3 with linespoints pt 6 ps 0.2 title "word frequency (linear scale)"
set logscale xy
plot 'swanns-way-freq-english.out' using 1:3 with linespoints pt 6 ps 0.2 title "word frequency (log-log scale)"

Figure 14.3.1 Plot of word frequency versus word rank for the text of Swann’s Way (volume 1 of Marcel Proust’s “In Search of Lost Time”, formerly translated into English as “Remembrance of Things Past”).

The top plot in Figure 14.3.1 uses the usual linear scale. This can be hard to read because a few big values dominate, and the small ones are hard to tell apart. That’s why the bottom plot uses a logarithmic scale for both axes (log scales are discussed in Section 8.1).

Out of curiosity, let us also look at the first few and last few lines of the output file.

Listing 14.3.3 First few lines (most common words) in Swann’s Way. First column is the rank (most frequent to least frequent), the second is the word, the third is the number of times it occurs.
##  file: swanns-way-english.txt
##  rank  word            frequency
       1  the                 903
       2  and                 784
       3  i                   650
       4  to                  570
       5  of                  554
       6  you                 486
       7  my                  458
       8  a                   419

Listing 14.3.4 Last few lines (least common words) in Swann’s Way. First column is the rank (most frequent to least frequent), the second is the word, the third is the number of times it occurs.
    4616  ballow                1
    4617  starv'd               1
    4618  derive                1
    4619  official              1
    4620  succeed               1
    4621  cities                1
    4622  lately                1
    4623  whatever              1
    4624  simp'ring             1
    4625  constrains            1
    4626  sue                   1
    4627  vor                   1

Now let’s look at a comparison of the English translation of this book versus the original French text:

Listing 14.3.5 Instructions to plot the word frequency distribution of Swann’s Way, by Marcel Proust. This version compares the original French text with the English translation.
##REQUIRED_FILE: swanns-way-french.txt
##REQUIRED_FILE: swanns-way-english.txt
##PRE_EXEC: wget --output-document swanns-way-english.txt http://www.gutenberg.org/cache/epub/1128/pg1128.txt
##PRE_EXEC: wget --output-document swanns-way-french.txt https://www.gutenberg.org/files/2650/2650-0.txt
##PRE_EXEC: ./word-freq-rank.py swanns-way-english.txt > swanns-way-freq-english.out
##PRE_EXEC: ./word-freq-rank.py swanns-way-french.txt > swanns-way-freq-french.out
set grid
set ylabel 'word frequency'
set xlabel 'word rank'
set logscale xy
set title "rank-frequency relationship for Swann's Way, comparing French and English"
plot 'swanns-way-freq-english.out' using 1:3 with linespoints, 'swanns-way-freq-french.out' using 1:3 with linespoints

Figure 14.3.2 Plot of word frequency versus word rank for the text of Swann’s Way (volume 1 of Marcel Proust’s “In Search of Lost Time”, formerly translated into English as “Remembrance of Things Past”). This version compares the original French text with the English translation. Note that the slope of the main part of the curve is the same in both languages.

14.3.3. Explanations of Zipf’s law

TODO

14.4. What are “power laws”?

If you have a relationship where the number of items with a given measure is proportional to that measure to a negative power, then we say that we have power law behavior.

In the case of Zipf’s law, if \(r\) is the rank of a word and \(f(r)\) is the frequency with which the word of that rank occurs in our text, we have:

\[f(r) = {\rm const} \times r^{-1}\]

More generally we will have:

\[N(x) = {\rm const} \times x^{-\alpha}\]

where \(\alpha\) is the “power” in “power law”. In Zipf’s law the power is \(\alpha = 1\), which gives the exponent \(-1\).

14.4.1. Calculating the power from the data

How do you calculate the power \(\alpha\) if you are given a collection of data points?

In Section 7 we learned some techniques to fit functions to data. How would we carry out a fit to the data that brings out the power in the power law?
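
Since \(N(x) = {\rm const} \times x^{-\alpha}\), taking logarithms gives a straight line whose slope is \(-\alpha\), so one simple approach is a least-squares fit of a straight line to the log-log data. Here is a minimal sketch of that approach (the program name fit-power-law.py is our own; it assumes the three-column rank/word/frequency output format of word-freq-rank.py):

#! /usr/bin/env python3

## fit-power-law.py - a sketch of estimating the power alpha from
## rank/frequency data by fitting a straight line in log-log space.
## It assumes the three-column output of word-freq-rank.py:
## rank, word, frequency.

import sys
import numpy as np

fname = sys.argv[1]
ranks = []
freqs = []
for line in open(fname):
    if line.startswith('#'):       # skip the header lines
        continue
    fields = line.split()
    if len(fields) < 3:            # skip blank or malformed lines
        continue
    ranks.append(float(fields[0]))
    freqs.append(float(fields[2]))

## log N = log(const) - alpha * log(x), so a degree-1 polynomial fit
## to the log-log data gives us the slope, which is -alpha
log_x = np.log10(ranks)
log_y = np.log10(freqs)
slope, intercept = np.polyfit(log_x, log_y, 1)
print('slope: %g   =>   alpha = %g' % (slope, -slope))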

Zipf’s law was just one rather notable example of power law, but there are many others. We will discuss a few more and write programs that download the relevant data and analyze it.

  • population-rank distribution for nations (a sketch of a program that turns this file into a rank-size table appears after this list):

wget https://raw.githubusercontent.com/datasets/population/master/data/population.csv

  • an easier-to-parse data set for the same purpose:

wget https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2119.txt

  • rank-size distribution for nations

  • city size

  • deadly conflicts

  • luminosity in astronomy
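
As an example of the kind of program we have in mind, here is a sketch that turns the first of those files into a rank-size table ready for a log-log plot. The column layout (Country Name, Country Code, Year, Value) is an assumption about population.csv, and the file also mixes real countries with regional aggregates (“World” and so on), which you would want to filter out:

#! /usr/bin/env python3

## population-rank.py - a sketch (not part of the listings above) that
## turns population.csv into a rank/population table for a log-log plot.
## ASSUMPTION: the csv columns are "Country Name,Country Code,Year,Value",
## and the file mixes countries with regional aggregates.

import csv

latest = {}       # country name -> (year, population)
with open('population.csv') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)                  # skip the header row
    for row in reader:
        name, year, value = row[0], int(row[2]), float(row[3])
        ## keep only the most recent year for each country
        if name not in latest or year > latest[name][0]:
            latest[name] = (year, value)

## sort the populations in decreasing order and print rank versus size
populations = sorted((pop for (year, pop) in latest.values()), reverse=True)
print('##  rank  population')
for rank, pop in enumerate(populations, start=1):
    print('%8d  %.0f' % (rank, pop))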

14.5. Deadly conflicts

https://github.com/zacharykitt/analyses/blob/master/deadly_quarrels.py

14.6. Benford’s law

14.6.1. Physical constants

$ wget https://physics.nist.gov/cuu/Constants/Table/allascii.txt
$ cat allascii.txt | cut -c 61 | grep '[0-9]' > first-digits.dat
gnuplot> plot 'first-digits.dat'

This has downloaded an ASCII table of physical constants (remember that ASCII is basically plain text). In the pipeline above, cut -c 61 picks out the 61st character of each line, which is where the value of each constant starts in this file, and grep '[0-9]' keeps only the lines where that character is a digit.

Now we want to look at the distribution of those first digits. Which appears most?

In Section 3.2 we learned how to make and plot histograms. The most effective way to make and plot histograms might be the snippet of code that uses numpy and matplotlib in Section 3.4. Let’s use that to write a small program that analyzes the frequency of these first digits:

Listing 14.6.1 first-digit-hist.py - Plot the frequency with which digits appear in the first position in a list of numbers.
#! /usr/bin/env python3

## run this with
## ./first-digit-hist.py first-digits.dat
## to get an interactive plot, or with
## ./first-digit-hist.py first-digits.dat first-digit-freq.png
## (or you can put pdf or svg)

import sys
import numpy as np
import matplotlib.pyplot as plt

fname = sys.argv[1]
out_fname = ''
if len(sys.argv) == 3:
    out_fname = sys.argv[2]
    out_format = sys.argv[2][-3:]

## [...] collect quantity to be histogrammed in an array x
xs = open(fname, 'r').read().split()
x = [int(digit) for digit in xs]
## prepare the plot
n, bins, patches = plt.hist(x, bins=np.arange(1, 11)-0.5,
                            density=True, facecolor='g', alpha=0.75)
plt.xticks(range(1,10))
plt.xlim([0,10])
plt.xlabel('first digit')
plt.ylabel('probability')
plt.title('Histogram of first digit probability')
plt.grid(True)
## if there is no output filename then we plot interactively
if not out_fname:
    plt.show()
else:
    plt.savefig(out_fname, format=out_format)
    print('output written to file %s' % out_fname)

Go ahead and enter the program first-digit-hist.py and then run it with:

## to get an interactive plot:
$ ./first-digit-hist.py first-digits.dat
## to put the output into a file:
$ ./first-digit-hist.py first-digits.dat first-digit-freq.png
## or you can replace .png with .pdf or .svg

The result of these runs should be a plot a bit like the one in Figure 14.6.1: you will see that 1 occurs much more frequently, followed by 2 and then 3. Once you get beyond 4 you would have to collect a lot of data to get a clear result, but you would then find that 4 occurs more than 5, which occurs more than 6 and so on.
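
For comparison, Benford’s law says that the leading digit \(d\) should occur with probability

\[P(d) = \log_{10}\left(1 + \frac{1}{d}\right)\]

which is about 30.1% for \(d = 1\) and only about 4.6% for \(d = 9\).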


Figure 14.6.1 Histogram of how many times the various digits appear as the first digit in a number. The list of numbers we use are the physical constants in the National Institute of Standards and Technology list.

To show an example of how one can use shell pipelines to look at another data set, we can take the physical constants defined in the GNU Scientific Library, whose sources can be found on github. In the pipelines below, grep picks out the lines that define constants, awk prints the third field (the value, wrapped in parentheses), grep -v '(1e' drops the entries that are exact powers of ten, and cut -c 2 extracts the character just after the opening parenthesis: the first digit.

$ wget https://raw.githubusercontent.com/ampl/gsl/master/const/gsl_const_mks.h
$ wget http://git.savannah.gnu.org/cgit/gsl.git/plain/const/gsl_const_cgs.h
$ cat gsl_const_mks.h | grep 'define GSL_' | awk '{print $3}' | grep -v '(1e' | cut -c 2 > first-digits-gsl-mks.dat
$ cat gsl_const_cgs.h | grep 'define GSL_' | awk '{print $3}' | grep -v '(1e' | cut -c 2 > first-digits-gsl-cgs.dat
$ ./first-digit-hist.py first-digits-gsl-cgs.dat &
$ ./first-digit-hist.py first-digits-gsl-mks.dat &

14.6.2. Stock quotes

Another source of numbers to which Benford’s law might apply is stock quotes. Let us download the list of companies on the NASDAQ exchange and look at the first digits of their last sale prices and market capitalizations:

wget --output-document nasdaq-list.csv 'http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download'

Listing 14.6.2 first-digit-nasdaq.py - Plot the frequency with which digits appear in the first position of NASDAQ stock market listings.
#! /usr/bin/env python3

## run this with
## ./first-digit-nasdaq.py nasdaq-list.csv
## to get an interactive plot, or with
## ./first-digit-nasdaq.py nasdaq-list.csv first-digit-nasdaq.png
## (or you can put pdf or svg)

import sys
import numpy as np
import matplotlib.pyplot as plt
import csv

fname = sys.argv[1]
out_fbase = ''
if len(sys.argv) == 3:
    out_fbase = sys.argv[2][:-4]
    out_format = sys.argv[2][-3:]

csvfile = open(fname, 'r')
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
firstdigit_sale = []
firstdigit_cap = []

## skip the header row, then pick out the last sale price (column 2)
## and the market capitalization (column 3) of each listing; for a
## positive number, the first character of its '%e' formatting is its
## first significant digit
for row in list(reader)[1:]:
    if row[2] != 'n/a':
        sale = float(row[2])
        if sale > 0:
            firstdigit = ('%e' % sale)[0]
            firstdigit_sale.append(int(firstdigit))
    if row[3] != 'n/a':
        cap = float(row[3])
        if cap > 0:
            firstdigit = ('%e' % cap)[0]
            firstdigit_cap.append(int(firstdigit))
csvfile.close()

## prepare the plot: collect quantity to be histogrammed in an array x
for (label, x) in [('sale', firstdigit_sale), ('cap', firstdigit_cap)]:
    n, bins, patches = plt.hist(x, bins=np.arange(1, 11)-0.5,
                                density=True, facecolor='g', alpha=0.75)
    plt.xticks(range(1, 10))
    plt.xlim([0, 10])
    plt.xlabel('first digit')
    plt.ylabel('probability')
    plt.title('Histogram of first digit probability')
    plt.grid(True)
    ## if there is no output filename then we plot interactively
    if not out_fbase:
        plt.show()
    else:
        out_fname = out_fbase + '_' + label + '.' + out_format
        plt.savefig(out_fname, format=out_format)
        plt.gcf().clear()
        print('output written to file %s' % out_fname)

Figure 14.6.2 Histogram of how many times the various digits appear as the first digit in a number. The list of numbers we use here are the stock quotes and market capitalization in the NASDAQ stock exchange.

14.6.3. Further reading

14.7. Pareto’s principle

14.7.1. Further reading

14.8. Olbers’ paradox