.. _crypto:

==============
 Cryptography
==============

[status: content-mostly-written]

We will discuss a narrow slice of cryptography.  We will start with a
weak approach, and then look at how you could get a stronger
encryption with key exchange and a pseudorandom number generator.

Preliminary: ASCII values
=========================

A key aspect of some of the techniques we will use is that each
character correponds to an integer value.  In English such integers
are called the *ASCII values* of characters, and they are sequential
for sequential letters.  ASCII stands for "American Standard Code for
Information Interchange".

To familiarize yourself with the ASCII codes for each character take a
look at the man page for ascii:

.. code:: bash

   man ascii

and pay attention to the decimal and hex encoding of the various
characters.  You will recognize digits (ascii values 48 to 57), upper
case letters (65 to 90) and lower case letters (97 to 122).  There are
also many standard keyboard symbols as well as control characters.

In Python you convert from a character to its ASCII value with
``ord(ch)``.  You can convert the other way from an ascii value to a
char with ``chr(num)``.  For example, in the python interpreter you
could type:

.. code:: python

   >>> print(ord('e'))
   >>> print(ord('r'))
   >>> print(ord('r') - ord('e'))
   >>> print(chr(97))
   >>> print(chr(68))
   ## now look at the difference between digits and their ascii values
   >>> print(chr(51))
   >>> print(chr(51) - chr(49))


Weak crypto
===========

We will look at two types of weak cyphers: *Caesar* and *substitution*
cyphers.  We will write programs that encode text with these cyphers,
and then we will look at how to write programs to attack the cyphers
and possibly decrypt messages.


A simple Caesar encryptor
-------------------------

In a Caesar code we shift each letter in the alphabet by a given
number.  If the shift is 5 then 'a' becomes 'f', 'l' becomes 'q', 'x'
becomes 'c' and so forth.  The world 'hello' rotated by 5 becomes
'mjqqt', and 'hello' rotated by 13 becomes 'uryyb'.

This program will rotate its input by 13 characters.  Our program will
shift the ascii values and then produce the corresponding characters.
Type the program ``caesar.py`` in :numref:`listing-caesar-py` into a file
``caesar.py``.

.. _listing-caesar-py:

.. literalinclude:: caesar.py
   :language: python
   :caption: caesar.py -- simple Caesar cypher: rotate the characters
             in the input 13.

.. exercise::

   A very simple exercise: change ``caesar.py`` to take an argument
   which is the shift number for the Caesar cipher, that way it does
   not have to be 13.  You would see if ``sys.argv[1]`` is set, and if
   so use ``int(sys.argv[1])`` as your offset.

.. exercise::

   As a slightly more elaborate exercise you could take ``caesar.py``
   and adapt it to handle upper case characters as well as lower case
   ones.


Substitution ciphers
--------------------

A simple substitution makes a table between clear text letters and
encrypted letters.  They don't have have to be shifted by a fixed
amount: the shift can be different for each character, but each clear
text character will always map to the same secret encrypted letter.

Here is an example of substitution in which 'a' is mapped to 'b', 'b'
to 'o', 'c' to 'm', 'd' to 'l', and so forth up to 'z' being mapped to
'p'.  There seems to be no rhyme or reason to the mapping.  The word
'hello' woulud become 'itxxn':

.. code-block:: text

   abcdefghijklmnopqrstuvwxyz
   bomltzhiqjgxwdnsvackeuyfrp

Note that the substitute characters do not have to be letters of the
alphabet, as long as there is a bijective mapping (one-to-one
correspondence) between the substitute characters and the letters of
the alphabet.

A program which encrypts with that substitution cypher is
``substitution.py`` in :numref:`listing-substitution-py`.

.. _listing-substitution-py:

.. literalinclude:: substitution.py
   :language: python
   :caption: substitution.py -- simple substitution cypher: replace
             each character with its substitute.


A "literary" substitution cypher
--------------------------------

Edgar Allan Poe wrote a short story called "The Gold-Bug" (see
https://www.gutenberg.org/files/2147/2147-h/2147-h.htm#link2H_4_0006
and https://en.wikipedia.org/wiki/The_Gold-Bug and
http://self.gutenberg.org/articles/eng/The_Gold-Bug )

In this story a Mr. Legrand is obsessed with decrypting a cryptogram
purportedly written by the famous pirate Captain Kidd, giving
directions to his mythical buried treasure.  The cryptogram reads:

.. code-block:: text

   53‡‡†305))6*;4826)4‡.)4‡);806*;48†8
   ¶60))85;;]8*;:‡*8†83(88)5*†;46(;88*96
   *?;8)*‡(;485);5*†2:*‡(;4956*2(5*—4)8
   ¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡
   1;48†85;4)485†528806*81(‡9;48;(88;4
   (‡?34;48)4‡;161;:188;‡?;


Preparing to attack substitution cyphers: frequency analysis
------------------------------------------------------------

The field of *cryptanalysis* is more than a thousand years old.  Most
classical cyphers were substitution cyphers, and the method of
*frequency analysis* is effective in attacking those.

To see how this works, let us remember the words of Mr. Legrand in
Edgar Allan Poe's "The Gold-Bug":

   “Now, in English, the letter which most frequently occurs
   is e. Afterwards, succession runs thus: a o i d h n r s t u y c f g
   l m w b k p q x z. E predominates so remarkably that an individual
   sentence of any length is rarely seen, in which it is not the
   prevailing character.

   “Here, then, we leave, in the very beginning, the groundwork for
   something more than a mere guess. The general use which may be made
   of the table is obvious [...]

The key words are "[...] groundwork for something more than a mere
guess": the happiest words for a code breaker to hear.

Let us make a *histogram plot* of the frequency with which letters
occur in English.  How do we make such a plot?  And how do we get a
statistically valid sample of English text so that we can count how
often letters typically occur?

To make the plot we can write the program ``letter_frequency.py`` in
:numref:`listing-letter_frequency-py`.

.. _listing-letter_frequency-py:

.. literalinclude:: letter_frequency.py
   :language: python
   :caption: letter_frequency.py -- calculate and plot the frequency
             of each letter.

Now let us analyze three different samples of English words with our
``letter_frequency.py`` program.

/usr/share/dict/words
~~~~~~~~~~~~~~~~~~~~~

The easiest is probably to start with ``/usr/share/dict/words`` since
this dictionary is already present in every GNU/Linux system.  We will
write a program which reads in the file, counts all the letters, and
then puts out a simple list of how much they occur.

You can see the output of this program on ``/usr/share/dict/words``:

.. command-output:: python3 crypto/letter_frequency.py /usr/share/dict/words

You will notice something strange: the letter 's' seems to occur about
as much as the letter 'e', or even a bit more.  This goes against what
Mr. Legrand had stated.  An exercise below will address this issue.

A work by Shakespeare
~~~~~~~~~~~~~~~~~~~~~

King Lear by William Shakespeare can be found at
http://www.gutenberg.org/ebooks/1128
and the plain text version is at:
http://www.gutenberg.org/cache/epub/1128/pg1128.txt

We can download its text and analyze it with:

.. code::

   wget --output-document king-lear.txt http://www.gutenberg.org/cache/epub/1128/pg1128.txt
   python3 ./letter_frequency.py king-lear.txt

.. command-output:: wget --output-document king-lear.txt http://www.gutenberg.org/cache/epub/1128/pg1128.txt
   :shell:
   :ellipsis: 0

.. command-output:: python3 ./crypto/letter_frequency.py king-lear.txt


A contemporary English book
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Looking at the books on Project Gutenberg sorted by release date at

https://www.gutenberg.org/ebooks/search/?sort_order=release_date

we can pick "An Ocean Tragedy" by William Clark Russell and download
its text and then run our program with:

.. code::

   wget --output-document ocean-tragedy.txt https://www.gutenberg.org/files/56363/56363-0.txt
   python3 ./letter_frequency.py ocean-tragedy.txt

.. command-output:: wget --output-document ocean-tragedy.txt https://www.gutenberg.org/files/56363/56363-0.txt
   :shell:
   :ellipsis: 0

.. command-output:: python3 ./crypto/letter_frequency.py ocean-tragedy.txt


Conclusion: signatures
~~~~~~~~~~~~~~~~~~~~~~

Take a close look at the three histograms, the one from
``/usr/share/dict/words`` has a certain feeling


Exercises
~~~~~~~~~

.. exercise::

   Discuss with your fellow students why the histogram of letters in
   ``/usr/share/dict/words`` does not resemble that in actual books.

.. exercise::

   Are there significant differences in the letter frequency between
   Shakespeare and the modern book?  Discuss this with your fellow
   students.

.. exercise::

   Study the differnce in letter frequency distributions between short
   and long bits of text.  Take a few single-paragraph files and look
   at their letter distribution and compare it to what you got with a
   full book.

.. exercise::

   Look at books in other languages and see what the letter
   distribution is there.  Can you identify the language a book was
   written in just by looking at the distribution of letters?  Note:
   the program ``letter_frequency.py`` will run on a european language
   file that has accented letters (like Italian or German or French)
   which are not part of ASCII, but it will do so by dropping all
   those characters.  You will want to adapt it to handle those
   special accented characters.


Applying the frequency analysis to a message
--------------------------------------------

Let's do this as a sequence of exercises:

.. exercise::

   First type a reasonably long message in English into a file.  Then
   run your ``caesar.py`` or ``substitution.py`` on that file to
   encrypt it and save the output to another file.

.. exercise::

   Email that file to a "friend" and to a hypothetical "foe".  Only
   tell your friend what the Caesar shift or the substitution map is,
   and see if the foe can crack the code.

.. exercise::

   Alternate being writer, friend and foe so that everyone in your
   group gets to decrypt a message.

.. exercise::

   Is it easier to decode short or long messages?

.. exercise::

   The *foe* should try to crack the code without knowing what the
   substitution table is.  How can she use the ``letter_frequency.py``
   program to do this?


Strong crypto
=============

We have seen how easy it is to break simple substitution cyphers.
Those patterns in the distribution of letters are a clear sign that we
are not encrypting very intelligently.

Between the 1880s and 1910s stronger methods of encryption were
invented.  The *one time pad* is a theoretically unbreakable method of
encryption.  Both sender and recipient have the same random sequence
of symbols, and each character they send gets folded in some way with
the next symbol in the one time pad.

We will implement something similar to the one time pad, but first we
need to review some aspects of binary representation of numbers.


Binary numbers, XOR, hiding the byte
------------------------------------

Remember that characters are represented by their ASCII codes.  The
ASCII code for 'g' is 103 in decimal, 0x67 in hex, and 01100111 in
binary.

We want to find a way to *fold* 'g' with some random byte (remember: a
byte is an 8-bit number).  Let's choose 01101010.  We could fold the
two together in many different ways, but we want to then be able to
separate them back out.

The approach that is usually used is to use the XOR operator.  What is
this?

We will apply the various logic operators to bits: if 0 means false
and 1 means true, then we have the following truth tables:

.. code:: text

   0 OR 0 = 0
   0 OR 1 = 1
   1 OR 0 = 1
   1 OR 1 = 1

   0 AND 0 = 0
   0 AND 1 = 0
   1 AND 0 = 0
   1 AND 1 = 1

   NOT 0 = 1
   NOT 1 = 0

Sometimes people find the behavior of OR to be counter-intuitive
because "1 OR 1 = 1".  People sometimes remember their parents saying
"you can have cake OR you can have ice cream but not both".  That
violates "1 OR 1 = 1", but parents don't usually accept this logic.
For parents and cryptographers we have another logical operation
called XOR (*exclusive* OR).  The truth table for XOR is:

.. code:: text

   0 XOR 0 = 0
   0 XOR 1 = 1
   1 XOR 0 = 1
   1 XOR 1 = 0

This was done one bit at a time.  Suppose you have the character
'g' (01100111) and our pattern 01101010.  The XOR of those two is done
bit-by-bit:

.. code:: text

   01100111  ('g')
     XOR
   01101010  (number from one time pad)
      |
      v
   00001101  (encrypted number)

The cool thing about the XOR function is that if you XOR with the same
number again you get back the original: all those bits get flipped
back:

.. code:: text

   01100111 XOR 01101010 -> 00001101 (the encrypted number)
   00001101 XOR 01101010 -> 01100111 (back to 'g'!)

What does this mean?  It means that if you know the secret number you
can encrypt your numbers, and you can also decrypt them!  And someone
who does not have have the secret number cannot make any progress.

Manipulating bits in python
---------------------------

Python offers us the "bitwise xor" operator which uses the symbol
``^``

Let us see examples of how to use it:

.. code:: python

   ## in the python interpreter, at the >>> prompt, type the following
   x = int('01100111', 2)
   y = int('01101010', 2)
   z = x ^ y
   print('%d ^ %d = %d' % (x, y, z))
   print('0x%x ^ 0x%x = 0x%x' % (x, y, z))
   print('%s ^ %s = %s' % (bin(x), bin(y), bin(z)))
   print('to restore the original number:')
   print('%s ^ %s = %s' % (bin(x), bin(z), bin(z^y)))
   print(bin(x), bin(z^y), x == z^y)


Revisiting random number generators
-----------------------------------

We have explored random numbers before, but our angle here is slightly
different.  We are now interested in how two different people can get
the *same* stream of random numbers.

First let us look at an example of how to get the first ten numbers in
a stream.  At the Python interpreter type:

.. code:: python

   import random
   for i in range(10):
        print(i, random.randint(0, 255))

Now run the same thing in a different terminal.  You will see that the
two streams are quite different, even though you are using the same
random number generator in both examples.

The reason for the difference is that the *starting point* for the
sequence is different.  This starting point is called the random
number seed, and by default it is set to a value determined by some
real time event, like the current wall clock time.

But almost all random number generators allow you to *set* the seed to
a given value, so now let us try running the following code in two
different terminals:
   
.. code:: python

   import random
   random.seed(7419)
   for i in range(10):
       print(i, random.randint(0, 255))

The output will be the same for both: when we set an identical seed,
the random sequences are identical.

Note that the seed (which I set to 7419) could be any integer -- each
different seed will produce a different sequence, but the if you use
that integer twice then you get the same sequence.

This setting of the random number seed will be crucial to allowing a
person to decrypt our message.


Implementing tougher encryption
-------------------------------

The idea is be to generate a stream of random numbers, use them to
encrypt a stream of bytes with the XOR operation.  Later we will use
that same sequence to decrypt the encrypted bytes.

The program which will encrypt text is ``pad_encrypt.py`` in
:numref:`listing-pad_encrypt-py`.

.. _listing-pad_encrypt-py:

.. literalinclude:: pad_encrypt.py
   :language: python
   :caption: pad_encrypt.py -- Simple encrypting program which uses a
             one-time pad (in our case a sequence of random numbers)
             to encrypt a file.

You can look at the result of encrypting a small file by typing a
paragraph into a text file and running ``pad_encrypt.py filename`` and
for very short snippets you can do something like ``echo "just three
words" | python3 pad_encrypt.py`` and get:

.. program-output:: echo "just three words" | python3 crypto/pad_encrypt.py
   :shell:

As you can see the program ``pad_encrypt.py`` outputs the encrypted
bytes as hex numbers that look like 0x5a.  This is because these
encrypted bytes would seldom be readable as part of a string: most of
them would not be ascii values for anything printable.

Decrypting this tougher encryption
----------------------------------

The program to decrypt text encoded by ``pad_encrypt.py`` is
``pad_decrypt.py`` in :numref:`listing-pad_decrypt-py`.

.. _listing-pad_decrypt-py:

.. literalinclude:: pad_decrypt.py
   :language: python
   :caption: pad_decrypt.py -- Simple decrypting program which uses a
             one-time pad (in our case a sequence of random numbers)
             to encrypt a file.

You will notice that ``pad_decrypt.py`` has an interesting feature: it
uses the function ``print_frequency_histogram()`` from
``letter_frequency.py`` to show us the frequency histogram for both
the encrypted and decrypted text:

.. command-output:: python3 crypto/pad_encrypt.py king-lear.txt > king-lear-encrypted-pad.txt
   :shell:
   :ellipsis: 0

.. command-output:: python3 crypto/pad_decrypt.py king-lear-encrypted-pad.txt
   :shell:
   :ellipsis: 12, 250

You should be delighted at how the encrypted text has no discernible
pattern in its frequency distribution, while the decrypted version has
the pattern we are familiar with.

This means that one *cannot* decrypt this code using frequency
analysis.

Further reading
===============

Technical details
-----------------

* https://en.wikipedia.org/wiki/XOR_cipher

Historical
----------

* https://en.wikipedia.org/wiki/Venona_project which contains the
  statement "[...] Due to a serious blunder on the part of the
  Soviets, some of this traffic was vulnerable to cryptanalysis. The
  Soviet company that manufactured the one-time pads produced around
  35,000 pages of duplicate key numbers, as a result of pressures
  brought about by the German advance on Moscow during World
  War II. [...]"

* "The Code Book", by Simon Singh

Videos
------

* Crash Course https://www.youtube.com/watch?v=jhXCTbFnK8o

* Cris Moore's talk on cryptography:
  https://www.youtube.com/watch?v=3J2RHGVf0ho