.. _crypto: ============== Cryptography ============== [status: content-mostly-written] We will discuss a narrow slice of cryptography. We will start with a weak approach, and then look at how you could get a stronger encryption with key exchange and a pseudorandom number generator. Preliminary: ASCII values ========================= A key aspect of some of the techniques we will use is that each character correponds to an integer value. In English such integers are called the *ASCII values* of characters, and they are sequential for sequential letters. ASCII stands for "American Standard Code for Information Interchange". To familiarize yourself with the ASCII codes for each character take a look at the man page for ascii: .. code:: bash man ascii and pay attention to the decimal and hex encoding of the various characters. You will recognize digits (ascii values 48 to 57), upper case letters (65 to 90) and lower case letters (97 to 122). There are also many standard keyboard symbols as well as control characters. In Python you convert from a character to its ASCII value with ``ord(ch)``. You can convert the other way from an ascii value to a char with ``chr(num)``. For example, in the python interpreter you could type: .. code:: python >>> print(ord('e')) >>> print(ord('r')) >>> print(ord('r') - ord('e')) >>> print(chr(97)) >>> print(chr(68)) ## now look at the difference between digits and their ascii values >>> print(chr(51)) >>> print(chr(51) - chr(49)) Weak crypto =========== We will look at two types of weak cyphers: *Caesar* and *substitution* cyphers. We will write programs that encode text with these cyphers, and then we will look at how to write programs to attack the cyphers and possibly decrypt messages. A simple Caesar encryptor ------------------------- In a Caesar code we shift each letter in the alphabet by a given number. If the shift is 5 then 'a' becomes 'f', 'l' becomes 'q', 'x' becomes 'c' and so forth. The world 'hello' rotated by 5 becomes 'mjqqt', and 'hello' rotated by 13 becomes 'uryyb'. This program will rotate its input by 13 characters. Our program will shift the ascii values and then produce the corresponding characters. Type the program ``caesar.py`` in :numref:`listing-caesar-py` into a file ``caesar.py``. .. _listing-caesar-py: .. literalinclude:: caesar.py :language: python :caption: caesar.py -- simple Caesar cypher: rotate the characters in the input 13. .. exercise:: A very simple exercise: change ``caesar.py`` to take an argument which is the shift number for the Caesar cipher, that way it does not have to be 13. You would see if ``sys.argv[1]`` is set, and if so use ``int(sys.argv[1])`` as your offset. .. exercise:: As a slightly more elaborate exercise you could take ``caesar.py`` and adapt it to handle upper case characters as well as lower case ones. Substitution ciphers -------------------- A simple substitution makes a table between clear text letters and encrypted letters. They don't have have to be shifted by a fixed amount: the shift can be different for each character, but each clear text character will always map to the same secret encrypted letter. Here is an example of substitution in which 'a' is mapped to 'b', 'b' to 'o', 'c' to 'm', 'd' to 'l', and so forth up to 'z' being mapped to 'p'. There seems to be no rhyme or reason to the mapping. The word 'hello' woulud become 'itxxn': .. code-block:: text abcdefghijklmnopqrstuvwxyz bomltzhiqjgxwdnsvackeuyfrp Note that the substitute characters do not have to be letters of the alphabet, as long as there is a bijective mapping (one-to-one correspondence) between the substitute characters and the letters of the alphabet. A program which encrypts with that substitution cypher is ``substitution.py`` in :numref:`listing-substitution-py`. .. _listing-substitution-py: .. literalinclude:: substitution.py :language: python :caption: substitution.py -- simple substitution cypher: replace each character with its substitute. A "literary" substitution cypher -------------------------------- Edgar Allan Poe wrote a short story called "The Gold-Bug" (see https://www.gutenberg.org/files/2147/2147-h/2147-h.htm#link2H_4_0006 and https://en.wikipedia.org/wiki/The_Gold-Bug and http://self.gutenberg.org/articles/eng/The_Gold-Bug ) In this story a Mr. Legrand is obsessed with decrypting a cryptogram purportedly written by the famous pirate Captain Kidd, giving directions to his mythical buried treasure. The cryptogram reads: .. code-block:: text 53‡‡†305))6*;4826)4‡.)4‡);806*;48†8 ¶60))85;;]8*;:‡*8†83(88)5*†;46(;88*96 *?;8)*‡(;485);5*†2:*‡(;4956*2(5*—4)8 ¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡ 1;48†85;4)485†528806*81(‡9;48;(88;4 (‡?34;48)4‡;161;:188;‡?; Preparing to attack substitution cyphers: frequency analysis ------------------------------------------------------------ The field of *cryptanalysis* is more than a thousand years old. Most classical cyphers were substitution cyphers, and the method of *frequency analysis* is effective in attacking those. To see how this works, let us remember the words of Mr. Legrand in Edgar Allan Poe's "The Gold-Bug": “Now, in English, the letter which most frequently occurs is e. Afterwards, succession runs thus: a o i d h n r s t u y c f g l m w b k p q x z. E predominates so remarkably that an individual sentence of any length is rarely seen, in which it is not the prevailing character. “Here, then, we leave, in the very beginning, the groundwork for something more than a mere guess. The general use which may be made of the table is obvious [...] The key words are "[...] groundwork for something more than a mere guess": the happiest words for a code breaker to hear. Let us make a *histogram plot* of the frequency with which letters occur in English. How do we make such a plot? And how do we get a statistically valid sample of English text so that we can count how often letters typically occur? To make the plot we can write the program ``letter_frequency.py`` in :numref:`listing-letter_frequency-py`. .. _listing-letter_frequency-py: .. literalinclude:: letter_frequency.py :language: python :caption: letter_frequency.py -- calculate and plot the frequency of each letter. Now let us analyze three different samples of English words with our ``letter_frequency.py`` program. /usr/share/dict/words ~~~~~~~~~~~~~~~~~~~~~ The easiest is probably to start with ``/usr/share/dict/words`` since this dictionary is already present in every GNU/Linux system. We will write a program which reads in the file, counts all the letters, and then puts out a simple list of how much they occur. You can see the output of this program on ``/usr/share/dict/words``: .. command-output:: python3 crypto/letter_frequency.py /usr/share/dict/words You will notice something strange: the letter 's' seems to occur about as much as the letter 'e', or even a bit more. This goes against what Mr. Legrand had stated. An exercise below will address this issue. A work by Shakespeare ~~~~~~~~~~~~~~~~~~~~~ King Lear by William Shakespeare can be found at http://www.gutenberg.org/ebooks/1128 and the plain text version is at: http://www.gutenberg.org/cache/epub/1128/pg1128.txt We can download its text and analyze it with: .. code:: wget --output-document king-lear.txt http://www.gutenberg.org/cache/epub/1128/pg1128.txt python3 ./letter_frequency.py king-lear.txt .. command-output:: wget --output-document king-lear.txt http://www.gutenberg.org/cache/epub/1128/pg1128.txt :shell: :ellipsis: 0 .. command-output:: python3 ./crypto/letter_frequency.py king-lear.txt A contemporary English book ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Looking at the books on Project Gutenberg sorted by release date at https://www.gutenberg.org/ebooks/search/?sort_order=release_date we can pick "An Ocean Tragedy" by William Clark Russell and download its text and then run our program with: .. code:: wget --output-document ocean-tragedy.txt https://www.gutenberg.org/files/56363/56363-0.txt python3 ./letter_frequency.py ocean-tragedy.txt .. command-output:: wget --output-document ocean-tragedy.txt https://www.gutenberg.org/files/56363/56363-0.txt :shell: :ellipsis: 0 .. command-output:: python3 ./crypto/letter_frequency.py ocean-tragedy.txt Conclusion: signatures ~~~~~~~~~~~~~~~~~~~~~~ Take a close look at the three histograms, the one from ``/usr/share/dict/words`` has a certain feeling Exercises ~~~~~~~~~ .. exercise:: Discuss with your fellow students why the histogram of letters in ``/usr/share/dict/words`` does not resemble that in actual books. .. exercise:: Are there significant differences in the letter frequency between Shakespeare and the modern book? Discuss this with your fellow students. .. exercise:: Study the differnce in letter frequency distributions between short and long bits of text. Take a few single-paragraph files and look at their letter distribution and compare it to what you got with a full book. .. exercise:: Look at books in other languages and see what the letter distribution is there. Can you identify the language a book was written in just by looking at the distribution of letters? Note: the program ``letter_frequency.py`` will run on a european language file that has accented letters (like Italian or German or French) which are not part of ASCII, but it will do so by dropping all those characters. You will want to adapt it to handle those special accented characters. Applying the frequency analysis to a message -------------------------------------------- Let's do this as a sequence of exercises: .. exercise:: First type a reasonably long message in English into a file. Then run your ``caesar.py`` or ``substitution.py`` on that file to encrypt it and save the output to another file. .. exercise:: Email that file to a "friend" and to a hypothetical "foe". Only tell your friend what the Caesar shift or the substitution map is, and see if the foe can crack the code. .. exercise:: Alternate being writer, friend and foe so that everyone in your group gets to decrypt a message. .. exercise:: Is it easier to decode short or long messages? .. exercise:: The *foe* should try to crack the code without knowing what the substitution table is. How can she use the ``letter_frequency.py`` program to do this? Strong crypto ============= We have seen how easy it is to break simple substitution cyphers. Those patterns in the distribution of letters are a clear sign that we are not encrypting very intelligently. Between the 1880s and 1910s stronger methods of encryption were invented. The *one time pad* is a theoretically unbreakable method of encryption. Both sender and recipient have the same random sequence of symbols, and each character they send gets folded in some way with the next symbol in the one time pad. We will implement something similar to the one time pad, but first we need to review some aspects of binary representation of numbers. Binary numbers, XOR, hiding the byte ------------------------------------ Remember that characters are represented by their ASCII codes. The ASCII code for 'g' is 103 in decimal, 0x67 in hex, and 01100111 in binary. We want to find a way to *fold* 'g' with some random byte (remember: a byte is an 8-bit number). Let's choose 01101010. We could fold the two together in many different ways, but we want to then be able to separate them back out. The approach that is usually used is to use the XOR operator. What is this? We will apply the various logic operators to bits: if 0 means false and 1 means true, then we have the following truth tables: .. code:: text 0 OR 0 = 0 0 OR 1 = 1 1 OR 0 = 1 1 OR 1 = 1 0 AND 0 = 0 0 AND 1 = 0 1 AND 0 = 0 1 AND 1 = 1 NOT 0 = 1 NOT 1 = 0 Sometimes people find the behavior of OR to be counter-intuitive because "1 OR 1 = 1". People sometimes remember their parents saying "you can have cake OR you can have ice cream but not both". That violates "1 OR 1 = 1", but parents don't usually accept this logic. For parents and cryptographers we have another logical operation called XOR (*exclusive* OR). The truth table for XOR is: .. code:: text 0 XOR 0 = 0 0 XOR 1 = 1 1 XOR 0 = 1 1 XOR 1 = 0 This was done one bit at a time. Suppose you have the character 'g' (01100111) and our pattern 01101010. The XOR of those two is done bit-by-bit: .. code:: text 01100111 ('g') XOR 01101010 (number from one time pad) | v 00001101 (encrypted number) The cool thing about the XOR function is that if you XOR with the same number again you get back the original: all those bits get flipped back: .. code:: text 01100111 XOR 01101010 -> 00001101 (the encrypted number) 00001101 XOR 01101010 -> 01100111 (back to 'g'!) What does this mean? It means that if you know the secret number you can encrypt your numbers, and you can also decrypt them! And someone who does not have have the secret number cannot make any progress. Manipulating bits in python --------------------------- Python offers us the "bitwise xor" operator which uses the symbol ``^`` Let us see examples of how to use it: .. code:: python ## in the python interpreter, at the >>> prompt, type the following x = int('01100111', 2) y = int('01101010', 2) z = x ^ y print('%d ^ %d = %d' % (x, y, z)) print('0x%x ^ 0x%x = 0x%x' % (x, y, z)) print('%s ^ %s = %s' % (bin(x), bin(y), bin(z))) print('to restore the original number:') print('%s ^ %s = %s' % (bin(x), bin(z), bin(z^y))) print(bin(x), bin(z^y), x == z^y) Revisiting random number generators ----------------------------------- We have explored random numbers before, but our angle here is slightly different. We are now interested in how two different people can get the *same* stream of random numbers. First let us look at an example of how to get the first ten numbers in a stream. At the Python interpreter type: .. code:: python import random for i in range(10): print(i, random.randint(0, 255)) Now run the same thing in a different terminal. You will see that the two streams are quite different, even though you are using the same random number generator in both examples. The reason for the difference is that the *starting point* for the sequence is different. This starting point is called the random number seed, and by default it is set to a value determined by some real time event, like the current wall clock time. But almost all random number generators allow you to *set* the seed to a given value, so now let us try running the following code in two different terminals: .. code:: python import random random.seed(7419) for i in range(10): print(i, random.randint(0, 255)) The output will be the same for both: when we set an identical seed, the random sequences are identical. Note that the seed (which I set to 7419) could be any integer -- each different seed will produce a different sequence, but the if you use that integer twice then you get the same sequence. This setting of the random number seed will be crucial to allowing a person to decrypt our message. Implementing tougher encryption ------------------------------- The idea is be to generate a stream of random numbers, use them to encrypt a stream of bytes with the XOR operation. Later we will use that same sequence to decrypt the encrypted bytes. The program which will encrypt text is ``pad_encrypt.py`` in :numref:`listing-pad_encrypt-py`. .. _listing-pad_encrypt-py: .. literalinclude:: pad_encrypt.py :language: python :caption: pad_encrypt.py -- Simple encrypting program which uses a one-time pad (in our case a sequence of random numbers) to encrypt a file. You can look at the result of encrypting a small file by typing a paragraph into a text file and running ``pad_encrypt.py filename`` and for very short snippets you can do something like ``echo "just three words" | python3 pad_encrypt.py`` and get: .. program-output:: echo "just three words" | python3 crypto/pad_encrypt.py :shell: As you can see the program ``pad_encrypt.py`` outputs the encrypted bytes as hex numbers that look like 0x5a. This is because these encrypted bytes would seldom be readable as part of a string: most of them would not be ascii values for anything printable. Decrypting this tougher encryption ---------------------------------- The program to decrypt text encoded by ``pad_encrypt.py`` is ``pad_decrypt.py`` in :numref:`listing-pad_decrypt-py`. .. _listing-pad_decrypt-py: .. literalinclude:: pad_decrypt.py :language: python :caption: pad_decrypt.py -- Simple decrypting program which uses a one-time pad (in our case a sequence of random numbers) to encrypt a file. You will notice that ``pad_decrypt.py`` has an interesting feature: it uses the function ``print_frequency_histogram()`` from ``letter_frequency.py`` to show us the frequency histogram for both the encrypted and decrypted text: .. command-output:: python3 crypto/pad_encrypt.py king-lear.txt > king-lear-encrypted-pad.txt :shell: :ellipsis: 0 .. command-output:: python3 crypto/pad_decrypt.py king-lear-encrypted-pad.txt :shell: :ellipsis: 12, 250 You should be delighted at how the encrypted text has no discernible pattern in its frequency distribution, while the decrypted version has the pattern we are familiar with. This means that one *cannot* decrypt this code using frequency analysis. Further reading =============== Technical details ----------------- * https://en.wikipedia.org/wiki/XOR_cipher Historical ---------- * https://en.wikipedia.org/wiki/Venona_project which contains the statement "[...] Due to a serious blunder on the part of the Soviets, some of this traffic was vulnerable to cryptanalysis. The Soviet company that manufactured the one-time pads produced around 35,000 pages of duplicate key numbers, as a result of pressures brought about by the German advance on Moscow during World War II. [...]" * "The Code Book", by Simon Singh Videos ------ * Crash Course https://www.youtube.com/watch?v=jhXCTbFnK8o * Cris Moore's talk on cryptography: https://www.youtube.com/watch?v=3J2RHGVf0ho