34. Cryptography

[status: content-mostly-written]

We will discuss a narrow slice of cryptography. We will start with a weak approach, and then look at how you could get a stronger encryption with key exchange and a pseudorandom number generator.

34.1. Preliminary: ASCII values

A key aspect of some of the techniques we will use is that each character correponds to an integer value. In English such integers are called the ASCII values of characters, and they are sequential for sequential letters. ASCII stands for “American Standard Code for Information Interchange”.

To familiarize yourself with the ASCII codes for each character take a look at the man page for ascii:

man ascii

and pay attention to the decimal and hex encoding of the various characters. You will recognize digits (ascii values 48 to 57), upper case letters (65 to 90) and lower case letters (97 to 122). There are also many standard keyboard symbols as well as control characters.

In Python you convert from a character to its ASCII value with ord(ch). You can convert the other way from an ascii value to a char with chr(num). For example, in the python interpreter you could type:

>>> print(ord('e'))
>>> print(ord('r'))
>>> print(ord('r') - ord('e'))
>>> print(chr(97))
>>> print(chr(68))
## now look at the difference between digits and their ascii values
>>> print(chr(51))
>>> print(chr(51) - chr(49))

34.2. Weak crypto

We will look at two types of weak cyphers: Caesar and substitution cyphers. We will write programs that encode text with these cyphers, and then we will look at how to write programs to attack the cyphers and possibly decrypt messages.

34.2.1. A simple Caesar encryptor

In a Caesar code we shift each letter in the alphabet by a given number. If the shift is 5 then ‘a’ becomes ‘f’, ‘l’ becomes ‘q’, ‘x’ becomes ‘c’ and so forth. The world ‘hello’ rotated by 5 becomes ‘mjqqt’, and ‘hello’ rotated by 13 becomes ‘uryyb’.

This program will rotate its input by 13 characters. Our program will shift the ascii values and then produce the corresponding characters. Type the program caesar.py in Listing 34.2.1 into a file caesar.py.

Listing 34.2.1 caesar.py – simple Caesar cypher: rotate the characters in the input 13.
#! /usr/bin/env python3
import sys

def main():
    shift = 13
    encrypted_line = ''
    fname = ''
    if len(sys.argv) > 2:
        print('error: usage is "%s [filepath]"' % sys.argv[0])
        sys.exit(1)
    elif len(sys.argv) == 2:
        fname = sys.argv[1]
        f = open(fname, 'r')
    else:
        fname = 'standard-input'
        f = sys.stdin
    ## now go through the file, one character at a time
    for c in f.read():
        if c < 'a' or c > 'z':  # we only handle 'a' <= c <= 'z'
            encrypted_line = encrypted_line + c # add it with no encryption
            continue
        ascii_value = ord(c)
        ## find the new character by shifting this character
        crypt_char = ascii_value + 13
        if crypt_char > ord('z'):
            ## if we go beyond z we wrap back around to 'a'
            crypt_char = (crypt_char - ord('a')) % 26 + ord('a')
        # print(c, '->', crypt_char)
        encrypted_line = encrypted_line + chr(crypt_char)
    print('encrypted: ', encrypted_line)

main()
Exercise 34.1

A very simple exercise: change caesar.py to take an argument which is the shift number for the Caesar cipher, that way it does not have to be 13. You would see if sys.argv[1] is set, and if so use int(sys.argv[1]) as your offset.

Exercise 34.2

As a slightly more elaborate exercise you could take caesar.py and adapt it to handle upper case characters as well as lower case ones.

34.2.2. Substitution ciphers

A simple substitution makes a table between clear text letters and encrypted letters. They don’t have have to be shifted by a fixed amount: the shift can be different for each character, but each clear text character will always map to the same secret encrypted letter.

Here is an example of substitution in which ‘a’ is mapped to ‘b’, ‘b’ to ‘o’, ‘c’ to ‘m’, ‘d’ to ‘l’, and so forth up to ‘z’ being mapped to ‘p’. There seems to be no rhyme or reason to the mapping. The word ‘hello’ woulud become ‘itxxn’:

abcdefghijklmnopqrstuvwxyz
bomltzhiqjgxwdnsvackeuyfrp

Note that the substitute characters do not have to be letters of the alphabet, as long as there is a bijective mapping (one-to-one correspondence) between the substitute characters and the letters of the alphabet.

A program which encrypts with that substitution cypher is substitution.py in Listing 34.2.2.

Listing 34.2.2 substitution.py – simple substitution cypher: replace each character with its substitute.
#! /usr/bin/env python3
import sys

def main():
    original   = list('abcdefghijklmnopqrstuvwxyz')
    #substitute = list('bomltzhiqjgxwdnsvackeuyfrp')
    substitute = list('omltzniqjgxwdhsvackeuyfrpb')

    encrypted_line = ''
    fname = ''
    if len(sys.argv) > 2:
        print('error: usage is "%s [filepath]"' % sys.argv[0])
        sys.exit(1)
    elif len(sys.argv) == 2:
        fname = sys.argv[1]
        f = open(fname, 'r')
    else:
        fname = 'standard-input'
        f = sys.stdin
    ## now go through the file, one character at a time
    for c in f.read():
        if c < 'a' or c > 'z':  # we only handle 'a' <= c <= 'z'
            encrypted_line = encrypted_line + c
            continue
        ## pos is the position in the original list
        pos = ord(c) - ord('a')
        ## find the new character by taking that position in the
        ## substitute list
        crypt_char = substitute[pos]
        # print(c, '->', crypt_char)
        encrypted_line = encrypted_line + crypt_char
    print('encrypted: ', encrypted_line)

main()

34.2.3. A “literary” substitution cypher

Edgar Allan Poe wrote a short story called “The Gold-Bug” (see https://www.gutenberg.org/files/2147/2147-h/2147-h.htm#link2H_4_0006 and https://en.wikipedia.org/wiki/The_Gold-Bug and http://self.gutenberg.org/articles/eng/The_Gold-Bug )

In this story a Mr. Legrand is obsessed with decrypting a cryptogram purportedly written by the famous pirate Captain Kidd, giving directions to his mythical buried treasure. The cryptogram reads:

53‡‡†305))6*;4826)4‡.)4‡);806*;48†8
¶60))85;;]8*;:‡*8†83(88)5*†;46(;88*96
*?;8)*‡(;485);5*†2:*‡(;4956*2(5*—4)8
¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡
1;48†85;4)485†528806*81(‡9;48;(88;4
(‡?34;48)4‡;161;:188;‡?;

34.2.4. Preparing to attack substitution cyphers: frequency analysis

The field of cryptanalysis is more than a thousand years old. Most classical cyphers were substitution cyphers, and the method of frequency analysis is effective in attacking those.

To see how this works, let us remember the words of Mr. Legrand in Edgar Allan Poe’s “The Gold-Bug”:

“Now, in English, the letter which most frequently occurs is e. Afterwards, succession runs thus: a o i d h n r s t u y c f g l m w b k p q x z. E predominates so remarkably that an individual sentence of any length is rarely seen, in which it is not the prevailing character.

“Here, then, we leave, in the very beginning, the groundwork for something more than a mere guess. The general use which may be made of the table is obvious […]

The key words are “[…] groundwork for something more than a mere guess”: the happiest words for a code breaker to hear.

Let us make a histogram plot of the frequency with which letters occur in English. How do we make such a plot? And how do we get a statistically valid sample of English text so that we can count how often letters typically occur?

To make the plot we can write the program letter_frequency.py in Listing 34.2.3.

Listing 34.2.3 letter_frequency.py – calculate and plot the frequency of each letter.
#! /usr/bin/env python3

import sys

def main():
    fname = ''
    if len(sys.argv) > 2:
        print('error: usage is "%s [filepath]"' % sys.argv[0])
        sys.exit(1)
    elif len(sys.argv) == 2:
        fname = sys.argv[1]
        f = open(fname, 'r')
    else:
        fname = 'standard-input'
        f = sys.stdin

    letter_count = [0]*26
    for c in f.read():
        if not c.isalpha(): # special cases: character is not alphabetic
            continue
        if ord(c) > 127:        # or not ascii
            continue
        ## special case: character is uppercase, so we fold it down to
        ## lowercase
        if c.isupper():
            c = c.lower()
        ## now we have a lower chase char, so we find its position in
        ## the alphabet
        pos = ord(c) - ord('a')
        ## now increase the letter_count histogram for that position
        letter_count[pos] += 1
    f.close()
    print_frequency_histogram(fname, letter_count, offset=ord('a'))

def print_frequency_histogram(title, hist, offset=0):
    print('====== frequency histogram for %s ======' % title)
    for i in range(len(hist)):  # simple ascii table+histogram
        c = chr(i+offset)
        if not (i + offset in range(ord('A'), 1+ord('Z'))
                or i + offset in range(ord('a'), 1+ord('z'))
                or i + offset in range(ord('0'), 1+ord('9'))):
            c = ' '
        print('%3d -- %s %8d --  '
              % (i, c, hist[i]), end="")
              # % (i, chr(i + ord('a')), hist[i]), end="")
        ## now print a histogram of the letter X, but scale it down so
        ## the max has some 50 Xs
        n_Xs_to_print = float(hist[i]) * 50.0 / max(hist)
        for j in range(int(n_Xs_to_print)):
            print('X', end="")
        print()

if __name__ == '__main__':
    main()

Now let us analyze three different samples of English words with our letter_frequency.py program.

34.2.4.1. /usr/share/dict/words

The easiest is probably to start with /usr/share/dict/words since this dictionary is already present in every GNU/Linux system. We will write a program which reads in the file, counts all the letters, and then puts out a simple list of how much they occur.

You can see the output of this program on /usr/share/dict/words:

====== frequency histogram for /usr/share/dict/words ======
  0 -- a    67956 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  1 -- b    16446 --  XXXXXXXX
  2 -- c    33242 --  XXXXXXXXXXXXXXXXX
  3 -- d    29683 --  XXXXXXXXXXXXXXX
  4 -- e    92097 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  5 -- f    11146 --  XXXXX
  6 -- g    23682 --  XXXXXXXXXXXX
  7 -- h    20490 --  XXXXXXXXXX
  8 -- i    69461 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  9 -- j     2080 --  X
 10 -- k     9057 --  XXXX
 11 -- l    43064 --  XXXXXXXXXXXXXXXXXXXXXX
 12 -- m    23656 --  XXXXXXXXXXXX
 13 -- n    59577 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 14 -- o    51269 --  XXXXXXXXXXXXXXXXXXXXXXXXXX
 15 -- p    23100 --  XXXXXXXXXXXX
 16 -- q     1604 --  
 17 -- r    59717 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 18 -- s    95874 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 19 -- t    54763 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXX
 20 -- u    27214 --  XXXXXXXXXXXXXX
 21 -- v     8436 --  XXXX
 22 -- w     8002 --  XXXX
 23 -- x     2312 --  X
 24 -- y    13164 --  XXXXXX
 25 -- z     3478 --  X

You will notice something strange: the letter ‘s’ seems to occur about as much as the letter ‘e’, or even a bit more. This goes against what Mr. Legrand had stated. An exercise below will address this issue.

34.2.4.2. A work by Shakespeare

King Lear by William Shakespeare can be found at http://www.gutenberg.org/ebooks/1128 and the plain text version is at: http://www.gutenberg.org/cache/epub/1128/pg1128.txt

We can download its text and analyze it with:

wget --output-document king-lear.txt http://www.gutenberg.org/cache/epub/1128/pg1128.txt
python3 ./letter_frequency.py king-lear.txt
...
====== frequency histogram for king-lear.txt ======
  0 -- a     8373 --  XXXXXXXXXXXXXXXXXXXXXXXXX
  1 -- b     1734 --  XXXXX
  2 -- c     2602 --  XXXXXXXX
  3 -- d     4519 --  XXXXXXXXXXXXX
  4 -- e    16159 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  5 -- f     2512 --  XXXXXXX
  6 -- g     2580 --  XXXXXXX
  7 -- h     7046 --  XXXXXXXXXXXXXXXXXXXXX
  8 -- i     7300 --  XXXXXXXXXXXXXXXXXXXXXX
  9 -- j       89 --  
 10 -- k     1290 --  XXX
 11 -- l     5150 --  XXXXXXXXXXXXXXX
 12 -- m     3244 --  XXXXXXXXXX
 13 -- n     7317 --  XXXXXXXXXXXXXXXXXXXXXX
 14 -- o     9751 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 15 -- p     1841 --  XXXXX
 16 -- q       78 --  
 17 -- r     7703 --  XXXXXXXXXXXXXXXXXXXXXXX
 18 -- s     7150 --  XXXXXXXXXXXXXXXXXXXXXX
 19 -- t    10796 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 20 -- u     4494 --  XXXXXXXXXXXXX
 21 -- v      481 --  X
 22 -- w     2726 --  XXXXXXXX
 23 -- x      120 --  
 24 -- y     2794 --  XXXXXXXX
 25 -- z       34 --

34.2.4.3. A contemporary English book

Looking at the books on Project Gutenberg sorted by release date at

https://www.gutenberg.org/ebooks/search/?sort_order=release_date

we can pick “An Ocean Tragedy” by William Clark Russell and download its text and then run our program with:

wget --output-document ocean-tragedy.txt https://www.gutenberg.org/files/56363/56363-0.txt
python3 ./letter_frequency.py ocean-tragedy.txt
...
====== frequency histogram for ocean-tragedy.txt ======
  0 -- a    64744 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  1 -- b    12390 --  XXXXXX
  2 -- c    18785 --  XXXXXXXXX
  3 -- d    34787 --  XXXXXXXXXXXXXXXXXX
  4 -- e    95677 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  5 -- f    20728 --  XXXXXXXXXX
  6 -- g    18775 --  XXXXXXXXX
  7 -- h    53974 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXX
  8 -- i    59395 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  9 -- j      973 --  
 10 -- k     7401 --  XXX
 11 -- l    34322 --  XXXXXXXXXXXXXXXXX
 12 -- m    19202 --  XXXXXXXXXX
 13 -- n    54177 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXX
 14 -- o    60199 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 15 -- p    13126 --  XXXXXX
 16 -- q      820 --  
 17 -- r    44082 --  XXXXXXXXXXXXXXXXXXXXXXX
 18 -- s    51953 --  XXXXXXXXXXXXXXXXXXXXXXXXXXX
 19 -- t    71954 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 20 -- u    22794 --  XXXXXXXXXXX
 21 -- v     6639 --  XXX
 22 -- w    21089 --  XXXXXXXXXXX
 23 -- x     1308 --  
 24 -- y    14615 --  XXXXXXX
 25 -- z      586 --

34.2.4.4. Conclusion: signatures

Take a close look at the three histograms, the one from /usr/share/dict/words has a certain feeling

34.2.4.5. Exercises

Exercise 34.3

Discuss with your fellow students why the histogram of letters in /usr/share/dict/words does not resemble that in actual books.

Exercise 34.4

Are there significant differences in the letter frequency between Shakespeare and the modern book? Discuss this with your fellow students.

Exercise 34.5

Study the differnce in letter frequency distributions between short and long bits of text. Take a few single-paragraph files and look at their letter distribution and compare it to what you got with a full book.

Exercise 34.6

Look at books in other languages and see what the letter distribution is there. Can you identify the language a book was written in just by looking at the distribution of letters? Note: the program letter_frequency.py will run on a european language file that has accented letters (like Italian or German or French) which are not part of ASCII, but it will do so by dropping all those characters. You will want to adapt it to handle those special accented characters.

34.2.5. Applying the frequency analysis to a message

Let’s do this as a sequence of exercises:

Exercise 34.7

First type a reasonably long message in English into a file. Then run your caesar.py or substitution.py on that file to encrypt it and save the output to another file.

Exercise 34.8

Email that file to a “friend” and to a hypothetical “foe”. Only tell your friend what the Caesar shift or the substitution map is, and see if the foe can crack the code.

Exercise 34.9

Alternate being writer, friend and foe so that everyone in your group gets to decrypt a message.

Exercise 34.10

Is it easier to decode short or long messages?

Exercise 34.11

The foe should try to crack the code without knowing what the substitution table is. How can she use the letter_frequency.py program to do this?

34.3. Strong crypto

We have seen how easy it is to break simple substitution cyphers. Those patterns in the distribution of letters are a clear sign that we are not encrypting very intelligently.

Between the 1880s and 1910s stronger methods of encryption were invented. The one time pad is a theoretically unbreakable method of encryption. Both sender and recipient have the same random sequence of symbols, and each character they send gets folded in some way with the next symbol in the one time pad.

We will implement something similar to the one time pad, but first we need to review some aspects of binary representation of numbers.

34.3.1. Binary numbers, XOR, hiding the byte

Remember that characters are represented by their ASCII codes. The ASCII code for ‘g’ is 103 in decimal, 0x67 in hex, and 01100111 in binary.

We want to find a way to fold ‘g’ with some random byte (remember: a byte is an 8-bit number). Let’s choose 01101010. We could fold the two together in many different ways, but we want to then be able to separate them back out.

The approach that is usually used is to use the XOR operator. What is this?

We will apply the various logic operators to bits: if 0 means false and 1 means true, then we have the following truth tables:

0 OR 0 = 0
0 OR 1 = 1
1 OR 0 = 1
1 OR 1 = 1

0 AND 0 = 0
0 AND 1 = 0
1 AND 0 = 0
1 AND 1 = 1

NOT 0 = 1
NOT 1 = 0

Sometimes people find the behavior of OR to be counter-intuitive because “1 OR 1 = 1”. People sometimes remember their parents saying “you can have cake OR you can have ice cream but not both”. That violates “1 OR 1 = 1”, but parents don’t usually accept this logic. For parents and cryptographers we have another logical operation called XOR (exclusive OR). The truth table for XOR is:

0 XOR 0 = 0
0 XOR 1 = 1
1 XOR 0 = 1
1 XOR 1 = 0

This was done one bit at a time. Suppose you have the character ‘g’ (01100111) and our pattern 01101010. The XOR of those two is done bit-by-bit:

01100111  ('g')
  XOR
01101010  (number from one time pad)
   |
   v
00001101  (encrypted number)

The cool thing about the XOR function is that if you XOR with the same number again you get back the original: all those bits get flipped back:

01100111 XOR 01101010 -> 00001101 (the encrypted number)
00001101 XOR 01101010 -> 01100111 (back to 'g'!)

What does this mean? It means that if you know the secret number you can encrypt your numbers, and you can also decrypt them! And someone who does not have have the secret number cannot make any progress.

34.3.2. Manipulating bits in python

Python offers us the “bitwise xor” operator which uses the symbol ^

Let us see examples of how to use it:

## in the python interpreter, at the >>> prompt, type the following
x = int('01100111', 2)
y = int('01101010', 2)
z = x ^ y
print('%d ^ %d = %d' % (x, y, z))
print('0x%x ^ 0x%x = 0x%x' % (x, y, z))
print('%s ^ %s = %s' % (bin(x), bin(y), bin(z)))
print('to restore the original number:')
print('%s ^ %s = %s' % (bin(x), bin(z), bin(z^y)))
print(bin(x), bin(z^y), x == z^y)

34.3.3. Revisiting random number generators

We have explored random numbers before, but our angle here is slightly different. We are now interested in how two different people can get the same stream of random numbers.

First let us look at an example of how to get the first ten numbers in a stream. At the Python interpreter type:

import random
for i in range(10):
     print(i, random.randint(0, 255))

Now run the same thing in a different terminal. You will see that the two streams are quite different, even though you are using the same random number generator in both examples.

The reason for the difference is that the starting point for the sequence is different. This starting point is called the random number seed, and by default it is set to a value determined by some real time event, like the current wall clock time.

But almost all random number generators allow you to set the seed to a given value, so now let us try running the following code in two different terminals:

import random
random.seed(7419)
for i in range(10):
    print(i, random.randint(0, 255))

The output will be the same for both: when we set an identical seed, the random sequences are identical.

Note that the seed (which I set to 7419) could be any integer – each different seed will produce a different sequence, but the if you use that integer twice then you get the same sequence.

This setting of the random number seed will be crucial to allowing a person to decrypt our message.

34.3.4. Implementing tougher encryption

The idea is be to generate a stream of random numbers, use them to encrypt a stream of bytes with the XOR operation. Later we will use that same sequence to decrypt the encrypted bytes.

The program which will encrypt text is pad_encrypt.py in Listing 34.3.1.

Listing 34.3.1 pad_encrypt.py – Simple encrypting program which uses a one-time pad (in our case a sequence of random numbers) to encrypt a file.
#! /usr/bin/env python3
import sys
import random

def main():
    ## we need a known sequence, so we use a known random number seed
    seed = 1234
    random.seed(seed)
    fname = ''
    if len(sys.argv) > 2:
        print('error: usage is "%s [filepath]"' % sys.argv[0])
        sys.exit(1)
    elif len(sys.argv) == 2:
        fname = sys.argv[1]
        f = open(fname, 'r')
    else:
        fname = 'standard-input'
        f = sys.stdin
    ## now process the file
    encrypted_text = ''
    for c in f.read():
        # print(c, ord(c), bin(ord(c)))
        ascii_value = ord(c)    # turn char into an integer
        if ascii_value > 255:
            continue
        next_random = random.randint(0, 255) # get next random number
        ## find the new character by XORing it with the random number
        crypt_char = ascii_value ^ next_random
        ## add the encrypted character to the stream of encrypted text
        encrypted_text = encrypted_text + chr(crypt_char)
    dump_bytes_as_hex(encrypted_text)

def dump_bytes_as_hex(bytes, show_ascii=False):
    """print out a stream of bytes as 2-digit hex bytes; optionally
    also show if there are valid ASCII substrings"""
    hex_line = ''
    ascii_line = ''
    for i in range(len(bytes)):
        if i > 0 and i % 15 == 0:
            print_line(hex_line, ascii_line, show_ascii)
            hex_line = ''
            ascii_line = ''
        hex_line += ' 0x%02x' % ord(bytes[i])
        if bytes[i].isalnum():
            ascii_line += bytes[i]
        else:
            ascii_line += '.'
    ## print the final bytes
    print_line(hex_line, ascii_line, show_ascii)

def print_line(hex_line, ascii_line, show_ascii):
    print(hex_line, end="")
    if show_ascii:
        ## attempt at being "clever": print 4 spaces plus some extra
        ## if the line is shorter, so that the printable part is
        ## always aligned
        print(' '*(4 + 5*(15 - len(ascii_line))), end="")
        print(ascii_line, end="")
    print()

main()

You can look at the result of encrypting a small file by typing a paragraph into a text file and running pad_encrypt.py filename and for very short snippets you can do something like echo "just three words" | python3 pad_encrypt.py and get:

 0x8b 0x4e 0x70 0x5a 0x31 0x5e 0x5a 0xc7 0x1c 0x6d 0x2f 0x7f 0xde 0x85 0x89
 0x3f 0x24

As you can see the program pad_encrypt.py outputs the encrypted bytes as hex numbers that look like 0x5a. This is because these encrypted bytes would seldom be readable as part of a string: most of them would not be ascii values for anything printable.

34.3.5. Decrypting this tougher encryption

The program to decrypt text encoded by pad_encrypt.py is pad_decrypt.py in Listing 34.3.2.

Listing 34.3.2 pad_decrypt.py – Simple decrypting program which uses a one-time pad (in our case a sequence of random numbers) to encrypt a file.
#! /usr/bin/env python3
import sys
import random

def main():
    ## we need a known sequence, so we use a known random number seed
    seed = 1234
    random.seed(seed)
    fname = ''
    if len(sys.argv) > 2:
        print('error: usage is "%s [filepath]"' % sys.argv[0])
        sys.exit(1)
    elif len(sys.argv) == 2:
        fname = sys.argv[1]
        f = open(fname, 'r')
    else:
        fname = 'standard-input'
        f = sys.stdin
    encrypted_letter_count = [0]*256
    decrypted_letter_count = [0]*26
    decrypted_text = ''
    for line in f.readlines():
        words = line.split()
        ## extract the encrypted characters
        for word in words:
            crypt_char = int(word, 16)
            encrypted_letter_count[crypt_char] += 1 # track a histogram
            next_random = random.randint(0, 255) # get next random number
            ascii_value = crypt_char ^ next_random
            if ascii_value >= ord('a') and ascii_value <= ord('z'):
                decrypted_letter_count[ascii_value-ord('a')] += 1 # track a histogram
            decrypted_char = chr(ascii_value)
            ## add the decrypted character to the stream of decrypted text
            decrypted_text = decrypted_text + decrypted_char
    from letter_frequency import print_frequency_histogram
    print_frequency_histogram('king lear encrypted',
                              encrypted_letter_count)
    print()
    print_frequency_histogram('king lear decrypted',
                              decrypted_letter_count, offset=ord('a'))
    ## now write the output to a file ending in _decrypted
    fout = fname + '_decrypted'
    open(fout, 'w').write(decrypted_text)
    print('output written to %s' % fout)

def dump_bytes_as_hex(bytes, show_ascii=False):
    """print out a stream of bytes as 2-digit hex bytes; optionally
    also show if there are valid ASCII substrings"""
    hex_line = ''
    ascii_line = ''
    for i in range(len(bytes)):
        if i > 0 and i % 15 == 0:
            print_stuff(hex_line, ascii_line, show_ascii)
            hex_line = ''
            ascii_line = ''
        hex_line += ' 0x%02x' % ord(bytes[i])
        if bytes[i].isalnum():
            ascii_line += bytes[i]
        else:
            ascii_line += '.'
    ## print the final bytes
    print_stuff(hex_line, ascii_line, show_ascii)

def print_stuff(hex_line, ascii_line, show_ascii):
    print(hex_line, end="")
    if show_ascii:
        print(' '*(4 + 5*(15 - len(ascii_line))), end="")
        print(ascii_line, end="")
    print()

if __name__ == '__main__':
    main()

You will notice that pad_decrypt.py has an interesting feature: it uses the function print_frequency_histogram() from letter_frequency.py to show us the frequency histogram for both the encrypted and decrypted text:

...
====== frequency histogram for king lear encrypted ======
  0 --        660 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  1 --        549 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  2 --        579 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  3 --        591 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  4 --        618 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  5 --        673 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  6 --        631 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  7 --        612 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  8 --        611 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  9 --        603 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 10 --        648 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
...
249 --        598 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
250 --        611 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
251 --        612 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
252 --        615 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
253 --        570 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
254 --        647 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
255 --        578 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

====== frequency histogram for king lear decrypted ======
  0 -- a     7882 --  XXXXXXXXXXXXXXXXXXXXXXXXX
  1 -- b     1345 --  XXXX
  2 -- c     2220 --  XXXXXXX
  3 -- d     4274 --  XXXXXXXXXXXXX
  4 -- e    15707 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  5 -- f     2050 --  XXXXXX
  6 -- g     2064 --  XXXXXX
  7 -- h     6733 --  XXXXXXXXXXXXXXXXXXXXX
  8 -- i     6345 --  XXXXXXXXXXXXXXXXXXXX
  9 -- j       85 --  
 10 -- k     1039 --  XXX
 11 -- l     4586 --  XXXXXXXXXXXXXX
 12 -- m     2948 --  XXXXXXXXX
 13 -- n     7042 --  XXXXXXXXXXXXXXXXXXXXXX
 14 -- o     9503 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 15 -- p     1602 --  XXXXX
 16 -- q       61 --  
 17 -- r     7450 --  XXXXXXXXXXXXXXXXXXXXXXX
 18 -- s     6516 --  XXXXXXXXXXXXXXXXXXXX
 19 -- t    10001 --  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 20 -- u     4442 --  XXXXXXXXXXXXXX
 21 -- v      414 --  X
 22 -- w     2320 --  XXXXXXX
 23 -- x      118 --  
 24 -- y     2663 --  XXXXXXXX
 25 -- z       31 --  
output written to king-lear-encrypted-pad.txt_decrypted

You should be delighted at how the encrypted text has no discernible pattern in its frequency distribution, while the decrypted version has the pattern we are familiar with.

This means that one cannot decrypt this code using frequency analysis.

34.4. Further reading

34.4.1. Technical details

34.4.2. Historical

  • https://en.wikipedia.org/wiki/Venona_project which contains the statement “[…] Due to a serious blunder on the part of the Soviets, some of this traffic was vulnerable to cryptanalysis. The Soviet company that manufactured the one-time pads produced around 35,000 pages of duplicate key numbers, as a result of pressures brought about by the German advance on Moscow during World War II. […]”

  • “The Code Book”, by Simon Singh

34.4.3. Videos