22. Birthday paradox

[status: usable-but-incomplete]

22.1. To get started

We start by going around the room and seeing how many people there are. Then we ask everyone to estimate “what’s the chance that at least two people have the same birthday?”

We look at the estimates and see how they vary, and ask students why they picked the probability they did.

Then we talk about picking one person and asking what’s the probability that someone else has their same birthday.

How is that a different question, and would the probability is the same, smaller, or bigger?

22.2. A practical demonstration

Look at the code in Listing 22.2.1:

Listing 22.2.1 Simulate a party with several people and calculate the probability that two of them share a birthday.
#! /usr/bin/env python3

"""Simulates the birthday paradox.  You set how many people
are at the party.  The program will assign a random birthday
to each person, and then calculate how mahy duplicate
birthdays are in that group."""

import random

def main():
    n_people = 25
    birthdays = make_birthdays(n_people) # randomly distributed
    print('## Single example with %d people:' % n_people)
    print('## birthdays:', birthdays)
    print('## birthdays_sorted:', sorted(birthdays))
    n_doubles, n_triplets = count_multiples(birthdays)
    print(f'## doubles: {n_doubles}   triplets: {n_triplets}')
    # return                      # stop here for the simplest analysis
    n_tries = 200
    print('## running %d tries' % n_tries)
    for n_people in range(100):
        avg_doubles, avg_triplets = get_average_multiples(n_people, n_tries)
        print(f'n_people: {n_people} -- frac_doubles: {avg_doubles}   frac_triplets: {avg_triplets}')

def make_birthdays(n_people):
    """Generate a birthday for each person, but we do it the easy way: a
    number from 1 to 365, so we don't handle leap years.
    """
    # the list "birthdays" will store the birthday of each person
    birthdays = [0]*n_people
    for person in range(n_people):
        day = random.randint(1, 365) # generate random b-day
        birthdays[person] = day      # store it for that person
    return birthdays

def count_multiples(bdays):
    """Goes through the birthday list and sees if any two people
    have the same birthday.  Returns how many times we find
    duplicate birthdays."""
    n_people = len(bdays)
    count_doubles = 0
    count_triplets = 0
    for person in range(n_people):
        this_bday = bdays[person]
        # we can look at just the "further on" entries in this list
        # since we have already looked for duplicates before this point.
        for other_dude in range(person + 1, n_people):
            other_bday = bdays[other_dude]
            # now see if two people have the same birthday
            if this_bday == other_bday:
                # we have a duplicate!
                count_doubles += 1
                # now look for a third dude; once
                # again I can look beyond this point
                for third_dude in range(other_dude + 1, n_people):
                    third_bday = bdays[third_dude]
                    if third_bday == other_bday:
                        count_triplets += 1 # I have a third!!
    return count_doubles, count_triplets
            

def get_average_multiples(n_people, n_tries):
    """Estimates an expectation of finding people with the same birthday.
    Returns the probability of two and of three people having the same
    birthday.
    """
    n_with_doubles = 0
    n_with_triplets = 0
    for i in range(n_tries):
        bdays = make_birthdays(n_people)
        n_doubles, n_triplets = count_multiples(bdays)
        if n_doubles >= 1:
            n_with_doubles += 1
        if n_triplets >= 1:
            n_with_triplets += 1
    avg_doubles = (1.0*n_with_doubles) / n_tries
    avg_triplets = (1.0*n_with_triplets) / n_tries
    return avg_doubles, avg_triplets


main()

The result of running birthdays.py: this calculates the probability that two people at a party will share the same birthday for party. We put in population sizes of 0 to 50, but you can obviously extend that all the way to 365 or more.

We can plot this output with:

python3 birthdays.py > bday_prob.out

and plot with:

gnuplot
# then at the gnuplot> prompt type:
plot 'bday_prob.out' using 2:7 with linespoints

22.3. The theory

The calculation is a bit involved, but it is a very good example of the field of combinatorics, and it teaches us a lot about how to count things. Since this example was clearly counterintuitive, the exercise of counting well is a good one.

The simplifying idea is to first calculate the opposite: the probability that no two people have the same birthday.

Imagine a sequence of events in which people enter an empty room.

For one person the probability of no duplicate birthdays is 1, which is the same as \(365/365 = 1.0\).

Let us say you have 2 people, then the probability that the second person does not have the same birthday as the first is \(364/365 \approx 0.99726\).

Add a 3rd person and you have \(363/365 \approx 0.99452\) chance that the newcomer’s birthday does not match one of the other two.

To get the probability of no matches with 3 people you take the join probability of the 3 situations: 1.0 when the first person enters the room, times the probabilities of no matches with each successive entry into the room:

\[\begin{split}P_{\rm no\_dup}(n) = \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \approx 0.991795 \\ P_{\rm duplicates}(n) = 1 - P({\rm no\_dup}) \approx 1 - 0.991795 \approx 0.00820417\end{split}\]

The general formula after \(n\) people have entered the room is:

\[\begin{split}P_{\rm no\_dup}(n) & = & \frac{n}{n} & \times \frac{n-1}{n} \times \frac{n-2}{n} \times \dots \times \frac{1}{365} \\ & = & (space) & \frac{365 \times (365-1) \times \dots \times (365 - n + 1)}{365^n}\end{split}\]

Using the definition of factorial, we see that the numerator is:

\[\begin{split}{\rm numerator} = & 365 \times 364 \times 363 \times ... \times (365-n+1) = \frac{365!}{(365-n)!} \\ & = n! \times { 365 \choose n}\end{split}\]

where we used the mathematical “choose” notation:

\[\begin{split}{n \choose k} & = & \frac{n (n-1) \dots (n - k + 1)}{k (k-1) \dots 1} \\ & = & \frac{n!}{k!(n-k)!}\end{split}\]

So our formula is:

\[\begin{split}P_{\rm no\_dup}(n) = \frac{n! \times {365 \choose n}}{365^n} \\ P_{\rm duplicates}(n) = 1 - \frac{n! \times {365 \choose n}}{365^n}\end{split}\]

Do we believe this? Well, let us compare it to our monte carlo calculation for the tipping point of \(n = 23\):

\[P_{\rm duplicates}(23) = 1 - \frac{365! \times {365 \choose 23}}{365^{23}} \approx 1 - 0.492703 \approx 0.507297\]

22.4. Take-home

What we have learned:

  • Simulate a situation (in this case people sharing birthdays).

  • Calculate the probability of an event with a random component. We do this by running the event many times and averaging the outcome.

  • Jargon: you could think of this as a very simple example of a “monte carlo” simulation.