21. Programming topics: sorting

[status: content-mostly-written]

21.1. Motivation, prerequisites, plan

There are several reasons to spend some time on sorting. Apart from the usefulness of sorting a list of data, we will study it because:

  • It illustrates programming techniques.

  • We will go from an intuitive understanding of sorting to how that can be expressed as an algorithm.

  • It will give us a simple and yet rich example of how to study the computational complexity of an algorithm: how the execution time grows with the size of the data we work with.

Prerequisites:

  • The 10-hour “serious programming” course.

Our plan is to start with an intuitive discussion of what sorting is, and to distill an algorithm out of that. We will look in to sorted insertion, and then into the bubblesort algorithm.

21.2. Experiment: a game of cards

Start by dealing out a game of cards. We can start with Poker, or bridge if you and the other students have already seen it.

When you receive your hand you should:

  1. Describe their algorithm to your neighbor in class.

  2. Compare what happens when you sort 4 cards to when you sort 8 or 13 cards.

  3. Take notes on paper on how you sorted your hand so as to see all the features in it.

21.3. Intuition to algorithm on card sorting

Take your description of the algorithm and try to write a python function to implement that sorting. We will do so in the framework of a small program that is ready to go as soon as you drop in your sorting function.

21.4. Writing up the algorithm in Python

I show here a simple algorithm for sorting, which might look a lot like what you have come up with. The Python function is called sort_myveryown()

def sort_myveryown(l):
    """This function implements my very own sorting algorithm, which
    is basically an insertion sort.  It takes the list l, sorts
    it, and returns the sorted result."""
    ## iterate through the whole list, skipping element 0
    for i in range(1, len(l)):
        ## having fixated on element i, look at all the ones
        ## *before* i and see if they are bigger and should
        ## be placed *after* i
        for j in range(i):
            if l[i] < l[j]:
                ## this is the case where l[j] is bigger.  since
                ## all the j come *before* the i this means that
                ## the sort is incorrect and we must swap i and j
                # print('SWAP: %d/%d  <--> %d/%d' % (i, l[i], j, l[j]))
                l[j], l[i] = l[i], l[j]
    return l

Look at the program in Listing 21.4.1 and put your own code in for the function sort_myveryown(). You can look up “insertion sort” in wikipedia for a very simple description of the algorighm, and their “pseudocode” (a description of the algorithm that is not in any specific programming language) can be translated to Python quite easily. If you feel that your procedure was different from insertion sort then adapt the code to match what you did.

Listing 21.4.1 sort_frame.py – Framework for trying out sorting algorithms. You can drop your own into the function sort_myveryown()
#! /usr/bin/env python3

"""This is a framework into which you can drop your Python sorting
routine.  The function sort_myveryown() is empty (it just returns the
original list) and you can use it to put in your own sorting
function.

Two other functions are provided for comparison: one is a
hastily-written bubblesort routine, the other is the built-in python
list sorting method.

"""

import random
import inspect

def main():
    N = 20
    run_tests(N)

def run_tests(N):
    l_presorted = list(range(N))

    ## l_random will have random numbers with a non-fixed seed
    random.seed(None)
    l_random = [0]*N
    for i in range(len(l_random)):
        l_random[i] = random.randint(0, 99)
    random.seed(1234)

    ## l_random_fixed will always have the same random numbers
    l_random_fixed = [0]*N
    for i in range(len(l_random_fixed)):
        l_random_fixed[i] = random.randint(0, 99)

    ## l_turtle is designed to do quite poorly with some algorithms:
    ## it has a small value at the end
    l_turtle = list(range(1, N-1))
    l_turtle.append(0)

    list_name_dict = {'l_presorted' : l_presorted,
                      'l_random_fixed' : l_random_fixed,
                      'l_random' : l_random,
                      'l_turtle' : l_turtle}

    for algo in (sort_quicksort, sort_python_builtin, sort_bubble, sort_myveryown):
        print('algorithm: %s' % algo.__name__)
        for which_list in list_name_dict.keys():
            print('    list: %s' % which_list)
            l_before = list_name_dict[which_list]
            l_sorted = algo(list_name_dict[which_list])
            print('              ', l_before, ' ----> ', l_sorted)

def sort_myveryown(l):
    ## FIXME: must insert my very own sorting function here
    return l

def sort_bubble(l):
    l2 = l[:]
    for i in range(len(l)):
        for j in range(len(l)-1):
            if l2[j] > l2[j+1]:
                l2[j], l2[j+1] = l2[j+1], l2[j]
    return l2


def sort_quicksort(l):
    l2 = l[:]
    do_quicksort(l2, 0, len(l)-1)
    return l2

def do_quicksort(l, low, high):
    if low < high:
        p = do_qsort_partition(l, low, high)
        do_quicksort(l, low, p-1)
        do_quicksort(l, p+1, high)

def do_qsort_partition(l, low, high):
    pivot = l[high]
    i = low-1
    for j in range(low, high):
        if l[j] < pivot:
            i = i + 1
            l[i], l[j] = l[j], l[i]
    if l[high] < l[i+1]:
        l[i+1], l[high] = l[high], l[i+1]
    return i+1

def sort_python_builtin(l):
    l2 = l[:]
    l2.sort()
    return l2

main()

21.5. Profiling the algorithm

21.5.1. Modify the program to print information

The term “profiling” is used in software to mean “figuring out how much time a computer program spends in each of its functions. This is an important tool to figure out how to improve performance.

As you have seen, the two operations which are called repeatedly in a sorting algorithm are comparison and swap. This means that the most important part of our profiling work will be to count how many times we compare two values in the list and how many times we swap two values in the list. Each algorithm will do differently on those two issues.

The program in Listing 21.5.1 gives a “framework” in which you can drop your own function and it will count how many times it does a comparison or a swap. It then prints out the results in a form that can be easily plotted.

The way the program does this is to have two functions: increment_comparisons() and increment_swaps(). We will call these functions every time we do a comparison or a swap.

The program sort_frame_profiling.py then prints out the number of comparisons and number of swaps, as well as the size of the list, for many different list sizes and for all the algorithms we use.

Listing 21.5.1 sort_frame_profiling.py – Framework for profiling sorting algorithms. You can drop your own into the function sort_myveryown() and add the increment_swaps() and increment_comparisons() calls to profile the complexity of each algorithm. Download it by clicking here: sort_frame_profiling.py
#! /usr/bin/env python3

"""This is a framework for putting in your own sorting function, but
it also has hooks for profiling the sorting function by counting the
number of comparisons and swaps.

"""

import random
import sys

# count_dict = {}

comparisons = 0
swaps = 0

def main():
    ## see if the user gave a command line argument for the list length
    if len(sys.argv) > 1:
        N_MAX = int(sys.argv[1])
    else:
        N_MAX = 100
    for N in range(N_MAX):
        run_sort_algorithms(N)

def run_sort_algorithms(N):
    l_presorted = list(range(N))

    ## l_random will have random numbers with a non-fixed seed
    random.seed(None)
    l_random = [0]*N
    for i in range(len(l_random)):
        l_random[i] = random.randint(0, 100)

    ## l_random_fixed will always have the same random numbers
    random.seed(1234)
    l_random_fixed = [0]*N
    for i in range(len(l_random_fixed)):
        l_random_fixed[i] = random.randint(0, 100)
    random.seed(None)           # return the seed to be None

    ## l_turtle is designed to do quite poorly with some algorithms:
    ## it has a small value at the end
    l_turtle = list(range(1, N-1))
    l_turtle.append(0)

    ## here is the list of names of initial list layout, mapped to the
    ## actual lists
    list_name_dict = {'l_presorted' : l_presorted,
                      'l_random_fixed' : l_random_fixed,
                      'l_random' : l_random,
                      'l_turtle' : l_turtle}

    for algo in (sort_python_builtin, sort_bubble, sort_myveryown, sort_quicksort):
        run_sort_single_algorithm(algo, N, list_name_dict)

def run_sort_single_algorithm(algo, N, list_name_dict):
    reset_stats(list_name_dict)
    # print('algorithm: %s' % algo.__name__)
    for which_list in list_name_dict.keys():
        # print('list: %s' % which_list)
        l_before = list_name_dict[which_list]
        l_sorted = algo(list_name_dict[which_list], which_list)
        # print('              ', l_before, ' ----> ', l_sorted)
        # print_stats(N, algo.__name__, which_list)
    print_stats(N, algo.__name__, list_name_dict)

def sort_myveryown(l, list_type):
    ## FIXME: must insert my very own sorting function here
    return l

def sort_bubble(l, list_type):
    l2 = l[:]
    for i in range(len(l)):
        for j in range(len(l)-1):
            increment_comparisons(list_type)
            if l2[j] > l2[j+1]:
                increment_swaps(list_type)
                l2[j], l2[j+1] = l2[j+1], l2[j]
    return l2

def sort_quicksort(l, list_type):
    l2 = l[:]
    do_quicksort(l2, 0, len(l)-1, list_type)
    return l2

def do_quicksort(l, low, high, list_type):
    if low < high:
        p = do_qsort_partition(l, low, high, list_type)
        do_quicksort(l, low, p-1, list_type)
        do_quicksort(l, p+1, high, list_type)

def do_qsort_partition(l, low, high, list_type):
    pivot = l[high]
    i = low-1
    for j in range(low, high):
        increment_comparisons(list_type)
        if l[j] < pivot:
            i = i + 1
            increment_swaps(list_type)
            l[i], l[j] = l[j], l[i]
    increment_comparisons(list_type)
    if l[high] < l[i+1]:
        increment_swaps(list_type)
        l[i+1], l[high] = l[high], l[i+1]
    return i+1

def sort_python_builtin(l, list_type):
    """Use the built-in sorting function provided by Python.  Note that
    since we don't write the innards of this function, we cannot keep
    track of how many comparisons and swaps it does.  We do know that
    Python's built-in list.sort() and sorted() functions use the
    Timsort algorithm which is a modified version of merge sort which
    uses insertion sort to arrange the list of items into conveniently
    mergeable sections.  An exercise in the text discusses figuring out
    how to count comparisons in this algorithm."""
    return sorted(l)

def reset_stats(list_name_dict):
    """Resets the counts of comparisons and swaps."""
    global comparisons
    global swaps
    comparisons = {l_type: 0 for l_type in list_name_dict.keys()}
    swaps = {l_type: 0 for l_type in list_name_dict.keys()}

def increment_comparisons(list_type):
    """Increment the counter for the number of comparisons of this
    type of list."""
    global comparisons
    comparisons[list_type] += 1

def increment_swaps(list_type):
    """Increment the counter for the number of swaps of this type 
    of list."""
    global swaps
    swaps[list_type] += 1

def print_stats(N, algo_name, list_name_dict):
    """Print a line with statistics on how this algorithm performs for the
    various lists with length N"""
    global comparisons
    global swaps
    list_types = sorted(list_name_dict.keys())
    ## open the file to write out this data
    fname = algo_name + '.out'
    if N == 0:
        ## first time around we zero out the file and write a header line
        f = open(fname, 'w')
        print('Starting to write to file %s' % fname)
        f.write('## ALGO: %s\n' % algo_name)
        f.write('## COMMENT: columns are "iter", "number-of-comparisons",\n')
        f.write('## COMMENT: "number-of-swaps" for various types of lists\n')
        f.write('## iter')
        for l_type in list_types:
            f.write('   %s  ' % l_type)
        f.write('\n')
    else:
        f = open(fname, 'a')
    f.write('%5d' % N)
    for l_type in list_types:
        # print('ALGO_%s--LIST_%s--comparisons--swaps: %d %d %d'
        #       % (algo_name, l_type, N, comparisons[l_type], swaps[l_type]))
        f.write('   %5d  %5d' % (comparisons[l_type], swaps[l_type]))
    f.write('\n')               # finish this line of data
        
    f.close()

main()

21.5.2. Run the program and make plots

When you run

mkdir sort-stats            ## let's be tidy abogut where are files are
cd sort-stats
python3 sort_frame_profiling.py

you will notice that it takes a while to run (depending on the value of N_MAX in main()), and when it is done you will have a few files called sort_ALGO.out for each of the algorithms.

You don’t have to wait for it to finish! You can open another terminal and type:

cd sort-stats
ls -lsat | head             ## look at recently modified files
wc sort_*.out               ## see how many lines each file has
head sort_*.out             ## look at the start of each file
tail sort_*.out             ## look at the end of each file

You can repeat that tail command several times as the run goes on and you will get a feeling for the progress being made.

You can also make plots of the performance of the algorithms. Run gnuplot:

$ gnuplot

and in gnuplot run the following instructions:

Listing 21.5.2 Plot bubblesort and quicksort timing.
set grid
set xlabel 'N (size of list)'
set ylabel 'number of operations'
plot 'sort_bubble.out' using 1:4 with lines title 'bubble/random/comparisons', \
 'sort_bubble.out' using 1:5 with lines title 'bubble/random/swaps', \
 'sort_quicksort.out' using 1:4 with lines title 'quicksort/random/comparisons', \
 'sort_quicksort.out' using 1:5 with lines title 'quicksort/random/swaps'
../_images/plot-bubble-quick.svg

Figure 21.5.1 The performance of bubblesort and quicksort on a random initial list. Note that bubblesort operations grow as \(n^2\) while quicksort operations grow as \(n \times log(n)\), which is a much slower growth. This kind of plot shows how the number of operations in an algorithm depends on the size of the problem. This dependence is called the “computational complexity” of an algorithm.

21.5.3. How do we understand these plots?

[TODO] Make some plots in gnuplot of x**2 and x log(x), then introduce two bits of terminology: “computational complexity” and “asymptotic behavior”.

21.5.4. Exercises

Exercise 21.1

Modify sort_frame_profiling.py to count the number of comparisons made by Python’s built-in sort function. One way to do this is to use an argument to l.sort(my_compare) which replaces Python’s built-in comparison function. You could then write your own my_compare() function which would call increment_comparison(l_type).

Exercise 21.2

Research whether it is possible to count the number of swaps made by Python’s built-in sort function.

21.6. Computational complexity

Make plots versus N and compare to \(N^2\) and \(N log(N)\)

21.7. Further reading

The wikipedia articles on various sorting methods are worth a read:

There is also a web site with animations of several sorting algorithms:

A youtube video that gives video and audio animation of a sort: