23. Programming topics: sorting
[status: content-mostly-written]
23.1. Motivation, prerequisites, plan
There are several reasons to spend some time on sorting. Apart from the usefulness of sorting a list of data, we will study it because:
It illustrates programming techniques.
We will go from an intuitive understanding of sorting to how that can be expressed as an algorithm.
It will give us a simple and yet rich example of how to study the computational complexity of an algorithm: how the execution time grows with the size of the data we work with.
Prerequisites:
The 10-hour “serious programming” course.
Our plan is to start with an intuitive discussion of what sorting is, and to distill an algorithm out of that. We will look in to sorted insertion, and then into the bubblesort algorithm.
23.2. Experiment: a game of cards
Start by dealing out a game of cards. We can start with Poker, or bridge if you and the other students have already seen it.
When you receive your hand you should:
Describe their algorithm to your neighbor in class.
Compare what happens when you sort 4 cards to when you sort 8 or 13 cards.
Take notes on paper on how you sorted your hand so as to see all the features in it.
23.3. Intuition to algorithm on card sorting
Take your description of the algorithm and try to write a python function to implement that sorting. We will do so in the framework of a small program that is ready to go as soon as you drop in your sorting function.
23.4. Writing up the algorithm in Python
I show here a simple algorithm for sorting, which might look a lot
like what you have come up with. The Python function is called
sort_myveryown()
def sort_myveryown(l):
"""This function implements my very own sorting algorithm, which
is basically an insertion sort. It takes the list l, sorts
it, and returns the sorted result."""
## iterate through the whole list, skipping element 0
for i in range(1, len(l)):
## having fixated on element i, look at all the ones
## *before* i and see if they are bigger and should
## be placed *after* i
for j in range(i):
if l[i] < l[j]:
## this is the case where l[j] is bigger. since
## all the j come *before* the i this means that
## the sort is incorrect and we must swap i and j
# print('SWAP: %d/%d <--> %d/%d' % (i, l[i], j, l[j]))
l[j], l[i] = l[i], l[j]
return l
Look at the program in Listing 23.4.1 and put your
own code in for the function sort_myveryown()
. You can look up
“insertion sort” in wikipedia for a very simple description of the
algorighm, and their “pseudocode” (a description of the algorithm that
is not in any specific programming language) can be translated to
Python quite easily. If you feel that your procedure was different
from insertion sort then adapt the code to match what you did.
#! /usr/bin/env python3
"""This is a framework into which you can drop your Python sorting
routine. The function sort_myveryown() is empty (it just returns the
original list) and you can use it to put in your own sorting
function.
Two other functions are provided for comparison: one is a
hastily-written bubblesort routine, the other is the built-in python
list sorting method.
"""
import random
import inspect
def main():
N = 20
run_tests(N)
def run_tests(N):
l_presorted = list(range(N))
## l_random will have random numbers with a non-fixed seed
random.seed(None)
l_random = [0]*N
for i in range(len(l_random)):
l_random[i] = random.randint(0, 99)
random.seed(1234)
## l_random_fixed will always have the same random numbers
l_random_fixed = [0]*N
for i in range(len(l_random_fixed)):
l_random_fixed[i] = random.randint(0, 99)
## l_turtle is designed to do quite poorly with some algorithms:
## it has a small value at the end
l_turtle = list(range(1, N-1))
l_turtle.append(0)
list_name_dict = {'l_presorted' : l_presorted,
'l_random_fixed' : l_random_fixed,
'l_random' : l_random,
'l_turtle' : l_turtle}
for algo in (sort_quicksort, sort_python_builtin, sort_bubble, sort_myveryown):
print('algorithm: %s' % algo.__name__)
for which_list in list_name_dict.keys():
print(' list: %s' % which_list)
l_before = list_name_dict[which_list]
l_sorted = algo(list_name_dict[which_list])
print(' ', l_before, ' ----> ', l_sorted)
def sort_myveryown(l):
## FIXME: must insert my very own sorting function here
return l
def sort_bubble(l):
l2 = l[:]
for i in range(len(l)):
for j in range(len(l)-1):
if l2[j] > l2[j+1]:
l2[j], l2[j+1] = l2[j+1], l2[j]
return l2
def sort_quicksort(l):
l2 = l[:]
do_quicksort(l2, 0, len(l)-1)
return l2
def do_quicksort(l, low, high):
if low < high:
p = do_qsort_partition(l, low, high)
do_quicksort(l, low, p-1)
do_quicksort(l, p+1, high)
def do_qsort_partition(l, low, high):
pivot = l[high]
i = low-1
for j in range(low, high):
if l[j] < pivot:
i = i + 1
l[i], l[j] = l[j], l[i]
if l[high] < l[i+1]:
l[i+1], l[high] = l[high], l[i+1]
return i+1
def sort_python_builtin(l):
l2 = l[:]
l2.sort()
return l2
main()
23.5. Profiling the algorithm
23.5.1. Modify the program to print information
The term “profiling” is used in software to mean “figuring out how much time a computer program spends in each of its functions. This is an important tool to figure out how to improve performance.
As you have seen, the two operations which are called repeatedly in a sorting algorithm are comparison and swap. This means that the most important part of our profiling work will be to count how many times we compare two values in the list and how many times we swap two values in the list. Each algorithm will do differently on those two issues.
The program in Listing 23.5.1 gives a “framework” in which you can drop your own function and it will count how many times it does a comparison or a swap. It then prints out the results in a form that can be easily plotted.
The way the program does this is to have two functions:
increment_comparisons()
and increment_swaps()
. We will call
these functions every time we do a comparison or a swap.
The program sort_frame_profiling.py
then prints out the number of
comparisons and number of swaps, as well as the size of the list, for
many different list sizes and for all the algorithms we use.
#! /usr/bin/env python3
"""This is a framework for putting in your own sorting function, but
it also has hooks for profiling the sorting function by counting the
number of comparisons and swaps.
"""
import random
import sys
# count_dict = {}
comparisons = 0
swaps = 0
def main():
## see if the user gave a command line argument for the list length
if len(sys.argv) > 1:
N_MAX = int(sys.argv[1])
else:
N_MAX = 100
for N in range(N_MAX):
run_sort_algorithms(N)
def run_sort_algorithms(N):
l_presorted = list(range(N))
## l_random will have random numbers with a non-fixed seed
random.seed(None)
l_random = [0]*N
for i in range(len(l_random)):
l_random[i] = random.randint(0, 100)
## l_random_fixed will always have the same random numbers
random.seed(1234)
l_random_fixed = [0]*N
for i in range(len(l_random_fixed)):
l_random_fixed[i] = random.randint(0, 100)
random.seed(None) # return the seed to be None
## l_turtle is designed to do quite poorly with some algorithms:
## it has a small value at the end
l_turtle = list(range(1, N-1))
l_turtle.append(0)
## here is the list of names of initial list layout, mapped to the
## actual lists
list_name_dict = {'l_presorted' : l_presorted,
'l_random_fixed' : l_random_fixed,
'l_random' : l_random,
'l_turtle' : l_turtle}
for algo in (sort_python_builtin, sort_bubble, sort_myveryown, sort_quicksort):
run_sort_single_algorithm(algo, N, list_name_dict)
def run_sort_single_algorithm(algo, N, list_name_dict):
reset_stats(list_name_dict)
# print('algorithm: %s' % algo.__name__)
for which_list in list_name_dict.keys():
# print('list: %s' % which_list)
l_before = list_name_dict[which_list]
l_sorted = algo(list_name_dict[which_list], which_list)
# print(' ', l_before, ' ----> ', l_sorted)
# print_stats(N, algo.__name__, which_list)
print_stats(N, algo.__name__, list_name_dict)
def sort_myveryown(l, list_type):
## FIXME: must insert my very own sorting function here
return l
def sort_bubble(l, list_type):
l2 = l[:]
for i in range(len(l)):
for j in range(len(l)-1):
increment_comparisons(list_type)
if l2[j] > l2[j+1]:
increment_swaps(list_type)
l2[j], l2[j+1] = l2[j+1], l2[j]
return l2
def sort_quicksort(l, list_type):
l2 = l[:]
do_quicksort(l2, 0, len(l)-1, list_type)
return l2
def do_quicksort(l, low, high, list_type):
if low < high:
p = do_qsort_partition(l, low, high, list_type)
do_quicksort(l, low, p-1, list_type)
do_quicksort(l, p+1, high, list_type)
def do_qsort_partition(l, low, high, list_type):
pivot = l[high]
i = low-1
for j in range(low, high):
increment_comparisons(list_type)
if l[j] < pivot:
i = i + 1
increment_swaps(list_type)
l[i], l[j] = l[j], l[i]
increment_comparisons(list_type)
if l[high] < l[i+1]:
increment_swaps(list_type)
l[i+1], l[high] = l[high], l[i+1]
return i+1
def sort_python_builtin(l, list_type):
"""Use the built-in sorting function provided by Python. Note that
since we don't write the innards of this function, we cannot keep
track of how many comparisons and swaps it does. We do know that
Python's built-in list.sort() and sorted() functions use the
Timsort algorithm which is a modified version of merge sort which
uses insertion sort to arrange the list of items into conveniently
mergeable sections. An exercise in the text discusses figuring out
how to count comparisons in this algorithm."""
return sorted(l)
def reset_stats(list_name_dict):
"""Resets the counts of comparisons and swaps."""
global comparisons
global swaps
comparisons = {l_type: 0 for l_type in list_name_dict.keys()}
swaps = {l_type: 0 for l_type in list_name_dict.keys()}
def increment_comparisons(list_type):
"""Increment the counter for the number of comparisons of this
type of list."""
global comparisons
comparisons[list_type] += 1
def increment_swaps(list_type):
"""Increment the counter for the number of swaps of this type
of list."""
global swaps
swaps[list_type] += 1
def print_stats(N, algo_name, list_name_dict):
"""Print a line with statistics on how this algorithm performs for the
various lists with length N"""
global comparisons
global swaps
list_types = sorted(list_name_dict.keys())
## open the file to write out this data
fname = algo_name + '.out'
if N == 0:
## first time around we zero out the file and write a header line
f = open(fname, 'w')
print('Starting to write to file %s' % fname)
f.write('## ALGO: %s\n' % algo_name)
f.write('## COMMENT: columns are "iter", "number-of-comparisons",\n')
f.write('## COMMENT: "number-of-swaps" for various types of lists\n')
f.write('## iter')
for l_type in list_types:
f.write(' %s ' % l_type)
f.write('\n')
else:
f = open(fname, 'a')
f.write('%5d' % N)
for l_type in list_types:
# print('ALGO_%s--LIST_%s--comparisons--swaps: %d %d %d'
# % (algo_name, l_type, N, comparisons[l_type], swaps[l_type]))
f.write(' %5d %5d' % (comparisons[l_type], swaps[l_type]))
f.write('\n') # finish this line of data
f.close()
main()
23.5.2. Run the program and make plots
When you run
mkdir sort-stats ## let's be tidy abogut where are files are
cd sort-stats
python3 sort_frame_profiling.py
you will notice that it takes a while to run (depending on the value
of N_MAX
in main()
), and when it is done you will have a few
files called sort_ALGO.out
for each of the algorithms.
You don’t have to wait for it to finish! You can open another terminal and type:
cd sort-stats
ls -lsat | head ## look at recently modified files
wc sort_*.out ## see how many lines each file has
head sort_*.out ## look at the start of each file
tail sort_*.out ## look at the end of each file
You can repeat that tail
command several times as the run goes on
and you will get a feeling for the progress being made.
You can also make plots of the performance of the algorithms. Run gnuplot:
$ gnuplot
and in gnuplot run the following instructions:
set grid
set xlabel 'N (size of list)'
set ylabel 'number of operations'
plot 'sort_bubble.out' using 1:4 with lines title 'bubble/random/comparisons', \
'sort_bubble.out' using 1:5 with lines title 'bubble/random/swaps', \
'sort_quicksort.out' using 1:4 with lines title 'quicksort/random/comparisons', \
'sort_quicksort.out' using 1:5 with lines title 'quicksort/random/swaps'
23.5.3. How do we understand these plots?
[TODO] Make some plots in gnuplot of x**2
and x log(x)
, then
introduce two bits of terminology: “computational complexity” and
“asymptotic behavior”.
23.5.4. Exercises
Modify sort_frame_profiling.py
to count the number of
comparisons made by Python’s built-in sort function. One way to
do this is to use an argument to l.sort(my_compare)
which
replaces Python’s built-in comparison function. You could then
write your own my_compare()
function which would call
increment_comparison(l_type)
.
Research whether it is possible to count the number of swaps made by Python’s built-in sort function.
23.6. Computational complexity
Make plots versus N
and compare to \(N^2\) and \(N log(N)\)
23.7. Further reading
The wikipedia articles on various sorting methods are worth a read:
There is also a web site with animations of several sorting algorithms:
A youtube video that gives video and audio animation of a sort: