41. Appendix: Project proposals
Contents of this chapter
Motivation
This appendix has some ideas for student projects. They are crafted with the idea of getting students to:
craft or using interesting algorithms
write some code
explore a topic from a wide selection of areas of endeavor
optionally then contribute a chapter to this book
Each section of this appendix should ideally have a link to a concise and clear definition of the topic, and ideas on how to start the exploration.
41.2. Sports
41.2.1. Time series for improvement on records
Start out by having a pure poisson process to generate improvement in “world records”. For example, try something like:
import numpy as np
s = np.random.poisson(5, 10000)
as shown at https://numpy.org/doc/stable/reference/random/generated/numpy.random.poisson.html
Keep track of when a new sample beats all previous records, and study the time between improvements on world records.
Then:
Examine the history of improvement in individual world records, for example in the 100m swimming (both female and male), 100m running, high jump, … Compare to the pure Poisson distribution.
In chess examine the history of surpassing “highest ever” Elo ratings for (a) top player, (b) top 10 average, (c) top 100 average, and look at when previous records get broken. Try the same with alternative chess metrics.
Try to see if this hypothesis is valid: improvements in world records are randomly distributed (maybe Poisson? how about other distributions?) But sometimes you have a jump sooner than expected by the statistics. This could correspond to a “freak of nature” player (but the statistics could also generate such players. More interesting are advances in training technique and in equipment. Also of note might be an increase in the population that plays that sport.
41.2.2. Optimal tournament structure
Create a group of (say) 32 virtual tournament players, and assign them a rating, similar to chess Elo ratings.
Then set them up in double-round-robin, round-robin, swiss system, and direct knockout.
Examine the consistency of the results.
Ask couple of questions (and maybe more):
What is the best tournament format for various situations?
How many places can be determined (just 1st? 1st, 2nd, 3rd? …) using the double-round-robin as the gold standard.
41.2.3. Soccer analytics
See what you can do starting from this article about Messi in the 538 blog:
https://fivethirtyeight.com/features/lionel-messi-is-impossible/
41.3. Physical sciences
41.3.1. Brownian motion
See if it is straightforward to write an agent-based model (or other simulation) for brownian motion.
Then see if Einstein’s 1905 results on Brownian motion can be derived from this simulation.
Discuss, with these results in hand, how all of this relates to the existence of atoms.
41.4. Life sciences
41.4.1. Datasets for phylogenetic analysis
Subject areas: biology, ecology, linguistics
Find specific datasets and create examples of phylogenetic trees.
These are more likely to come from biology, but they could also come from linguistics or other areas of social science.
Start with canned examples, like in the 2021 Medium article by Rishika Gupta, and learn to make phylogenetic trees (including going beyond her examples).
Then look for coded datasets from language or mythology or other social science sources.
Possible references:
41.4.2. Predator-prey ecology: equations and agents
Subject areas: biology, ecology
Craft a predator-prey scenario in two different ways: (a) by solving the Lotka-Volterra equations as shown in Almond Heil’s model Section 18, and (b) using the agent-based modeling system shown in Section 28
Start by adapting Almond Heil’s model to predator-prey interaction (you might also get this from the basic Mesa examples).
Then try to set up the same model in a Lotka-Volterra setting: solve the Lotka-Volterra model with a differential equation solver, and see if its cyclic population behavior matches what you get from the agent-based model.
After looking at data and plots, do some thinking about the assumptions that go in to both kinds of models, and try to make them match more closely.
If you cannot get a close match, do some thinking on the mechanisms behind both approaches and try to explain the differences.
Possible references:
41.4.3. Infectious disease modeling
Subject areas: biology, epidemiology
The basic model used for infectious diseases is the SIR model, where you track “susceptible”, “infected”, and “removed” population members.
You can also model it with agent-based models.
So the idea here is to look at the SIR model in two different ways: (a) using the agent-based modeling system shown in Section 28, and (b) by solving the SIR differential equations.
Start by adapting Almond Heil’s model to have variables that track the full S, I, and R populations, and color them appropriately.
Then set up the SIR differential equations to have the same starting values, and run them with a simple differential equation solver - you can write your own Euler’s method solver, or use the scipy runge-kutta solver.
Possible references:
41.5. Mathematics
41.5.1. puzzles
[must flesh this out]
Jason Breshears at https://archaics.com/ has a youtube video where he presents what’s special for the number 2178 in his observation.
41.5.2. On the role of intuition in mathematics
Read in depth this article on mathematical intuition by famous mathematician Terence Tao:
https://terrytao.wordpress.com/career-advice/theres-more-to-mathematics-than-rigour-and-proofs/
and try to devise a simple problem in mathematics (maybe one of the ones below) that could illustrate the path from pre-rigorous to rigorous to post-rigorous stages in one’s mathematical journey.
41.5.3. Monty Hall’s door problem
Write a simulation of the famous Monty Hall door problem. The simulation should be a very simple programming task.
This could be done with a language you already know, or as an exercise to learn a new programming language, since the simulation will be very simple.
Leonard Mlodinow “The Drunkard’s Walk”, chapter 3 “Finding Your Way Through a Space of Possibilities”
41.5.4. Random walks
Subject areas: mathematics
Look at the theory of random walks and compare it to simulations.
Possible references: the chapter on randomness and disorder, Section 12
41.5.5. Runge-kutta method
Subject areas: mathematics
Compare the runge-kutta library (gsl or scipy) result to a hand-coded Euler method. Compare first-point, mid-point, and last-point of Euler’s method to runge-kutta.
The compare them all to the exact solution for the nonlinear pendulum, and show the difference.
41.5.6. Sync and the Kuramoto model
Subject areas: mathematics
Look at the simple mathematical models for sync described in Steven Strogatz’s lecture:
and implement the simple mathematical models that he shows.
He discusses the Kuramoto model at about 40 minutes into the video, and he shows an example of a program that visualizes that model.
Learn the model, and then figure out how to implement it in python, starting with text output and then using a simple widget set for the visualization.
Possible references:
41.5.7. Analysis of chess
No articulation of a project yet, but the ideas are approximately:
How does the advantage of the white pieces change with player ratings, beginner to grandmaster? And to time controls?
What are the standard deviations of victory predictions based on ratings?
Possible sources:
lichess.org API
https://en.wikipedia.org/wiki/Comparison_of_top_chess_players_throughout_history
https://lichess.org/blog/YafSBxEAACIAr0ZA/exact-exacting-who-is-the-most-accurate-world-champion
Understanding distributions of chess performances. Study and reproduce this paper by Regan:
https://cse.buffalo.edu/~regan/papers/pdf/
https://cse.buffalo.edu/~regan/papers/pdf/RMH11.pdf
(There is also an RMH11b.pdf – which is the more recent version?)
Also look at this quora answer that mentions it: https://www.quora.com/What-is-the-most-important-move-in-chess/answer/H%C3%A5kon-Hapnes-Strand?ch=10&oid=340561749&share=7fabf913&srid=3MvD&target_type=answer
41.6. Computer science and information technology
41.6.1. Situational awareness of your network
Subject areas: IT, computer programming
The idea here is to understand what devices are on your network, and what services they offer.
Start by familiarizing yourself with the commands in the “How to get insight into a network” chapter of the Sysadmin Hacks book at at https://markgalassi.codeberg.page/sysadmin-hacks-html/ (you might want to look at the previous chapter “The basics of networking” for background).
In particular the sections on the nmap and pnscan commands are the starting point for what we want to do. Observe the command line output for each of them with the examples shown in the book.
Then separately install the program etherape and look at what it does.
Finally, look at the graphical output from the program map_network_utils.py in that chapter.
At this point you can take two different directions: one is to keep using separate utilities to visualize the network, in this case the powerful graphviz tool. Graphviz allows a lot of very fancy layout options, and quite a bit more information about the network can be added to the display in the form of logos for the operating system (Linux, android, …), or for the protocol (ssh, imap, …), or one could try to probe traffic information and visualize that.
41.7. Humanities and the arts
41.7.1. Music generation: tone beyond sin waves
Topics: music
Take the simple sine wave generators shown in Section 32 and investigate how to make the tone more like that of a guitar string.
This will involve moving from a sin() wave to a more complex funciton. One start could be to add the harmonics to make a sawtooth wave or a triangle wave, but you should do some research to find out what simple tone generators sound interesting.
41.7.2. Music generation: add stereo
Topics: music
Take the simple sine wave generators shown in Section 32 and investigate how to add an artificial sterephonic feeling.
Right now the examples generate identical left and right channel signals. The idea here is to look at how to modify them slightly in various ways, and to see if that creates a stereo effect.
After a bit of tinkering and experimentation you could then do research to see if there is any known mathematics for doing this, and implement those techniques.
41.7.3. Further explorations into Zipf’s law
Subject areas: literature, digital humanities
Start from the simple programs that demonstrate Zipf’s law in Section 14. Then read Isabella Trejo’s presentation from the Institute for Computing in Research at https://github.com/izzytrejo/Zipf to understand some of the proposed underpinnings for Zipf’s law in literary text.
Then investigate some mathematical properties of the various translations, such as the power law slopes and the power law breaks.
Does this happen consistently as you look at the translation of different books into the same other languages?
How does this depend on the translator? For example, can you find examples of the same translator going between the same languages on different books? How about different translators on the same books to the same other languages?
This might reveal Zipf’s law to be an interesting tool for analyzing style and/or language, or it might show that it is not sufficient to get much insight. If the latter, then the next step might be to research what other techniques are used in digital humanities for this kind of analysis.
Possible references:
Courtney Lawton’s dissertation: https://digitalcommons.unl.edu/dissertations/AAI10831194/ (we need to find the full book)
Isabella Trejo’s project: https://github.com/izzytrejo/Zipf
41.7.4. Analyzing wordle
The popular game wordle has spawned many variants, as well as applications to other languages.
In analogy to “opening theory” in chess, it would be interesting to develop a repertoire of best “starting words”.
We can try to use shell commands (like grep with regular expressions)
and the /usr/share/dict/words
text file to learn as much as
possible about optimal first word choices. We can switch to a full
python program once we hit the limits of this approach.
The candidate best words should then be tested on a large monte-carlo sampling of possible challenge words to see which works better.
This can then be expanded in interesting ways to (a) apply to quordle (4 words simultaneously) and larger variants, or (b) to use psychological factors on guessing words to fine-tune the opening words.
To give a bit of a start to see how easy it is to do a command-line study of wordle:
# what are all the 5-letter entries in the dictionary?
grep '^.....$' /usr/share/dict/words
# how about if you exclude names and punctuation?
egrep '^.....$' /usr/share/dict/words | egrep -v '[^a-z]'
# many entries are there?
egrep '^.....$' /usr/share/dict/words | egrep -v '[^a-z]' | wc
# what words have 'y' in the 3rd entry?
egrep '^.....$' /usr/share/dict/words | egrep -v '[^a-z]' | grep '..y..'
# how many words have 'y' in the 3rd entry?
egrep '^.....$' /usr/share/dict/words | egrep -v '[^a-z]' | grep '..y..' | wc
# how many words start with 'f' and have 'y' in the 3rd entry?
egrep '^.....$' /usr/share/dict/words | egrep -v '[^a-z]' | grep
'f.y..' | wc
Possible references for word lists:
look at the file /usr/share/dict/words
https://runestone.academy/ns/books/published/fopp/Projects/common_words.html#common-words
Possible references on other such games:
41.7.5. Can generative AI make art or music with a simple project?
Topics: music, visual art
Investigate the various AI engines for which people have written generative AI modules (I think there are some for both tensorflow and pytorch). Only look at free/open-source modules.
Explore the results of generative art and music for those modules, and then see if it is possible to give a simple procedure for someone to start from scratch and end up with running programs that do the work.
41.1. Social sciences
Note that some of the biology topics, like phylogenetic analysis, are also social science areas.
41.1.1. Optimal stopping and life/business
Subject areas: psychology, economics
One variant of optimal stopping is the Marriage Problem, or Secretary Problem. It is described at https://www.geeksforgeeks.org/secretary-problem-optimal-stopping-problem/
Possible project: craft a simulation modeling the applicant pool as a list of N applicant scores. Then generate a very large number of applicant pools.
For each applicant pool go through them and stop after k applicants, taking note of the best score so far. Then continue, picking the first applicant with a score greater than that “best so far”. Record this for all possible k (between 1 and N), and show which stopping point k would be the best.
Compare this with the theoretical value of 1/e.
Possible references:
Brian Christian and Tom Griffiths - “Algorithms to live by: the computer science of human decisions”. The chapter on “Optimal stopping: when to stop looking”.
Blog post inspired by the book: https://medium.com/geekculture/computer-science-algorithms-in-daily-life-optimal-stopping-608d6868b1b
41.1.2. Multi-armed bandits and exploration vs. exploitation
Subject areas: psychology, economics
Look at the problem of exploration vs. exploitation in general, and then consider human and societal situations in which it can be applied.
Psychology: risk aversion in different phases of life.
Management: employee tasking in different phases of a project.
Project: using literature from psychology, find data sets that show how real people do exploration vs. exploitation, and compare them to the theoretical result.
Possible references:
Brian Christian and Tom Griffiths - “Algorithms to live by: the computer science of human decisions”. The chapter on “Explore/exploit : the latest vs. the greatest”
https://medium.com/geekculture/computer-science-algorithms-in-daily-life-explore-vs-exploit-a613232cf9e1
41.1.3. Deadly conflicts
Subject areas: history.
Explore data sets (which can be found in the references below) on the statistics of casualties in war. Look for a pedagogically useful visualization of one or more data sets.
The two plots that could be made involved (a) the power law showing the general behavior, and (b) trend plots throughout history.
A discussion of power laws can then be interesting.
Possible references:
“The Better Angels of Our Nature: Why Violence Has Declined” - by Steven Pinker. Chapters: The statistics of deadly quarrels, Part 1 : the timing of wars ; The statistics of deadly quarrels, Part 2 : the magnitude of wars.
Aaron Clauset: “Trends and fluctuations in the severity of interstate wars” - https://www.science.org/doi/10.1126/sciadv.aao3580
Cunen, Hjort, Nygard: “Statistical sightings of better angels: Analysing the distribution of battle-deaths in interstate conflict over time”. Article at: https://journals.sagepub.com/doi/10.1177/0022343319896843 and replication data at: https://www.prio.org/journals/jpr/replicationdata
41.1.4. Quantifying overfitting in personal decisions
[…]
41.1.5. Just about anything from Gwern Branwen
Gwern Branwent’s web site has a vast collection of research interests, mostly related to social science. It would be good to distill projects from them.
https://www.gwern.net/
https://www.gwern.net/Archiving-URLs