.. _app-project-proposals:

=============================
 Appendix: Project proposals
=============================

.. contents:: Contents of this chapter
   :depth: 2
   :local:


.. rubric:: Motivation

This appendix has some ideas for student projects.  They are crafted
with the idea of getting students to:

* craft or using interesting algorithms

* write some code

* explore a topic from a wide selection of areas of endeavor

* optionally then contribute a chapter to this book

Each section of this appendix should ideally have a link to a concise
and clear definition of the topic, and ideas on how to start the
exploration.


Social sciences
===============

Note that some of the biology topics, like phylogenetic analysis, are
also social science areas.


Optimal stopping and life/business
----------------------------------

Subject areas: psychology, economics

One variant of optimal stopping is the Marriage Problem, or Secretary
Problem.  It is described at
https://www.geeksforgeeks.org/secretary-problem-optimal-stopping-problem/

Possible project: craft a simulation modeling the applicant pool as a
list of N applicant scores.  Then generate a very large number of
applicant pools.

For each applicant pool go through them and stop after k applicants,
taking note of the best score so far.  Then continue, picking the
first applicant with a score greater than that "best so far".  Record
this for all possible k (between 1 and N), and show which stopping
point k would be the best.

Compare this with the theoretical value of 1/e.

Possible references:

* Brian Christian and Tom Griffiths - "Algorithms to live by: the
  computer science of human decisions".  The chapter on
  "Optimal stopping: when to stop looking".

* Blog post inspired by the book:
  https://medium.com/geekculture/computer-science-algorithms-in-daily-life-optimal-stopping-608d6868b1b


Multi-armed bandits and exploration vs. exploitation
----------------------------------------------------

Subject areas: psychology, economics

Look at the problem of exploration vs. exploitation in general, and
then consider human and societal situations in which it can be
applied.

Psychology: risk aversion in different phases of life.

Management: employee tasking in different phases of a project.

Project: using literature from psychology, find data sets that show
how real people do exploration vs. exploitation, and compare them to
the theoretical result.

Possible references:

* Brian Christian and Tom Griffiths - "Algorithms to live by: the
  computer science of human decisions".  The chapter on
  "Explore/exploit : the latest vs. the greatest"  

* https://medium.com/geekculture/computer-science-algorithms-in-daily-life-explore-vs-exploit-a613232cf9e1


Deadly conflicts
----------------

Subject areas: history.

Explore data sets (which can be found in the references below) on the
statistics of casualties in war.  Look for a pedagogically useful
visualization of one or more data sets.

The two plots that could be made involved (a) the power law showing
the general behavior, and (b) trend plots throughout history.

A discussion of power laws can then be interesting.

Possible references:

* "The Better Angels of Our Nature: Why Violence Has Declined" - by
  Steven Pinker.  Chapters: The statistics of deadly quarrels, Part
  1 : the timing of wars ; The statistics of deadly quarrels, Part 2 :
  the magnitude of wars.

* Aaron Clauset: "Trends and fluctuations in the severity of
  interstate wars" -
  https://www.science.org/doi/10.1126/sciadv.aao3580

* Cunen, Hjort, Nygard: "Statistical sightings of better angels:
  Analysing the distribution of battle-deaths in interstate conflict
  over time".  Article at:
  https://journals.sagepub.com/doi/10.1177/0022343319896843 and
  replication data at:
  https://www.prio.org/journals/jpr/replicationdata


Quantifying overfitting in personal decisions
---------------------------------------------

[...]


Just about anything from Gwern Branwen
--------------------------------------

Gwern Branwent's web site has a vast collection of research interests,
mostly related to social science.  It would be good to distill
projects from them.

https://www.gwern.net/

https://www.gwern.net/Archiving-URLs

Sports
======

Time series for improvement on records
--------------------------------------

Start out by having a pure poisson process to generate improvement in
"world records".  For example, try something like:

.. code-block:: python

   import numpy as np
   s = np.random.poisson(5, 10000)

as shown at
https://numpy.org/doc/stable/reference/random/generated/numpy.random.poisson.html

Keep track of when a new sample beats all previous records, and study
the time between improvements on world records.

Then:

Examine the history of improvement in individual world records, for
example in the 100m swimming (both female and male), 100m running,
high jump, ...  Compare to the pure Poisson distribution.

In chess examine the history of surpassing "highest ever" Elo ratings
for (a) top player, (b) top 10 average, (c) top 100 average, and look
at when previous records get broken.  Try the same with alternative
chess metrics.

Try to see if this hypothesis is valid: improvements in world records
are randomly distributed (maybe Poisson? how about other
distributions?)  But sometimes you have a jump sooner than expected by
the statistics.  This could correspond to a "freak of nature" player
(but the statistics could also generate such players.  More
interesting are advances in training technique and in equipment.  Also
of note might be an increase in the population that plays that sport.


Optimal tournament structure
----------------------------

Create a group of (say) 32 virtual tournament players, and assign them a
rating, similar to chess Elo ratings.

Then set them up in double-round-robin, round-robin, swiss system, and
direct knockout.

Examine the consistency of the results.

Ask couple of questions (and maybe more):

* What is the best tournament format for various situations?
* How many places can be determined (just 1st? 1st, 2nd, 3rd? ...)
  using the double-round-robin as the gold standard.


Soccer analytics
----------------

See what you can do starting from this article about Messi in the 538
blog:

https://fivethirtyeight.com/features/lionel-messi-is-impossible/


Physical sciences
=================

Brownian motion
---------------

See if it is straightforward to write an agent-based model (or other
simulation) for brownian motion.

Then see if Einstein's 1905 results on Brownian motion can be derived
from this simulation.

Discuss, with these results in hand, how all of this relates to the
existence of atoms.

https://en.wikipedia.org/wiki/Brownian_motion


Life sciences
=============

Datasets for phylogenetic analysis
----------------------------------

Subject areas: biology, ecology, linguistics

Find specific datasets and create examples of phylogenetic trees.

These are more likely to come from biology, but they could also come
from linguistics or other areas of social science.

Start with canned examples, like in the 2021 Medium article by Rishika
Gupta, and learn to make phylogenetic trees (including going beyond
her examples).

Then look for coded datasets from language or mythology or other
social science sources.

Possible references:

* https://biopython.org/wiki/Phylo

* http://etetoolkit.org/

* https://medium.com/geekculture/phylogenetic-trees-implement-in-python-3f9df96c0c32

* https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04274-6


Predator-prey ecology: equations and agents
-------------------------------------------

Subject areas: biology, ecology

Craft a predator-prey scenario in two different ways: (a) by solving
the Lotka-Volterra equations as shown in Almond Heil's model
:numref:`chap-ecology`, and (b) using the agent-based modeling system
shown in :numref:`chap-agent-based-modeling-basics`

Start by adapting Almond Heil's model to predator-prey interaction (you
might also get this from the basic Mesa examples).

Then try to set up the same model in a Lotka-Volterra setting: solve
the Lotka-Volterra model with a differential equation solver, and see
if its cyclic population behavior matches what you get from the
agent-based model.

After looking at data and plots, do some thinking about the
assumptions that go in to both kinds of models, and try to make them
match more closely.

If you cannot get a close match, do some thinking on the mechanisms
behind both approaches and try to explain the differences.

Possible references:

* :numref:`chap-agent-based-modeling-basics`

* https://mesa.readthedocs.io/en/latest/

* https://towardsdatascience.com/introduction-to-mesa-agent-based-modeling-in-python-bcb0596e1c9a

* https://github.com/projectmesa/mesa


Infectious disease modeling
---------------------------

Subject areas: biology, epidemiology

The basic model used for infectious diseases is the SIR model, where
you track "susceptible", "infected", and "removed" population members.

You can also model it with agent-based models.

So the idea here is to look at the SIR model in two different
ways: (a) using the agent-based modeling system shown in
:numref:`chap-agent-based-modeling-basics`, and (b) by solving the SIR
differential equations.

Start by adapting Almond Heil's model to have variables that track the
full S, I, and R populations, and color them appropriately.

Then set up the SIR differential equations to have the same starting
values, and run them with a simple differential equation solver - you
can write your own Euler's method solver, or use the scipy runge-kutta
solver.

Possible references:

* :numref:`chap-agent-based-modeling-basics`

* https://www.maa.org/press/periodicals/loci/joma/the-sir-model-for-spread-of-disease-the-differential-equation-model

* https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology

* https://www.frontiersin.org/articles/10.3389/fams.2020.571544/full

* https://www.niss.org/sites/default/files/SIR_Modeling_tutorial_ob.pdf

* https://mesa.readthedocs.io/en/latest/

* https://towardsdatascience.com/introduction-to-mesa-agent-based-modeling-in-python-bcb0596e1c9a

* https://github.com/projectmesa/mesa


Mathematics
===========

puzzles
-------

[must flesh this out]

Jason Breshears at https://archaics.com/ has a youtube video where he
presents what's special for the number 2178 in his observation.

On the role of intuition in mathematics
---------------------------------------

Read in depth this article on mathematical intuition by famous
mathematician Terence Tao:

https://terrytao.wordpress.com/career-advice/theres-more-to-mathematics-than-rigour-and-proofs/

and try to devise a simple problem in mathematics (maybe one of the
ones below) that could illustrate the path from pre-rigorous to
rigorous to post-rigorous stages in one's mathematical journey.

Monty Hall's door problem
-------------------------

Write a simulation of the famous Monty Hall door problem.  The
simulation should be a very simple programming task.

This could be done with a language you already know, or as an exercise
to learn a new programming language, since the simulation will be very
simple.

* Leonard Mlodinow "The Drunkard's Walk", chapter 3 "Finding Your Way
  Through a Space of Possibilities"

* https://www.youtube.com/watch?v=AD6eJlbFa2I

* https://www.youtube.com/watch?v=4Lb-6rxZxx0


Random walks
------------

Subject areas: mathematics

Look at the theory of random walks and compare it to simulations.

Possible references: the chapter on randomness and disorder,
:numref:`chap-randomness-disorder`

Runge-kutta method
------------------

Subject areas: mathematics

Compare the runge-kutta library (gsl or scipy) result to a hand-coded
Euler method.  Compare first-point, mid-point, and last-point of
Euler's method to runge-kutta.

The compare them all to the exact solution for the nonlinear
pendulum, and show the difference.


Sync and the Kuramoto model
---------------------------

Subject areas: mathematics

Look at the simple mathematical models for sync described in Steven
Strogatz's lecture:

https://youtu.be/RpU7JrE1uCk

and implement the simple mathematical models that he shows.

He discusses the Kuramoto model at about 40 minutes into the video,
and he shows an example of a program that visualizes that model.

Learn the model, and then figure out how to implement it in python,
starting with text output and then using a simple widget set for the
visualization.

Possible references:

* https://youtu.be/RpU7JrE1uCk

* https://en.wikipedia.org/wiki/Kuramoto_model

* https://www.complexity-explorables.org/explorables/ride-my-kuramotocycle/

* http://www.complexity-explorables.org/slides/ride-my-kuramotocycle/

* https://www.ted.com/talks/steven_strogatz_the_science_of_sync


Analysis of chess
-----------------

No articulation of a project yet, but the ideas are approximately:

* How does the advantage of the white pieces change with player
  ratings, beginner to grandmaster?  And to time controls?

* What are the standard deviations of victory predictions based on
  ratings?

Possible sources:

* lichess.org API

* https://content.iospress.com/articles/icga-journal/icg0012

* https://en.wikipedia.org/wiki/Comparison_of_top_chess_players_throughout_history

* https://www.quora.com/Has-Stockfish-or-any-other-top-chess-program-performed-an-analysis-of-the-Karpov-Kasparov-matches-For-me-that-was-the-pinnacle-of-chess-though-probably-Carlsen-would-beat-either-i-e-was-Muhammed-Ali-better-than-Joe

* https://lichess.org/blog/YafSBxEAACIAr0ZA/exact-exacting-who-is-the-most-accurate-world-champion

* Understanding distributions of chess performances.  Study and
  reproduce this paper by Regan:

https://cse.buffalo.edu/~regan/papers/pdf/

https://cse.buffalo.edu/~regan/papers/pdf/RMH11.pdf

(There is also an RMH11b.pdf -- which is the more recent version?)

Also look at this quora answer that mentions it:
https://www.quora.com/What-is-the-most-important-move-in-chess/answer/H%C3%A5kon-Hapnes-Strand?ch=10&oid=340561749&share=7fabf913&srid=3MvD&target_type=answer


Computer science and information technology
===========================================

Situational awareness of your network
-------------------------------------

Subject areas: IT, computer programming

The idea here is to understand what devices are on your network, and
what services they offer.

Start by familiarizing yourself with the commands in the "How to get
insight into a network" chapter of the Sysadmin Hacks book at at
https://markgalassi.codeberg.page/sysadmin-hacks-html/ (you might want
to look at the previous chapter "The basics of networking" for
background).

In particular the sections on the `nmap` and `pnscan` commands are the
starting point for what we want to do.  Observe the command line
output for each of them with the examples shown in the book.

Then separately install the program `etherape` and look at what it
does.

Finally, look at the graphical output from the program
`map_network_utils.py` in that chapter.

At this point you can take two different directions: one is to keep
using separate utilities to visualize the network, in this case the
powerful `graphviz` tool.  Graphviz allows a lot of very fancy layout
options, and quite a bit more information about the network can be
added to the display in the form of logos for the operating system
(Linux, android, ...), or for the protocol (ssh, imap, ...), or one
could try to probe traffic information and visualize that.

* https://graphviz.org/gallery/


Humanities and the arts
=======================

Music generation: tone beyond sin waves
---------------------------------------

Topics: music

Take the simple sine wave generators shown in
:numref:`chap-music-basics` and investigate how to make the tone more
like that of a guitar string.

This will involve moving from a `sin()` wave to a more complex
funciton.  One start could be to add the harmonics to make a sawtooth
wave or a triangle wave, but you should do some research to find out
what simple tone generators sound interesting.

Music generation: add stereo
----------------------------

Topics: music

Take the simple sine wave generators shown in
:numref:`chap-music-basics` and investigate how to add an artificial
sterephonic feeling.

Right now the examples generate identical left and right channel
signals.  The idea here is to look at how to modify them slightly in
various ways, and to see if that creates a stereo effect.

After a bit of tinkering and experimentation you could then do
research to see if there is any known mathematics for doing this, and
implement those techniques.

Further explorations into Zipf's law
------------------------------------

Subject areas: literature, digital humanities

Start from the simple programs that demonstrate Zipf's law in
:numref:`power-laws-zipf-benford`.  Then read Isabella Trejo's
presentation from the Institute for Computing in Research at
https://github.com/izzytrejo/Zipf to understand some of the proposed
underpinnings for Zipf's law in literary text.

Then investigate some mathematical properties of the various
translations, such as the power law slopes and the power law breaks.

Does this happen consistently as you look at the translation of
different books into the same other languages?

How does this depend on the translator?  For example, can you find
examples of the same translator going between the same languages on
different books?  How about different translators on the same books to
the same other languages?

This might reveal Zipf's law to be an interesting tool for analyzing
style and/or language, or it might show that it is not sufficient to
get much insight.  If the latter, then the next step might be to
research what other techniques are used in digital humanities for this
kind of analysis.

Possible references:

* Courtney Lawton's dissertation:
  https://digitalcommons.unl.edu/dissertations/AAI10831194/
  (we need to find the full book)

* Isabella Trejo's project: https://github.com/izzytrejo/Zipf


Analyzing wordle
----------------

The popular game wordle has spawned many variants, as well as
applications to other languages.

In analogy to "opening theory" in chess, it would be interesting to
develop a repertoire of best "starting words".

We can try to use shell commands (like grep with regular expressions)
and the ``/usr/share/dict/words`` text file to learn as much as
possible about optimal first word choices.  We can switch to a full
python program once we hit the limits of this approach.

The candidate best words should then be tested on a large monte-carlo
sampling of possible challenge words to see which works better.

This can then be expanded in interesting ways to (a) apply to quordle
(4 words simultaneously) and larger variants, or (b) to use
psychological factors on guessing words to fine-tune the opening
words.

To give a bit of a start to see how easy it is to do a command-line
study of wordle:

.. code-block:: bash

   # what are all the 5-letter entries in the dictionary?
   grep '^.....$' /usr/share/dict/words
   # how about if you exclude names and punctuation?
   egrep '^.....$' /usr/share/dict/words | egrep -v '[^a-z]'
   # many entries are there?
   egrep '^.....$' /usr/share/dict/words | egrep -v '[^a-z]' | wc
   # what words have 'y' in the 3rd entry?
   egrep '^.....$' /usr/share/dict/words | egrep -v '[^a-z]' | grep '..y..'
   # how many words have 'y' in the 3rd entry?
   egrep '^.....$' /usr/share/dict/words | egrep -v '[^a-z]' | grep '..y..' | wc
   # how many words start with 'f' and have 'y' in the 3rd entry?
   egrep '^.....$' /usr/share/dict/words | egrep -v '[^a-z]' | grep
   'f.y..' | wc

Possible references for word lists:

* look at the file /usr/share/dict/words
* https://runestone.academy/ns/books/published/fopp/Projects/common_words.html#common-words


Possible references on other such games:

* https://www.nytimes.com/games/wordle/index.html
* https://worldle.teuteuf.fr/
* https://worldledaily.com/
* https://www.quordle.com/#/
* https://world3dmap.com/duotrigordle-game/
* https://citydle.com/
* https://globle-game.com/
* https://globle-game.com/
* https://greggblanchard.com/statle/
* https://plurality.fun/
* https://googlemapsmania.blogspot.com/2022/06/the-top-10-wordle-like-map-games.html
* https://www.online-tech-tips.com/cool-websites/11-best-sites-to-play-geography-games-online/
* https://locatle.strct.net/
* https://statle.us/
* https://outflux.net/statele/
* https://www.reddit.com/r/geoguessr/comments/v5ihla/locatle_is_a_mix_between_geoguessr_and_wordle/
* explordle
* https://oec.world/en/tradle/


Can generative AI make art or music with a simple project?
----------------------------------------------------------

Topics: music, visual art

Investigate the various AI engines for which people have written
generative AI modules (I think there are some for both tensorflow and
pytorch).  Only look at free/open-source modules.

Explore the results of generative art and music for those modules, and
then see if it is possible to give a simple procedure for someone to
start from scratch and end up with running programs that do the work.