.. _chap-getting-to-philosophy:

=======================
 Getting to philosophy
=======================

[status: content-mostly-written]


Motivation, prerequisites, plan
===============================


.. rubric:: Motivation

Go to any Wikipedia page and follow the first link in the body of its
text, then follow the first link of that page, and so forth.
For almost all Wikipedia pages this procedure will eventually lead you
to the Wikipedia page on Philosophy.  This observation has its own
Wikipedia page:

https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy

.. note::

   When we say "first link" on a Wikipedia page, we mean the first
   link of the article content, after all the "for other uses", "(from
   Greek ...)", and other frontmatter -- these are not part of the
   article itself.

This is not a rigorous or deep observation, but it allows us to write
some software to analyze and visualize this assertion, and that
journey will teach us some very cool programming techniques.

* Explore the "Getting to Philosophy" observation.
* Learn how to do a bit of *web scraping* and text manipulation.
* Use recursive programming for a real world application.
* Learn about the remarkable ``graphviz`` software.

.. rubric:: Prerequisites


* The 10-hour "serious programming" course.
* The "Data files and first plots" mini-course in
  :numref:`chap-data-files-and-first-plots`.
* Recursion from :numref:`chap-recursion`.
* Web scraping from :numref:`chap-web-scraping`.


.. rubric:: Plan

So how do we write programs that study and visualize this idea?  We
will:

#. Review what web pages look like.
#. Write programs that retrieve and pick apart web pages looking for
   links.
#. Learn about graphviz.
#. Use graphviz to analyze the flow of links in our simple web pages.
#. Make those programs more subtle to search through the more complex
   HTML structure in Wikipedia articles.
#. Output the "first link" chain in various Wikipedia pages to a file
   so that graphviz can show us an interesting visualization of
   that chain.


Parsing simple web pages
========================

You should quickly review the brief section on what web pages look like in
:numref:`sec-what-does-a-web-page-look-like-underneath` before
continuing in this section.

Let us start with the simple web page we had in
:numref:`listing-simple-web-page-with-anchor` back in
:numref:`sec-what-does-a-web-page-look-like-underneath`.

Now write a program which finds the first hyperlink in a web page.
There are many ways of doing this using sophisticated Python
libraries, but we will start with a simple approach that uses only
Python's string methods.  An example is in
:numref:`listing-find-first-anchor`.

.. _listing-find-first-anchor:

.. literalinclude:: find_first_link.py
   :language: python
   :caption: Look through the text of a page for the first hypertext link.

Running this program will give the position and text of the first
hyperlink in that HTML file::

   $ ./find_first_link.py myinfo.html
   pos, link URL: 330 myinfo.html
   last_part: myinfo


Making vertex and edge graphs
=============================

*Graph* can mean many things.  In computer science it is a picture
that shows connections between things.  The "things" are shown as
shapes and the connections are shown as lines or arrows.

There is a very cool program called ``graphviz`` which lets you write a
simple text file and get a graph drawn from it.
:numref:`listing-kennedys-py` is a simple example that shows a bit
of President Kennedy's family tree:

.. _listing-kennedys-py:

.. literalinclude:: kennedys.dot
   :caption: The Kennedy family tree

You can then generate the picture with:

.. parsed-literal::

   dot -Tsvg -O kennedys.dot
   dot -Tpng -O kennedys.dot
   dot -Tpdf -O kennedys.dot

.. _fig-kennedys:

.. figure:: kennedys.dot.*

   The immediate family tree of president Kennedy, rendered with
   graphviz.

You can see more elaborate and sometimes quite visually striking
examples at the graphviz web site: http://www.graphviz.org/Gallery.php

You can see that it would be illustrative to make such a graph of the
paths through Wikipedia pages.
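
To make that idea concrete, here is a minimal sketch of how a chain of
page names could be turned into text that ``dot`` understands.  The
function name ``chain_to_dot()`` and the page names in the example are
invented for this illustration; they are not taken from any real
Wikipedia chain.

.. code-block:: python

   def chain_to_dot(chain, graph_name='chain'):
       """Return graphviz text with one arrow per consecutive pair."""
       lines = ['digraph %s {' % graph_name]
       # zip(chain, chain[1:]) pairs each page with the one after it
       for src, dst in zip(chain, chain[1:]):
           lines.append('    "%s" -> "%s";' % (src, dst))
       lines.append('}')
       return '\n'.join(lines) + '\n'


   # placeholder page names, not a real chain
   example_chain = ['Page_A', 'Page_B', 'Philosophy']
   print(chain_to_dot(example_chain))

If you save that output to a file, for example ``chain.dot``, you can
process it with ``dot -Tsvg -O chain.dot`` just like the family tree.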

But first let's take some baby steps: to get more comfortable with how
graphviz works, students should create their own ``.dot`` file with
their own family tree.  This requires some fast typing, but then they
can process it with ``dot`` and view the picture generated by
graphviz.


A program to get to philosophy
==============================

The program I show you here is quite elaborate because it has to deal
with some possible scenarios that confuse the issue of which is the
"first link" in a wikipedia page.  We have provisions that:

* exclude links that come in parentheses

* exclude links before the start of the first paragraph

* exclude links to Wikipedia "meta pages": those that start with
  ``File:``, ``Help:`` or ``Wikipedia:``, and those that end with ``.svg``
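
The last of these provisions can be sketched in a few lines.  This is
only an illustration: the names ``META_PREFIXES`` and
``is_meta_link()`` are made up here, and the sketch assumes that a
link is excluded if it matches *any* of the patterns.

.. code-block:: python

   # prefixes that mark Wikipedia "meta pages"
   META_PREFIXES = ('File:', 'Help:', 'Wikipedia:')


   def is_meta_link(page_name):
       """Return True if page_name looks like a meta page or an image."""
       # str.startswith() accepts a tuple and checks each prefix in turn
       return (page_name.startswith(META_PREFIXES)
               or page_name.endswith('.svg'))


   for name in ['Philosophy', 'Help:Contents', 'Diagram.svg']:
       print(name, is_meta_link(name))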

In :numref:`listing-gtp-py` we get to see the kind of small algorithm
we invent as we do this kind of text processing: for example, the code
counts the number of open parentheses that have not yet been closed,
so that it can skip links that appear inside parentheses.
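
The idea of counting unclosed parentheses can be sketched on its own,
away from the full program.  The function name ``paren_depth_at()``
and the sample sentence are assumptions made for this illustration,
and the sketch works on plain text; in real HTML, parentheses can also
show up inside tags (for example in URLs), which a full program has to
handle separately.

.. code-block:: python

   def paren_depth_at(text, index):
       """Count parentheses opened but not yet closed before text[index]."""
       depth = 0
       for ch in text[:index]:
           if ch == '(':
               depth += 1
           elif ch == ')' and depth > 0:
               depth -= 1
       return depth


   # a made-up sentence with a parenthetical remark
   text = 'Alpha (from Greek beta) is a gamma'
   print(paren_depth_at(text, text.find('beta')))    # 1: inside ( )
   print(paren_depth_at(text, text.find('gamma')))   # 0: outside ( )

A link that starts at a position where the depth is greater than zero
is inside parentheses, so it should not count as the "first link".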

Now enter the program in :numref:`listing-gtp-py`:

.. _listing-gtp-py:

.. literalinclude:: gtp.py
   :language: python
   :caption: Examine the "Getting To Philosophy" principle on
             wikipedia.

Run the program with no arguments:

.. code-block:: console

   $ python3 gtp.py

The results can be seen in :numref:`fig-gtp_graph`.

.. _fig-gtp_graph:

.. figure:: gtp_graph.dot.*

   A graph that shows what happens when you keep clicking the first
   link in a Wikipedia page.  This often ends up in the Wikipedia
   entry on `Philosophy <http://en.wikipedia.org/wiki/Philosophy>`_.

You can also run ``python3 gtp.py`` with one or more arguments.  These
arguments can be full Wikipedia URLs or they can be just the final
portion.  For example:

.. code-block:: console

   $ chmod 755 gtp.py
   $ ./gtp.py https://en.wikipedia.org/wiki/Asterix
   $ evince Asterix.pdf &

or, alternatively:

.. code-block:: console

   $ chmod 755 gtp.py
   $ ./gtp.py Asterix
   $ evince Asterix.pdf &
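
Accepting both forms of argument takes only a little string handling.
The sketch below shows one way it could be done; the function names
``page_name_from_arg()`` and ``url_from_arg()`` are hypothetical and
are not necessarily what ``gtp.py`` uses.

.. code-block:: python

   def page_name_from_arg(arg):
       """Accept a full URL or a bare page name; return the page name."""
       if arg.startswith(('http://', 'https://')):
           # keep only what comes after the last slash
           return arg.rstrip('/').rsplit('/', 1)[-1]
       return arg


   def url_from_arg(arg):
       """Return the full English Wikipedia URL for either form."""
       return 'https://en.wikipedia.org/wiki/' + page_name_from_arg(arg)


   print(page_name_from_arg('https://en.wikipedia.org/wiki/Asterix'))
   print(url_from_arg('Asterix'))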

.. figure:: Asterix.dot.*
   :width: 20%

   The chain of first clicks starting at Asterix, obtained with
   ``./gtp.py Asterix`` -- it is amusing to note that the chain passes
   through the article on Logic.

When things go wrong
====================

.. note::

   Wikipedia pages can change for several reasons.  These include the
   ordinary editing of pages, as well as changes to the *MediaWiki*
   software that generates the web site from the original wiki markup.

   At this time (2023-10-05) the examples I give below show possible
   failures in the ``gtp.py`` program, but at another time these might
   have been fixed.  Still, it is likely that there will always be
   wikipedia pages that break the assumptions made here.

Ordinary Wikipedia articles seem to start the main text with a
``<p>`` element, which lets us use the simple instruction:

.. code-block:: python

   first_p_ind = page_html.find('<p>')

to find the start of the useful text.  But some Wikipedia pages, such
as list or topic pages, have a different structure.

But even some pages that are not special might break this: at the time
of writing this section, the *Complex system* page is organized with a
right side bar which has ``<p>`` elements in it, and these come before
the main text.
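
The effect is easy to reproduce with a toy page.  The HTML below is
invented for this illustration (it is not the real markup of any
Wikipedia page), but it has the same shape: a sidebar with its own
``<p>`` element placed before the main text.

.. code-block:: python

   # a made-up page: the sidebar comes before the main content
   page_html = (
       '<div class="sidebar">'
       '<p><a href="/wiki/Sidebar_topic">side</a></p>'
       '</div>'
       '<div class="content">'
       '<p>A <a href="/wiki/Main_topic">main</a> link.</p>'
       '</div>'
   )

   first_p_ind = page_html.find('<p>')           # lands in the sidebar
   href_ind = page_html.find('href="', first_p_ind)
   start = href_ind + len('href="')
   end = page_html.find('"', start)
   first_link = page_html[start:end]
   print(first_link)                             # /wiki/Sidebar_topic

The search finds the sidebar link, not the link in the main text,
because ``find('<p>')`` knows nothing about which part of the page a
paragraph belongs to.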

So running ``./gtp.py Complex_system`` goes to
``Collective_intelligence`` instead of ``system``, which then ends up
taking us into a loop with no progress:

.. code-block:: console

   $ chmod 755 gtp.py
   $ ./gtp.py Complex_system
   $ evince Complex_system.pdf &

.. figure:: Complex_system.dot.*
   :width: 20%

   The chain of first clicks starting at Complex_system, obtained with
   ``./gtp.py Complex_system``.  This is a failure of our program: it
   takes its first link from the sidebar instead of the main text.


When we simply don't "get to philosophy"
========================================

Sometimes an article just breaks the mold.  At the time I wrote an
earlier version of this section, Roman_Empire would loop back and
forth to Octavian.

While this might be semi-humorously seen as an insightful comment by
the "getting to philosophy" meme, it is worth noting that our software
had worked well: if you looked at the articles on *Roman Empire* and
*Octavian* at that time you would have seen that they do indeed
reference each other as first links.

So this was a failure of the meme, not of our program.
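
A loop like this is easy to detect if we remember which pages we have
already visited.  The sketch below is a small recursive illustration
that uses a dictionary in place of real web pages; the function name
``follow_chain()`` and the dictionaries are assumptions made for this
example, and ``gtp.py`` may handle loops differently.

.. code-block:: python

   def follow_chain(page, first_link, visited=None):
       """Recursively follow first links from page.

       first_link maps a page name to the page its first link points to.
       Returns (list_of_pages, outcome), where outcome is 'reached',
       'loop' or 'dead end'.
       """
       if visited is None:
           visited = []
       if page in visited:                 # we have been here before
           return visited + [page], 'loop'
       visited = visited + [page]
       if page == 'Philosophy':
           return visited, 'reached'
       next_page = first_link.get(page)
       if next_page is None:               # no known first link
           return visited, 'dead end'
       return follow_chain(next_page, first_link, visited)


   # the two-page loop described above, as a toy dictionary
   loop_links = {'Roman_Empire': 'Octavian', 'Octavian': 'Roman_Empire'}
   print(follow_chain('Roman_Empire', loop_links))

   # placeholder page names for a chain that does arrive
   ok_links = {'Page_A': 'Page_B', 'Page_B': 'Philosophy'}
   print(follow_chain('Page_A', ok_links))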

As it turns out, at the time of revising (2023-11-06) I find that the
Roman Empire article has been revised to start with a link to the
Roman Republic, rather than first linking to Octavian.  This restores
the Getting to Philosophy meme for "Roman Empire", although we can
expect similar loops to occur in other articles.

In :numref:`fig-roman-empire-loop` I show the graph I got at the
earlier time, when the loop was still present.

.. _fig-roman-empire-loop:

.. figure:: Roman_Empire.dot.*
   :width: 20%

   The chain of first clicks starting at Roman_Empire, obtained with
   ``./gtp.py Roman_Empire`` on October 10 2023 when the article had a
   different first link.  This was not a failure of our program: it
   was simply a different structuring of the Wikipedia articles by
   their authors.