31. Getting to philosophy

[status: content-mostly-written]

31.1. Motivation, prerequisites, plan

Motivation

Go to any Wikipedia page and follow the first link in the body of its text; then follow the first link of that page, and so forth. For almost all Wikipedia pages this procedure will eventually lead you to the Wikipedia page on Philosophy. This observation has its own Wikipedia page:

https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy

Note

When we say “first link” on a Wikipedia page, we mean the first link of the article content itself, after all the “for other uses”, “(from Greek …)”, and other frontmatter, which are not part of the article.

This is not a rigorous or deep observation, but it allows us to write some software to analyze and visualize this assertion, and that journey will teach us some very cool programming techniques. Along the way we will:

  • Explore the “Getting to Philosophy” observation.

  • Learn how to do a bit of web scraping and text manipulation.

  • Use recursive programming for a real world application.

  • Learn about the remarkable graphviz software.

Prerequisites

  • The 10-hour “serious programming” course.

  • The “Data files and first plots” mini-course in Section 2.

  • Recursion from Section 20.

  • Web scraping from Section 30.

Plan

So how do we write programs that study and visualize this idea? We will:

  1. Review what web pages look like.

  2. Write programs that retrieve and pick apart web pages looking for links.

  3. Learn about graphviz.

  4. Use graphviz to analyze the flow of links in our simple web pages.

  5. Refine those programs so they can search through the more complex HTML structure of Wikipedia articles.

  6. Output the “first link” chain in various Wikipedia pages to a file so that graphviz can show us an interesting visualization of that chain.

31.2. Parsing simple web pages

You should quickly review the brief section on what web pages look like in Section 30.2 before continuing in this section.

Let us start with the simple web page we had in Listing 30.2.1 back in Section 30.2.

Now write a program which finds the first hyperlink in a web page. There are many ways of doing this with sophisticated Python libraries, but we will start with a simple approach that uses only Python’s string methods. An example is in Listing 31.2.1.

Listing 31.2.1 Look through the text of a page for the first hypertext link.
#! /usr/bin/env python3

import sys

def main():
    ## get the entire html text from the file
    f = open(sys.argv[1])
    text = f.read()
    f.close()
    pos, link_url = find_first_link(text)
    print('pos, link URL:', pos, link_url)
    print('last_part:', url2last_part(link_url))


def find_first_link(text):
    """Finds the first hyperlink in a string of HTML (which might be the
    entire HTML file).  Returns a pair with the position of the first
    link and the href attribute of the link.
    """
    ## search for bits of string that show we're at the start of an anchor
    start_of_anchor = text.find('<a')
    remaining_text = text[start_of_anchor:]
    ## now search for the start of the link, which comes right after href="
    start_of_link = remaining_text.find('href="') + len('href="')
    ## find the end of the link
    text_from_start_of_link = remaining_text[start_of_link:]
    end_of_link = text_from_start_of_link.find('"')
    ## now we have the end of the link, so we take a slice of text
    ## that ends there
    link_url = text_from_start_of_link[:end_of_link]
    ## finally let's keep track of the position in the file of the
    ## start of the link
    link_pos = start_of_anchor + start_of_link
    return link_pos, link_url


def url2last_part(url):
    """Take a URL and pick out the last part, without the .html"""
    url = url.strip()
    last_slash_ind = url.rfind('/')
    last_dot_ind = url.rfind('.')
    ## note: rfind() returns -1 when there is no '/', and -1 + 1 == 0,
    ## so this also handles a bare filename like "myinfo.html"
    return url[last_slash_ind+1:last_dot_ind]

if __name__ == '__main__':
    main()

Running this program will give the position and text of the first hyperlink in that HTML file:

$ ./find_first_link.py myinfo.html
pos, link URL: 330 myinfo.html
last_part: myinfo
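
The string-method approach is all we need for this chapter, but it is worth knowing that Python’s standard library also includes an HTML parser. As a brief optional sketch (not used in the rest of the chapter), here is how the same first href could be found with the html.parser module, feeding it the same myinfo.html example page:

#! /usr/bin/env python3

from html.parser import HTMLParser
import sys

class FirstLinkFinder(HTMLParser):
    """Remember the href of the first <a> tag we see; ignore the rest."""
    def __init__(self):
        super().__init__()
        self.first_href = None
    def handle_starttag(self, tag, attrs):
        if tag == 'a' and self.first_href is None:
            attrs = dict(attrs)   # attrs arrives as a list of (name, value) pairs
            if 'href' in attrs:
                self.first_href = attrs['href']

parser = FirstLinkFinder()
parser.feed(open(sys.argv[1]).read())
print('first link URL:', parser.first_href)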

31.3. Making vertex and edge graphs

The word graph can mean many things. In computer science and mathematics a graph is a collection of “things” (vertices) together with the connections between them (edges). When we draw a graph, the vertices are shown as shapes and the edges are shown as lines or arrows.

There is a very cool program called graphviz which lets you write a simple text file and get a graph drawn from it. In Listing 31.3.1 there is a simple example that shows a bit of President Kennedy’s family tree:

Listing 31.3.1 The Kennedy family tree
## you can process this with
## dot -Tpdf -O kennedys.dot

digraph family_trees {
  Rose_Fitzgerald_Kennedy -> John_Fitzgerald_Kennedy;
  Mary_Josephine_Hannon -> Rose_Fitzgerald_Kennedy;
  Joseph_P_Kennedy_Jr -> John_Fitzgerald_Kennedy;
  Mary_Augusta_Hickey -> Joseph_P_Kennedy_Jr;
  Patrick_Joseph_Kennedy -> Joseph_P_Kennedy_Jr;
  John_F_Fitzgerald -> Rose_Fitzgerald_Kennedy;
}

You can then generate the picture with:

dot -Tsvg -O kennedys.dot
dot -Tpng -O kennedys.dot
dot -Tpdf -O kennedys.dot

Figure 31.3.1 The immediate family tree of President Kennedy, rendered with graphviz.

You can see more elaborate and sometimes quite visually striking examples at the graphviz web site: http://www.graphviz.org/Gallery.php

You can see how such a graph would be a good way to visualize the paths through Wikipedia pages.

But first let’s take some baby steps: to get more comfortable with how graphviz works, create your own .dot file with your own family tree. This requires a bit of typing, but you can then process it with dot and view the picture graphviz generates.
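
If typing the .dot file by hand gets tedious, here is a small optional sketch that writes such a file from a list of (parent, child) pairs and then runs dot on it. The names here are made-up placeholders, so substitute your own family:

#! /usr/bin/env python3

import os

## each pair is (parent, child); these names are placeholders
pairs = [
    ('Grandmother_Ada', 'Mother_Beatrice'),
    ('Grandfather_Carl', 'Mother_Beatrice'),
    ('Mother_Beatrice', 'Me'),
]

with open('my_family.dot', 'w') as f:
    f.write('digraph family_tree {\n')
    for parent, child in pairs:
        f.write(f'  {parent} -> {child};\n')
    f.write('}\n')

## run graphviz to make my_family.dot.pdf
os.system('dot -Tpdf -O my_family.dot')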

31.4. A program to get to philosophy

The program I show you here is quite elaborate because it has to deal with several scenarios that complicate the question of which link is the “first link” in a Wikipedia page. We have provisions to:

  • exclude links that come in parentheses

  • exclude links before the start of the first paragraph

  • exclude links to Wikipedia “meta pages” (those that start with File:, Help:, or Wikipedia:) and links that end with .svg

In Listing 31.4.1 we get to see the kind of small algorithms we invent as we do this kind of text processing: for example, the code counts how many parentheses have been opened but not yet closed at a given point in the text.
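
To see that counting idea in isolation before reading the full program, here is a tiny sketch that tries it on a made-up snippet of HTML:

def unclosed_parens(text, ind):
    """Count parentheses opened but not yet closed before index ind."""
    n_open = 0
    for char in text[:ind]:
        if char == '(':
            n_open += 1
        elif char == ')':
            n_open -= 1
    return n_open

sample = 'Asterix (see <a href="/wiki/Asterix_(comics)">the comic</a>) is a hero.'
print(unclosed_parens(sample, sample.find('<a')))   # prints 1: that link sits inside parentheses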

Now enter the program in Listing 31.4.1:

Listing 31.4.1 Examine the “Getting To Philosophy” principle on wikipedia.
#! /usr/bin/env python3

"""Getting to philosophy: "scrape" a Wikipedia page and follow its
first link recursively to see if you end up at the Philosophy page
https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy

"""
import urllib.request
import urllib.parse
import os
import sys

## these are the topic names that count as reaching Philosophy -- if we reach one of them we terminate
# philosophy_list = ['Philosophy', 'Philosophical', 'Existence', 'Semiotics', 'Logic', 'Knowledge']
# philosophy_list = ['Philosophy', 'Philosophical', 'Existence',
#                    'Semiotics', 'Semantics', 'Knowledge', 'Linguistics', 'Logic',
#                    'Reasoning']
# philosophy_list = ['Philosophy', 'Philosophical', 'Existence']
philosophy_list = ['Philosophy']

## this is the default list of topics we experiment with
topics_default = [
    'https://en.wikipedia.org/wiki/Xkcd',
    'https://en.wikipedia.org/wiki/GNU_Project',
    'https://en.wikipedia.org/wiki/Bertrand_Russell',
    'https://en.wikipedia.org/wiki/Plague_of_Justinian',
    'https://en.wikipedia.org/wiki/Spark_plug',
    'https://en.wikipedia.org/wiki/Quantum_entanglement',
    'https://en.wikipedia.org/wiki/Hipparchia_of_Maroneia',
    'https://en.wikipedia.org/wiki/Toilet_paper'
]

def main():
    topics = topics_default
    if len(sys.argv) > 1:
        # if user gives URLs on the command line then we use those
        # instead of the default topics
        topics = sys.argv[1:]
    if len(topics) > 1:
        graphviz_fname = 'gtp_graph.dot' # default output file
    else:
        ## if we request a single topic then we can use that as a
        ## filename
        graphviz_fname = topics[0].split('/')[-1] + '.dot'
    print('# GRAPHVIZ_FNAME:', topics, graphviz_fname)
    # canonicalize the filename to remove characters like ':'
    graphviz_fname = canonicalize_topic(graphviz_fname)
    ## give an error message if the program "dot" (from the package
    ## graphviz) is not available
    if not os.path.exists('/usr/bin/dot'):
        print('Error: the program "dot" does not seem to be installed;')
        print('you can install it with "sudo apt install graphviz"')
        print('and start again')
        sys.exit(1)
    start_graphviz_file(graphviz_fname)
    ## now analyze all the topics
    for topic_url in topics:
        print(f'INITIAL_TOPIC: {url2topic(topic_url)}')
        try:
            url_list = analyze_url([topic_url])
        except RecursionError:
            print(f'Recursion limit exceeded on {topic_url}')
            continue
        except RuntimeError:
            print(f'Recursion limit exceeded on {topic_url}')
            continue
        write_graph_file(url_list, graphviz_fname)
        ## now print some information about what we just did
        print(f'{url2topic(topic_url)} went through {len(url_list)} topics', end="")
        print(f' to reach {(url2topic(url_list[-1]))}')
    ## put the closing line in the graphviz file
    end_graphviz_file(graphviz_fname)
    print('graph information written to file %s' % graphviz_fname)
    ## now run graphviz (the command line is "dot") to make pdf, svg
    ## and png files
    os.system('dot -Tpdf -O %s' % graphviz_fname)
    os.system('dot -Tsvg -O %s' % graphviz_fname)
    os.system('dot -Tpng -O %s' % graphviz_fname)
    print('used "dot" to generate the files %s, %s, %s' 
          % (graphviz_fname + '.pdf', graphviz_fname + '.svg', 
             graphviz_fname + '.png'))

def analyze_url(urls_so_far):
    """This function analyzes a URL.  We first grab the "next" URL (the
    first link in the page).  If the URL is the arrival point
    (i.e. the Philosophy article) then we return right away with the
    list of URLs visited so far.  If the URL has already appeared
    before then we declare we are in a loop.  If we have had more than
    100 URLs then we return without analyzing further.  The above were
    all terminations, but if *none* of those conditions happen then we
    recursively call this function again to analyze the next URL.
    """
    url = urls_so_far[-1]           # analyze the last one added
    # before we analyze it, first see if they just gave the topic
    # without the full https:// URL
    wikipedia_prefix = 'https://en.wikipedia.org/wiki/'
    if not url.startswith(wikipedia_prefix):
        url = wikipedia_prefix + url
    # then do the analysis recursively
    page_html = urllib.request.urlopen(url).read()
    # next_url = analyze_page(url, str(page_html))
    next_url = analyze_page(url, page_html.decode('utf-8'))
    urls_so_far.append(next_url)
    ## print it out - we pad it with zeros and then end it with \r
    ## instead of \n so that we get that cheap animation feel
    print(f'{url2topic(urls_so_far[0])} -- HOP {len(urls_so_far)} -- {url2topic(next_url)}' + ' '*20,
          end="\r")
    if url2topic(next_url).strip('/') in philosophy_list:
        return (urls_so_far)
    elif urls_so_far.count(next_url) > 1:
        return (urls_so_far + urls_so_far[-2:])
    elif len(urls_so_far) > 100:
        return (urls_so_far)
    else:
        return analyze_url(urls_so_far)

def analyze_page(master_url, page_html):
    """Finds the first href (hyptertext link) in the given page."""
    first_href = find_first_href_after_paragraph(master_url, page_html)
    first_href = 'https://en.wikipedia.org%s' % first_href
    return first_href

def find_first_href_after_paragraph(master_url, page_html):
    """Find the first hyperlink after the first <p> tag in the document.
    This is becuase in wikipedia the article content actually starts
    with a <p> tag after all the warnings and other frontmatter have
    been put out.
    """
    # first_p_ind = page_html.find('<p><b>')
    first_p_ind = page_html.find('<p>')
    html_after_p = page_html[first_p_ind:]
    anchor_split = html_after_p.split('</a>')
    anchor_tag = '<a href="'
    endtag = '"'
    end = 0                   # FIXME: what should the end default be?
    ## FIXME: must exclude the "warning" type of text, which might be
    ## enclosed in this kind of tags: <td class="mbox-text">
    open_parentheses_until_here = 0
    open_paragraphs_until_here = 0
    for i, anchor_text in enumerate(anchor_split):
        if anchor_tag in anchor_text:
            # ind = anchor_text.index(anchor_tag)
            base_pos = html_after_p.find(anchor_text)
            pos_after_anchor = anchor_text.find(anchor_tag)
            ## we must also exclude URLs that come up in parentheses,
            ## so we must review all the text leading up to the URL
            ## for open parentheses
            open_parentheses_until_here = count_open_parentheses(master_url, html_after_p, 
                                                                 base_pos + pos_after_anchor)
            open_paragraphs_until_here = count_open_paragraphs(master_url, page_html,
                                                               first_p_ind + base_pos + pos_after_anchor)
            ## trim the text
            anchor_text = anchor_text[pos_after_anchor + len(anchor_tag):]
            try:
                end = anchor_text.index(endtag)
            except ValueError:
                break
        href_url = anchor_text[:end]
        if not href_url.startswith('/wiki/'):
            continue            # we only look at /wiki links
        if open_parentheses_until_here > 0:
            continue            # skip anchors that are in parentheses
        if open_paragraphs_until_here <= 0:
            continue
        ## there are only some URLs we consider: those that start with
        ## /wiki/ (because they point within wikipedia), those that do
        ## not end with .svg (otherwise we'd be getting images), ...
        if (href_url.startswith('/wiki/')
            and not href_url.endswith('.svg')
            and not href_url.startswith('/wiki/File:')
            and not href_url.startswith('/wiki/Help:')
            and not href_url.startswith('/wiki/Wikipedia:')):
            return anchor_text[:end]
    assert(False)               # we should never get here


def write_graph_file(url_list, graphviz_fname):
    """write our list of URLs to a graphviz file"""
    with open(graphviz_fname, 'a') as f:
        prev_topic = url2topic(url_list[0])
        for url in url_list[1:]:
            brief_topic = url2topic(url)
            f.write('    "%s" -> "%s";\n' 
                    % (canonicalize_topic(prev_topic),
                       canonicalize_topic(brief_topic)))
            prev_topic = brief_topic
            f.flush()

def start_graphviz_file(fname):
    """put opening information for graphviz at the start of a file"""
    with open(fname, 'w') as f: # zero it out
        f.write('digraph gtp {\n')

def end_graphviz_file(fname):
    """put closing/footer information at the end of a graphviz file"""
    with open(fname, 'a') as f:
        f.write('}\n')

def url2topic(url):
    """Takes a wikipedia URL and strips the boiler plate information to
    give just the name of the topic"""
    # last_slash = url.rfind('/')
    # brief_topic = url[last_slash+1:]
    brief_topic = url.split('/')[-1].strip('/')
    return brief_topic

def canonicalize_topic(topic):
    result = topic
    ## first change the %xx escape sequences used by http URLs back to
    ## their single characters
    result = urllib.parse.unquote(result)
    ## then remove parentheses and hashtags and dashes, replacing them
    ## with underscores
    result = result.replace('(', '_')
    result = result.replace(')', '_')
    result = result.replace('#', '_')
    result = result.replace('-', '_')
    result = result.replace(':', '_')
    return result


def count_open_parentheses(master_url, text, ind):
    """counts how many levels of parentheses are open leading up to this
    index in the text"""
    n_open = 0
    for i, char in enumerate(text[:ind+1]):
        if char == '(':
            n_open += 1
        if char == ')':
            n_open -= 1
    return n_open

def count_open_paragraphs(master_url, text, ind):
    """counts how many levels of <p> paragraph tags are open leading up
    to this index in the text"""
    n_open = 0
    for i in range(len(text[:ind+1])):
        if text[i:i+3] == '<p>' or text[i:i+3] == '<p ':
            n_open += 1
        if text[i:i+4] == '</p>':
            n_open -= 1
    return n_open

if __name__ == '__main__':
    main()

If you run:

$ python3 gtp.py

then the program analyzes the default list of topics. The results can be seen in Figure 31.4.1.


Figure 31.4.1 A graph that shows what happens when you keep clicking the first link in a Wikipedia page. This often ends up in the Wikipedia entry on Philosophy.

You can also run python3 gtp.py with one or more arguments. These arguments can be full Wikipedia URLs, or they can be just the final portion of the URL (the topic name). For example:

$ chmod 755 gtp.py
$ ./gtp.py https://en.wikipedia.org/wiki/Asterix
$ evince Asterix.pdf &

or, alternatively:

$ chmod 755 gtp.py
$ ./gtp.py Asterix
$ evince Asterix.pdf &

Figure 31.4.2 The chain of first clicks starting at Asterix, obtained with ./gtp.py Asterix. It is amusing to note that the chain passes through the article on Logic.

31.5. When things go wrong

Note

Wikipedia pages can change for several reasons. These include the ordinary editing of pages, as well as changes to the MediaWiki software that generates the web site from the original wiki markup.

At this time (2023-10-05) the examples I give below show possible failures in the gtp.py program, but by the time you read this they might have been fixed. Still, it is likely that there will always be Wikipedia pages that break the assumptions made here.

Ordinary Wikipedia articles seem to start the main article text with a <p> element, which allowed us to use the simple instruction:

first_p_ind = page_html.find('<p>')

to find the start of the useful text. But some Wikipedia pages have a different structure, such as list or topic pages.

But even some pages that are not special might break this: at the time of writing this section, the Complex system page is organized with a right sidebar which has <p> elements in it, and these come before the main text.

So running ./gtp.py Complex_system goes to Collective_intelligence instead of System, which then takes us into a loop with no progress:

$ chmod 755 gtp.py
$ ./gtp.py Complex_system
$ evince Complex_system.pdf &

Figure 31.5.1 The chain of first clicks starting at Complex_system, obtained with ./gtp.py Complex_system. This is a failure of our program.
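
One possible mitigation, shown here only as a sketch and assuming the sidebar is built from <table> markup (which is not guaranteed), would be to look for the first <p> that is not inside a table:

def first_p_outside_tables(page_html):
    """Return the index of the first <p> that is not inside a <table>.
    Rough heuristic: keep a count of <table> tags that have been opened
    but not yet closed, and only accept a <p> when that count is zero.
    """
    open_tables = 0
    for i in range(len(page_html)):
        if page_html.startswith('<table', i):
            open_tables += 1
        elif page_html.startswith('</table>', i):
            open_tables -= 1
        elif page_html.startswith('<p>', i) and open_tables == 0:
            return i
    return -1

This could stand in for the page_html.find('<p>') call in find_first_href_after_paragraph(), though it would need testing against real pages before we trust it.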

31.6. When we simply don’t “get to philosophy”

Sometimes an article just breaks the mold. At the time I wrote an earlier version of this section, Roman_Empire would loop back and forth with Octavian.

While this might be semi-humorously seen as an insightful comment made by the “getting to philosophy” procedure, it is worth noting that our software had worked well: if you had looked at the articles on Roman Empire and Octavian at that time, you would have seen that they did indeed reference each other as first links.

So this was a failure of the meme, not of our program.

As it turns out, at the time of this revision (2023-11-06) I find that the Roman Empire article has been changed to start with a link to the Roman Republic, rather than first linking to Octavian. This restores the Getting to Philosophy meme for “Roman Empire”, although we can expect this kind of loop to occur in other articles.

In Figure 31.6.1 I show the graph I had obtained at that time.


Figure 31.6.1 The chain of first clicks starting at Roman_Empire, obtained with ./gtp.py Roman_Empire on October 10 2023 when the article had a different first link. This was not a failure of our program: it was simply a different structuring of the Wikipedia articles by their authors.