29. Web scraping
[status: mostly-complete-needs-polishing-and-proofreading]
29.1. Motivation, prerequisites, plan
The web is full of information, and we often browse it visually with a browser. But when we collect a scientific data set from the web we do not want a “human in the loop”; rather, we want an automatic program to collect that data so that our results are reproducible and our procedure is fast and automatic.
Although my focus here is mainly on scientific applications, web scraping can also be used to mirror a web site.
Prerequisites
The 10-hour “serious programming” course.
The “Data files and first plots” mini-course in Section 2.
You should install the program wget:
$ sudo apt install wget
Plan
Our plan is to find some interesting data sets on the web. In our first approach, in Section 29.3, we will download them to our disk using the command-line program wget and plot them with gnuplot. Then in Section 29.4 we will show how you can retrieve data from within a Python program. Finally, in Section 29.5 we will scratch the surface of the amazing scientific data sets that can be found on the web.
We will try to look at both time history and image data. Time histories are data sets where we look at an interesting quantity as it changes in time.
Examples of time histories include temperature as a function of time (in fact, all sorts of weather and climate data) and stock market prices as a function of time.
Examples of image data include telescope images of the sky and satellite imagery of the earth and of the sun.
29.2. What does a web page look like underneath? (HTML)
When introducing students to the staples of a web page, remember:
Not everyone knows what HTML is.
Few people have seen HTML.
So we introduce HTML (hypertext markup language) by example first, and then point out what “hypertext” and “markup” mean.
So I type up a quick HTML page, and the students watch on the projector and type their own. The page I put up is a simple hello page at first; then I add a link.
<html>
<head>
<title>A simple web page</title>
</head>
<body>
<h1>Mark's web page</h1>
<p>This is Mark's web page</p>
<p>Now a paragraph with some <i>text in italics</i>
and some <b>text in boldface</b>
</p>
</body>
</html>
Save this to a file called, for example, myinfo.html in your home directory and then view it by pointing a web browser to file:///home/MYLOGINNAME/myinfo.html (yes, there are three slashes in the file URL file:///...).
That simple web page lets me explain what I mean by markup: bits of text like <p> and <i> and <head> are not text in the document: they specify how the document should be rendered (for example <b> and <i> specify how the text should look, while <p> breaks the text into paragraphs). Some of the tags don’t affect the text at all, but tell us how the document should be understood (for example the metadata tags <html> and <title>).
Then let’s add a hyperlink: a link to the student’s school. My HTML page now looks like:
<html>
<head>
<title>A simple web page</title>
</head>
<body>
<h1>Mark's web page</h1>
<p>This is Mark's web page</p>
<p>Now a paragraph with some <i>text in italics</i>
and some <b>text in boldface</b>
</p>
<p>Mark went to high school at
<a href="http://liceoparini.gov.it/">Liceo Parini</a>
</p>
</body>
</html>
Then save and reload the page in your browser.
Here I’ve introduced the hyperlink. In HTML this is made up of an element called <a> (anchor), which has an attribute called href containing the URL of the hyperlink.
So as we write programs that pick apart a web page, we now know what web pages look like. If we want to find the links in a web page we can use the Python string find() method to look for <a and then for </a>, and use the text in between the two.
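Here is a minimal sketch of that idea applied to the page above (the variable names are mine, and a real scraper would loop, calling find() again after each match):
#! /usr/bin/env python3
## minimal sketch: pull the first hyperlink out of a string of HTML
html = ('<p>Mark went to high school at '
        '<a href="http://liceoparini.gov.it/">Liceo Parini</a></p>')
start = html.find('<a')               # where the anchor element begins
end = html.find('</a>', start)        # where its closing tag begins
anchor = html[start:end]              # the <a ...> tag plus the link text
url = anchor.split('"')[1]            # the href value sits between quotes
text = anchor[anchor.find('>') + 1:]  # the link text follows the first >
print(url, '->', text)                # http://liceoparini.gov.it/ -> Liceo Parini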
29.3. Command line scraping with wget
In Section 8.1 we had our first glimpse of the command wget, a wonderful program which grabs a page from the web and puts the result into a file on your disk. This type of program is sometimes called a “web crawler” or “offline browser”.
wget can even follow links up to a certain depth and reproduce the web hierarchy on a local disk.
In areas with poor network connectivity people can use wget when there is a brief moment of good networking: they download all they need in a hurry, then point their browser to the data on their local disk.
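For example, a sketch of such a mirroring command (the URL and depth here are placeholders, not one of our data sets):
$ wget --recursive --level=2 --page-requisites --convert-links https://example.com/
Here --recursive follows links, --level=2 limits how deep to follow them, --page-requisites also fetches the images and stylesheets each page needs, and --convert-links rewrites the saved pages so their links point at the local copies.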
29.3.1. First download with wget
Let us make a directory in which to work and start getting data.
$ mkdir scraping
$ cd scraping
$ wget https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv
We now have a file called drinks.csv - how do we explore it? I would first use simple file tools: running less drinks.csv shows lines like this:
country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
Afghanistan,0,0,0,0.0
Albania,89,132,54,4.9
Algeria,25,0,14,0.7
Andorra,245,138,312,12.4
Angola,217,57,45,5.9
## ...
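A quick first question is how big the file is; wc counts its lines (one per country, plus the header):
$ wc -l drinks.csv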
If you like to see data in a spreadsheet you could try libreoffice or gnumeric:
libreoffice drinks.csv
29.3.2. Simple analysis of the drinks.csv file
Sometimes you can learn quite a bit about what’s in a file with simple shell tools, without using a plotting program or writing a data analysis program. I will show you some things you can do with one-line shell commands.
Looking at drinks.csv we see that the fourth column is the number of wine servings per capita drunk in that country. Let us use the command sort to order the file by wine consumption.
A quick look at the sort documentation with man sort shows us that the -t option can be used to use a comma instead of white space to separate fields. We also find out that the -k option can be used to specify a key and -g to sort numerically (including floating point). Put these together to try running:
sort -t , -k 4 -g drinks.csv
this will show you all those countries in order of increasing wine consumption, rather than in alphabetical order. To see just the last 15 lines you can run:
sort -t , -k 4 -g drinks.csv | tail -15
This is a great opportunity to laugh at the confirmation of some stereotypes and the negation of others.
If you look at the last few lines you see that the French consume the most wine per capita, followed by the Portuguese.
If you sort by the 5th column you will see overall alcohol consumption; the 3rd column shows the use of spirits (hard liquor), while the 2nd column shows consumption of beer.
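One wrinkle: the header line gets sorted in with the data. A sketch of how you might skip it (tail -n +2 prints from the second line onward) while ranking countries by the fifth column, total litres of pure alcohol:
tail -n +2 drinks.csv | sort -t , -k 5 -g | tail -15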
29.3.3. Looking at birth data
Next let us grab fivethirtyeight’s data set of US births. This file ends its lines with carriage returns instead of newlines, so we use tr to translate them before plotting the fifth column (the number of births) with gnuplot:
$ wget https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv
$ tr '\r' '\n' < US_births_2000-2014_SSA.csv > births_2000-2014_SSA-newline.csv
$ gnuplot
gnuplot> set datafile separator ","
gnuplot> plot 'births_2000-2014_SSA-newline.csv' using 5 with lines
29.4. Scraping from a Python program
29.4.1. Brief interlude on string manipulation
$ python3
>>> s = 'now is the time for all good folk to come to the aid of the party'
>>> s.split()
['now', 'is', 'the', 'time', 'for', 'all', 'good', 'folk', 'to', 'come', 'to', 'the', 'aid', 'of', 'the', 'party']
>>> # now we've seen what that looks like, save it into a variable
>>> words = s.split()
>>> words
['now', 'is', 'the', 'time', 'for', 'all', 'good', 'folk', 'to', 'come', 'to', 'the', 'aid', 'of', 'the', 'party']
>>> # now try to split where the separator is a comma
>>> csv_str = 'name,age,height'
>>> words = csv_str.split()
>>> words
['name,age,height']
>>> # didn't work; try telling split() to use a comma
>>> words = csv_str.split(',')
>>> words
['name', 'age', 'height']
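An aside: split(',') is fine for simple files like ours, but real CSV files can contain quoted fields with embedded commas. Python's standard csv module handles those correctly; a tiny sketch (the sample string here is made up):
>>> import csv
>>> from io import StringIO
>>> list(csv.reader(StringIO('name,age,height\n"Doe, Jane",43,1.70\n')))
[['name', 'age', 'height'], ['Doe, Jane', '43', '1.70']]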
29.4.2. The birth data from Python
#! /usr/bin/env python3

import urllib.request

day_map = {1: 'mon', 2: 'tue', 3: 'wed', 4: 'thu', 5: 'fri',
           6: 'sat', 7: 'sun'}

def main():
    f = urllib.request.urlopen('https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv')
    ## this file has carriage returns instead of newlines, so
    ## f.readlines() won't work in all cases.  I read the whole
    ## file in, and then split it into lines
    entire_file = f.read()
    f.close()
    lines = entire_file.split()
    print('lines:', lines[:3])
    dataset = []
    for line in lines[1:]:
        # print('line:', line, str(line))
        line = line.decode('utf-8')
        words = line.split(',')
        # print(words)
        values = [int(w) for w in words]
        dataset.append(values)
    day_of_week_hist = process_dataset(dataset)
    print_histogram(day_of_week_hist)

def process_dataset(dataset):
    ## NOTE: the fields are:
    ## year,month,date_of_month,day_of_week,births
    print('dataset has %d lines' % len(dataset))
    ## now we form a histogram of births according to the day of
    ## the week
    day_of_week_hist = {}
    for i in range(1, 8):
        day_of_week_hist[i] = 0
    for row in dataset:
        day_of_week = row[3]
        n_births = row[4]
        day_of_week_hist[day_of_week] += n_births
    return day_of_week_hist

def print_histogram(hist):
    print(hist)
    keys = list(hist.keys())
    keys.sort()
    print('keys:', keys)
    for day in keys:
        print(day, day_map[day], hist[day])

main()
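If you save this to a file (say births_hist.py; the name is my choice) and run it with python3 births_hist.py, you should see the raw histogram dictionary followed by one line per day of the week. In this data set the sat and sun totals come out noticeably lower than the weekday totals, presumably because induced and scheduled cesarean deliveries are rarely set for weekends.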
29.5. Finding neat scientific data sets
https://www.dataquest.io/blog/free-datasets-for-projects/ (they mention fivethirtyeight)
https://github.com/fivethirtyeight/data
29.5.1. Time histories
Temperature
Births
wget https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv
29.5.2. Images
NASA nebulae
GOES images of the sun
29.6. Beautiful Soup
Beautiful Soup is a powerful Python package that allows you to scrape web pages in a structured manner. Unlike the code we have seen so far, which does brute-force parsing of HTML text chunks in Python, Beautiful Soup is aware of the “document object model” (DOM).
Start by installing the Python package. You can probably install it with pip, or on Debian-based distributions you can run:
sudo apt install python3-bs4
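If you go the pip route instead, the package is published on PyPI under the name beautifulsoup4:
pip3 install beautifulsoup4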
Now enter the program billboard_hot_100_scraper_2023.py in Listing 29.6.1:
#! /usr/bin/env python3

"""This program was inspired by Jaimes Subroto who had written a
program that worked with the 2018 billboard html format.  Billboard
has changed its html format quite completely in 2023, so this is a
re-implementation that handles the new format.
"""

import urllib.request
from bs4 import BeautifulSoup as soup

def main():
    url = 'https://www.billboard.com/charts/hot-100'
    # url = 'https://web.archive.org/web/20180415100832/https://www.billboard.com/charts/hot-100/'
    # boilerplate stuff to load in an html page from its URL
    url_client = urllib.request.urlopen(url)
    page_html = url_client.read()
    url_client.close()
    # let us save it to a local html file, using utf-8 decoding so
    # that we turn the byte stream into ordinary text
    open('page_saved.html', 'w').write(page_html.decode('utf-8'))
    # boilerplate use of beautiful soup: use the html parser on the file
    page_soup = soup(page_html, "html.parser")
    # now for the part where you need to know the structure of the
    # html file.  by inspection I found that in 2023 they use <ul>
    # list elements with the attribute "o-chart-results-list-row", so
    # this is how you find those elements in beautiful soup:
    list_elements = page_soup.select('ul[class*=o-chart-results-list-row]')  # *= means contains
    # now that we have our list elements we are ready to read them
    # in; we also prepare the output csv file
    outfname = 'billboard_hot_100.csv'
    with open(outfname, 'w') as fp:
        headers = 'Song, Artist, Last Week, Peak Position, Weeks on Chart\n'
        fp.write(headers)
        # loop through each list element
        for element in list_elements:
            handle_single_row(element, fp)
    print(f'\nBillboard hot 100 table saved to {outfname}')

def handle_single_row(element, fp):
    all_list_items = element.find_all('li')
    title_and_artist = all_list_items[4]
    # try to separate out the title and artist.  title should be an
    # <h3> element, artist is a <span> element
    title = title_and_artist.find('h3').text.strip()
    artist = title_and_artist.find('span').text.strip()
    # now the rest of the columns
    last_week = all_list_items[7].text.strip()
    peak_pos = all_list_items[8].text.strip()
    weeks_on_chart = all_list_items[9].text.strip()
    # we have enough to write an entry in the csv file
    csv_line = f'"{title}", "{artist}", {last_week}, {peak_pos}, {weeks_on_chart}'
    print(csv_line)
    fp.write(csv_line + '\n')

if __name__ == '__main__':
    main()
If you run:
$ chmod +x billboard_hot_100_scraper_2023.py
$ ./billboard_hot_100_scraper_2023.py
the results can be seen in the CSV file billboard_hot_100.csv:
Song | Artist | Last Week | Peak Position | Weeks on Chart
---|---|---|---|---
A Bar Song (Tipsy) | Shaboozey | 1 | 1 | 22
I Had Some Help | Post Malone Featuring Morgan Wallen | 2 | 1 | 18
Espresso | Sabrina Carpenter | 3 | 3 | 22
Die With A Smile | Lady Gaga & Bruno Mars | 6 | 3 | 4
Birds Of A Feather | Billie Eilish | 7 | 5 | 17
Taste | Sabrina Carpenter | 5 | 2 | 3
Good Luck, Babe! | Chappell Roan | 8 | 6 | 23
Please Please Please | Sabrina Carpenter | 4 | 1 | 14
Lose Control | Teddy Swims | 9 | 1 | 57
Not Like Us | Kendrick Lamar | 10 | 1 | 19
Million Dollar Baby | Tommy Richman | 11 | 2 | 20
Too Sweet | Hozier | 12 | 1 | 25
Beautiful Things | Benson Boone | 13 | 2 | 34
Ain’t No Love In Oklahoma | Luke Combs | 14 | 13 | 17
Miles On It | Marshmello & Kane Brown | 17 | 15 | 19
Bed Chem | Sabrina Carpenter | 15 | 14 | 3
Lies Lies Lies | Morgan Wallen | 19 | 7 | 10
Hot To Go! | Chappell Roan | 18 | 16 | 15
Austin | Dasha | 21 | 18 | 27
Cowgirls | Morgan Wallen Featuring ERNEST | 16 | 12 | 39
The Emptiness Machine | Linkin Park | 21 | 1 |
Pink Skies | Zach Bryan | 20 | 6 | 16
I Am Not Okay | Jelly Roll | 23 | 23 | 13
Kehlani | Jordan Adetunji | 24 | 24 | 12
Saturn | SZA | 25 | 6 | 29
Like That | Future, Metro Boomin & Kendrick Lamar | 27 | 1 | 25
The Door | Teddy Swims | 36 | 27 | 15
28 | Zach Bryan | 33 | 14 | 10
Pour Me A Drink | Post Malone Featuring Blake Shelton | 26 | 12 | 12
Who | Jimin | 28 | 12 | 8
Good Graces | Sabrina Carpenter | 22 | 15 | 3
Slow It Down | Benson Boone | 34 | 32 | 25
I Can Do It With A Broken Heart | Taylor Swift | 32 | 3 | 21
TGIF | GloRilla | 35 | 28 | 12
Pink Pony Club | Chappell Roan | 30 | 26 | 13
Neva Play | Megan Thee Stallion & RM | 36 | 1 |
Guy For That | Post Malone Featuring Luke Combs | 40 | 17 | 7
Si Antes Te Hubiera Conocido | Karol G | 37 | 32 | 12
Stick Season | Noah Kahan | 39 | 9 | 50
Wanna Be | GloRilla & Megan Thee Stallion | 38 | 11 | 23
Big Dawgs | Hanumankind X Kalmi | 31 | 23 | 7
You Look Like You Love Me | Ella Langley Featuring Riley Green | 41 | 36 | 12
High Road | Koe Wetzel & Jessie Murph | 46 | 22 | 14
Stargazing | Myles Smith | 45 | 40 | 18
Wildflower | Billie Eilish | 49 | 17 | 17
360 | Charli xcx | 47 | 41 | 14
Sailor Song | Gigi Perez | 68 | 47 | 4
Houdini | Eminem | 44 | 2 | 15
Juno | Sabrina Carpenter | 29 | 22 | 3
Chevrolet | Dustin Lynch Featuring Jelly Roll | 56 | 50 | 13
Mamushi | Megan Thee Stallion Featuring Yuki Chiba | 52 | 36 | 11
Guess | Charli xcx Featuring Billie Eilish | 48 | 12 | 6
Red Wine Supernova | Chappell Roan | 51 | 41 | 15
One Of Wun | Gunna | 53 | 26 | 18
Help Me | Real Boston Richey | 58 | 55 | 8
I Love You, I’m Sorry | Gracie Abrams | 67 | 56 | 6
La Patrulla | Peso Pluma & Neton Vega | 70 | 57 | 8
Whiskey Whiskey | Moneybagg Yo Featuring Morgan Wallen | 55 | 21 | 13
Circadian Rhythm | Drake | 69 | 59 | 2
Coincidence | Sabrina Carpenter | 43 | 26 | 3
Losers | Post Malone Featuring Jelly Roll | 57 | 25 | 4
No Face | Drake | 60 | 60 | 2
Sharpest Tool | Sabrina Carpenter | 42 | 21 | 3
Gata Only | FloyyMenor X Cris Mj | 66 | 27 | 26
Lonely Road | mgk & Jelly Roll | 72 | 33 | 7
Nights Like This | The Kid LAROI | 64 | 47 | 12
Apple | Charli xcx | 65 | 51 | 8
The Boy Is Mine | Ariana Grande | 73 | 16 | 19
Think I’m In Love With You | Chris Stapleton | 75 | 49 | 19
Love You, Miss You, Mean It | Luke Bryan | 80 | 70 | 6
Wind Up Missin’ You | Tucker Wetmore | 78 | 63 | 24
Casual | Chappell Roan | 81 | 59 | 12
BAND4BAND | Central Cee & Lil Baby | 71 | 18 | 16
Am I Okay? | Megan Moroney | 85 | 74 | 5
Slim Pickins | Sabrina Carpenter | 50 | 27 | 3
Nel | Fuerza Regida | 82 | 73 | 7
Lunch | Billie Eilish | 79 | 5 | 17
Si No Quieres No | Luis R Conriquez x Neton Vega | 83 | 53 | 19
Belong Together | Mark Ambor | 86 | 74 | 19
It’s Up | Drake, Young Thug & 21 Savage | 74 | 28 | 5
Chihiro | Billie Eilish | 87 | 12 | 17
Beautiful As You | Thomas Rhett | 59 | 59 | 14
Don’t Smile | Sabrina Carpenter | 63 | 35 | 3
Dos Dias | Tito Double P & Peso Pluma | 84 | 1 |
Ruby Rosary | A$AP Rocky Featuring J. Cole | 85 | 1 |
Diet Pepsi | Addison Rae | 86 | 1 |
Dumb & Poetic | Sabrina Carpenter | 62 | 32 | 3
Crazy | LE SSERAFIM | 76 | 76 | 2
U My Everything | Sexyy Red & Drake | 88 | 44 | 16
Prove It | 21 Savage & Summer Walker | 97 | 43 | 10
Disco | Surf Curse | 91 | 1 |
Femininomenon | Chappell Roan | 89 | 66 | 8
Shake Dat Ass (Twerk Song) | BossMan DLow | 93 | 1 |
Nasty | Tinashe | 95 | 61 | 15
Baby I’m Back | The Kid LAROI | 95 | 1 |
Close To You | Gracie Abrams | 96 | 49 | 7
Residuals | Chris Brown | 97 | 2 |
Devil Is A Lie | Tommy Richman | 90 | 32 | 13
Parking Lot | Mustard & Travis Scott | 98 | 57 | 7
American Nights | Zach Bryan | 21 | 9 |