28. Web scraping

28.1. Motivation, prerequisites, plan

The web is full of information, and we often browse it visually with a browser. But when we collect a scientific data set from the web we do not want a “human in the loop”: rather, we want an automatic program to collect the data, so that our results are reproducible and our procedure is fast and automatic.

Although my focus here is mainly on scientific applications, web scraping can also be used to mirror a web site.

Prerequisites

  • The 10-hour “serious programming” course.

  • The “Data files and first plots” mini-course in Section 2.

  • You should install the program wget:

    $ sudo apt install wget
    

Plan

Our plan is to find some interesting data sets on the web.

In our first approach, in Section 28.3, we will download them to our disk using the command line program wget and plot them with gnuplot. Then in Section 28.4 we will show how to retrieve data from within a Python program.

Finally in Section 28.5 we will scratch the surface of all the amazing scientific data sets that can be found on the web.

We will try to look at both time history and image data. Time histories are data sets that follow an interesting quantity as it changes in time.

Examples of time histories include temperature as a function of time (in fact, all sorts of weather and climate data) and stock market prices as a function of time.

Examples of image data include telescope images of the sky and satellite imagery of the earth and of the sun.

28.2. What does a web page look like underneath? (HTML)

When introducing students to the building blocks of a web page, remember:

  • Not everyone knows what HTML is.

  • Few people have actually seen raw HTML source.

So we introduce HTML (hypertext markup language) by example first, and then point out what “hypertext” and “markup” mean.

So I type up a quick HTML page, and the students watch on the projector and type their own. The page starts as a simple hello page; then I add a link.

<html>
    <head>
        <title>A simple web page</title>
    </head>

    <body>
        <h1>Mark's web page</h1>
        <p>This is Mark's web page</p>
        <p>Now a paragraph with some <i>text in italics</i>
           and some <b>text in boldface</b>
        </p>
    </body>
</html>

Save this to a file called, for example, myinfo.html in your home directory and then view it by pointing a web browser to file:///home/MYLOGINNAME/myinfo.html (yes, there are three slashes in the file URL file:///...).

That simple web page lets me explain what I mean by markup: bits of text like <p> and <i> and <head> are not part of the document’s text: they specify how the document should be rendered (for example <b> and <i> specify how the text should look, while <p> breaks the text into paragraphs). Some of the tags do not affect the text at all, but tell us how the document should be understood (for example <html>, which delimits the whole document, and the metadata tag <title>).

Then let’s add a hyperlink: a link to the student’s school. My html page now looks like:

Listing 28.2.1 A simple web page with an anchor (hyperlink) element in it.
<html>
    <head>
        <title>A simple web page</title>
    </head>

    <body>
        <h1>Mark's web page</h1>
        <p>This is Mark's web page</p>
        <p>Now a paragraph with some <i>text in italics</i>
           and some <b>text in boldface</b>
        </p>
        <p>Mark went to high school at
           <a href="http://liceoparini.gov.it/">Liceo Parini</a>
        </p>
    </body>
</html>

Then save and reload the page in your browser.

Here I’ve introduced the hyperlink. In HTML a hyperlink is an element called <a> (anchor) with an attribute called href that contains the URL being linked to.

Now that we know what web pages look like underneath, we can write programs that pick them apart. If we want to find the links in a web page we can use the Python string find() method to look for <a and then for </a>, and use the text between the two.
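
For example, here is a minimal sketch of that find()-based approach. It is not a robust HTML parser (we will meet one in Section 28.6); the html string is just the anchor fragment from the listing above:

html = '''<p>Mark went to high school at
   <a href="http://liceoparini.gov.it/">Liceo Parini</a></p>'''

## find the first anchor element
start = html.find('<a')
end = html.find('</a>', start)
anchor = html[start:end]
## anchor is now '<a href="http://liceoparini.gov.it/">Liceo Parini'

## pull the URL out of the href="..." attribute
url_start = anchor.find('href="') + len('href="')
url_end = anchor.find('"', url_start)
print('link text:', anchor[anchor.find('>') + 1:])
print('URL:', anchor[url_start:url_end])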

28.3. Command line scraping with wget

In Section 8.1 we had our first glimpse of the command wget, a wonderful program which grabs a page from the web and puts the result into a file on your disk. This type of program is sometimes called a “web crawler” or “offline browser”.

wget can even follow links up to a certain depth and reproduce the web hierarchy on a local disk.
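
For example, the following command (the URL and the depth here are just placeholders) asks wget to follow links two levels deep and to rewrite the links so that the local copy can be browsed offline:

$ wget --recursive --level=2 --convert-links https://example.org/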

In areas with poor network connectivity people can use wget when there is a brief moment of good networking: they download all they need in a hurry, then point their browser at the data on their local disk.

28.3.1. First download with wget

Let us make a directory in which to work and start getting data.

$ mkdir scraping
$ cd scraping
$ wget https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv

We now have a file called drinks.csv - how do we explore it?

I would first use simple file tools:

less drinks.csv

shows lines like this:

country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
Afghanistan,0,0,0,0.0
Albania,89,132,54,4.9
Algeria,25,0,14,0.7
Andorra,245,138,312,12.4
Angola,217,57,45,5.9
## ...

If you like to see data in a spreadsheet you could try to use libreoffice or gnumeric:

libreoffice drinks.csv

28.3.2. Simple analysis of the drinks.csv file

Sometimes you can learn quite a bit about what’s in a file with simple shell tools, without using a plotting program or writing a data analysis program. I will show you some things you can do with one-line shell commands.

Looking at drinks.csv we see that the fourth column is the number of wine servings per capita drunk in that country. Let us use the command sort to order the file by wine consumption.

A quick look at the sort documentation with man sort shows us that the -t option makes sort use a comma instead of white space to separate fields. We also find that the -k option specifies which field to sort on (the key) and that -g sorts numerically (handling floating point numbers too). Put these together and try running:

sort -t , -k 4 -g drinks.csv

this will show you all those countries in order of increasing wine consumption, rather than in alphabetical order. To see just the last 15 lines you can run:

sort -t , -k 4 -g drinks.csv | tail -15

This is a great opportunity to laugh at the confirmation of some stereotypes and the negation of others.

If you look at the last few lines you see that the French consume the most wine per capita, followed by the Portuguese.

Sorting by the 5th column shows the overall use of alcohol, the 3rd column shows the use of spirits (hard liquor), and the 2nd column shows consumption of beer.
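
For example, to see the 15 countries with the highest total alcohol consumption we sort on field 5 with the same options as before:

sort -t , -k 5 -g drinks.csv | tail -15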

28.3.3. Looking at birth data

Next let us grab a bigger data set: the number of births in the USA on each day from 2000 to 2014, as recorded by the Social Security Administration. This file uses carriage return characters instead of newlines to end its lines, so we use the tr command to translate them before plotting. The fifth column is the number of births, which we can plot with gnuplot:

$ wget https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv
$ tr '\r' '\n' < US_births_2000-2014_SSA.csv > births_2000-2014_SSA-newline.csv
$ gnuplot
gnuplot> set datafile separator ","
gnuplot> plot 'births_2000-2014_SSA-newline.csv' using 5 with lines

The plot shows births per day over fifteen years; you should notice a strongly periodic pattern, which we will look into in Section 28.4.2 by counting births on each day of the week.

28.4. Scraping from a Python program

28.4.1. Brief interlude on string manipulation

To parse the files we download we will need to break strings into lines and fields, so let us first experiment with the string split() method in an interactive Python session:

$ python3
>>> s = 'now is the time for all good folk to come to the aid of the party'
>>> s.split()
['now', 'is', 'the', 'time', 'for', 'all', 'good', 'folk', 'to', 'come', 'to', 'the', 'aid', 'of', 'the', 'party']
>>> # now we've seen what that looks like, save it into a variable
>>> words = s.split()
>>> words
['now', 'is', 'the', 'time', 'for', 'all', 'good', 'folk', 'to', 'come', 'to', 'the', 'aid', 'of', 'the', 'party']
>>> # now try to split where the separator is a comma
>>> csv_str = 'name,age,height'
>>> words = csv_str.split()
>>> words
['name,age,height']
>>> # didn't work; try telling split() to use a comma
>>> words = csv_str.split(',')
>>> words
['name', 'age', 'height']
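
A related method, handy because the births file we are about to read uses carriage returns to end its lines, is splitlines(), which splits a string at any of the usual line endings ('\n', '\r', '\r\n'):

>>> 'Afghanistan,0\rAlbania,89\rAlgeria,25'.splitlines()
['Afghanistan,0', 'Albania,89', 'Algeria,25']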

28.4.2. The birth data from Python

Listing 28.4.1 get-birth-data.py - A program which downloads birth data.
#! /usr/bin/env python3

import urllib.request

day_map = {1: 'mon', 2: 'tue', 3: 'wed', 4: 'thu', 5: 'fri', 
           6: 'sat', 7: 'sun'}

def main():
    f = urllib.request.urlopen('https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv')
    ## this file has carriage returns instead of newlines, so
    ## f.readlines() won't work in all cases.  I read the whole
    ## file in, and then split it into lines
    entire_file = f.read()
    f.close()
    lines = entire_file.split()
    print('lines:', lines[:3])
    dataset = []
    for line in lines[1:]:
        # print('line:', line, str(line))
        line = line.decode('utf-8')
        words = line.split(',')
        # print(words)
        values = [int(w) for w in words]
        dataset.append(values)
    day_of_week_hist = process_dataset(dataset)
    print_histogram(day_of_week_hist)

def process_dataset(dataset):
    ## NOTE: the fields are:
    ## year,month,date_of_month,day_of_week,births
    print('dataset has %d lines' % len(dataset))
    ## now we form a histogram of births according to the day of the
    ## week
    day_of_week_hist = {}
    for i in range(1, 8):
        day_of_week_hist[i] = 0
    for row in dataset:
        day_of_week = row[3]
        n_births = row[4]
        day_of_week_hist[day_of_week] += n_births
    return day_of_week_hist

def print_histogram(hist):
    print(hist)
    keys = list(hist.keys())
    keys.sort()
    print('keys:', keys)
    for day in keys:
        print(day, day_map[day], hist[day])

main()
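
Run the program with:

$ python3 get-birth-data.py

It should report around 5500 lines (one per day for fifteen years), and the histogram should show markedly fewer births on Saturdays and Sundays than on weekdays.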

28.5. Finding neat scientific data sets

https://www.dataquest.io/blog/free-datasets-for-projects/ (they mention fivethirtyeight)

https://github.com/fivethirtyeight/data

28.5.1. Time histories

Temperature

Births

wget https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv

28.5.2. Images

NASA nebulae

GOES images of the sun

28.6. Beautiful Soup

Beautiful Soup is a powerful Python package that allows you to scrape web pages in a structured manner. Unlike the code we have seen so far, which does brute-force parsing of HTML text chunks in Python, Beautiful Soup is aware of the “document object model” (DOM).

Start by installing the Python package. You can install it with pip (the PyPI package is called beautifulsoup4), or on Debian-based distributions you can run:

sudo apt install python3-bs4
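
You can check that the installation worked by importing the package (the version number you see will differ):

$ python3
>>> import bs4
>>> bs4.__version__
'4.11.2'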

Now enter the program in Listing 28.6.1:

Listing 28.6.1 Download the Billboard Hot 100 list using Beautiful Soup.
#! /usr/bin/env python3

"""This program was inspired by Jaimes Subroto who had written a
program that worked with the 2018 billboard html format.  Billboard
has changed its html format quite completely in 2023, so this is a
re-implementation that handles the new format.
"""

import urllib.request
from bs4 import BeautifulSoup as soup

def main():
    url = 'https://www.billboard.com/charts/hot-100'
    # url = 'https://web.archive.org/web/20180415100832/https://www.billboard.com/charts/hot-100/'

    # boilerplate code to load an html page from its URL
    url_client = urllib.request.urlopen(url)
    page_html = url_client.read()
    url_client.close()

    # let us save it to a local html file, using utf-8 decoding so
    # that we turn the byte stream into readable text
    open('page_saved.html', 'w').write(page_html.decode('utf-8'))

    # boilerplate use of Beautiful Soup: run the html parser on the page
    page_soup = soup(page_html, "html.parser")

    # now for the part where you need to know the structure of the
    # html file.  by inspection I found that in 2023 they use <ul>
    # list elements with the attribute "o-chart-results-list-row", so
    # this is how you find those elements in beautiful soup:
    list_elements = page_soup.select('ul[class*=o-chart-results-list-row]') # *= means contains
    # now that we have our list elements we are ready to read things
    # in; we also prepare an output csv file
    outfname = 'billboard_hot_100.csv'
    with open(outfname, 'w') as fp:
        headers = 'Song, Artist, Last Week, Peak Position, Weeks on Chart\n'
        fp.write(headers)
        # Loops through each list element
        for element in list_elements:
            handle_single_row(element, fp)
    print(f'\nBillboard hot 100 table saved to {outfname}')

def handle_single_row(element, fp):
    all_list_items = element.find_all('li')
    title_and_artist = all_list_items[4]
    # try to separate out the title and artist.  title should be an
    # <h3> element, artist is a <span> element
    title = title_and_artist.find('h3').text.strip()
    artist = title_and_artist.find('span').text.strip()
    # now the rest of the columns
    last_week = all_list_items[7].text.strip()
    peak_pos = all_list_items[8].text.strip()
    weeks_on_chart = all_list_items[9].text.strip()
    # we have enough to write an entry in the csv file
    csv_line = f'"{title}", "{artist}", {last_week}, {peak_pos}, {weeks_on_chart}'
    print(csv_line)
    fp.write(csv_line + '\n')


if __name__ == '__main__':
    main()

Save the program to a file called billboard_hot_100_scraper_2023.py and run it:

$ chmod +x billboard_hot_100_scraper_2023.py
$ ./billboard_hot_100_scraper_2023.py

The results can be seen in the CSV file billboard_hot_100.csv:

Table 28.6.1 Billboard Hot 100 (a blank Last Week cell means the song was not on the previous week’s chart)

Song | Artist | Last Week | Peak Position | Weeks on Chart
Paint The Town Red | Doja Cat | 2 | 1 | 8
Snooze | SZA | 3 | 2 | 42
Fast Car | Luke Combs | 4 | 2 | 27
Cruel Summer | Taylor Swift | 6 | 3 | 21
I Remember Everything | Zach Bryan Featuring Kacey Musgraves | 5 | 1 | 5
Last Night | Morgan Wallen | 8 | 1 | 35
Vampire | Olivia Rodrigo | 7 | 1 | 13
Fukumean | Gunna | 9 | 4 | 15
Calm Down | Rema & Selena Gomez | 11 | 3 | 56
Dance The Night | Dua Lipa | 10 | 6 | 18
Barbie World | Nicki Minaj & Ice Spice With Aqua | 12 | 7 | 14
Slime You Out | Drake Featuring SZA | 1 | 1 | 2
Religiously | Bailey Zimmerman | 14 | 13 | 21
Sarah’s Place | Zach Bryan Featuring Noah Kahan | | 14 | 1
Flowers | Miley Cyrus | 15 | 1 | 37
Bad Idea Right? | Olivia Rodrigo | 13 | 7 | 7
Thinkin’ Bout Me | Morgan Wallen | 17 | 9 | 30
Agora Hills | Doja Cat | | 18 | 1
All My Life | Lil Durk Featuring J. Cole | 16 | 2 | 20
Need A Favor | Jelly Roll | 22 | 14 | 26
Anti-Hero | Taylor Swift | 26 | 1 | 49
Used To Be Young | Miley Cyrus | 23 | 8 | 5
Rich Men North Of Richmond | Oliver Anthony Music | 20 | 1 | 7
Greedy | Tate McRae | 33 | 24 | 2
Kill Bill | SZA | 27 | 1 | 42
Boys Of Faith | Zach Bryan Featuring Bon Iver | | 26 | 1
Dial Drunk | Noah Kahan With Post Malone | 34 | 25 | 15
What Was I Made For? | Billie Eilish | 29 | 14 | 11
Watermelon Moonshine | Lainey Wilson | 35 | 29 | 14
Creepin’ | Metro Boomin, The Weeknd & 21 Savage | 32 | 3 | 43
Karma | Taylor Swift Featuring Ice Spice | 38 | 2 | 29
What It Is (Block Boy) | Doechii Featuring Kodak Black | 43 | 32 | 21
Great Gatsby | Rod Wave | 30 | 30 | 2
Get Him Back! | Olivia Rodrigo | 21 | 11 | 3
I Know ? | Travis Scott | 45 | 11 | 9
Good Good | Usher, Summer Walker & 21 Savage | 57 | 36 | 7
Daylight | David Kushner | 49 | 37 | 24
Peaches & Eggplants | Young Nudy Featuring 21 Savage | 42 | 33 | 17
Try That In A Small Town | Jason Aldean | 47 | 1 | 11
Lady Gaga | Peso Pluma, Gabito Ballesteros & Junior H | 37 | 35 | 14
Qlona | Karol G & Peso Pluma | 44 | 28 | 7
Meltdown | Travis Scott Featuring Drake | 46 | 3 | 9
Love You Anyway | Luke Combs | 41 | 15 | 33
Bongos | Cardi B & Megan Thee Stallion | 31 | 14 | 3
Deep Satin | Zach Bryan | | 45 | 1
Boyz Don’t Cry | Rod Wave | 25 | 25 | 2
Save Me | Jelly Roll With Lainey Wilson | 58 | 47 | 15
Come See Me | Rod Wave | 19 | 19 | 4
Single Soon | Selena Gomez | 54 | 19 | 5
Call Your Friends | Rod Wave | 18 | 18 | 6
Turks & Caicos | Rod Wave Featuring 21 Savage | 24 | 24 | 2
Hey Driver | Zach Bryan Featuring The War And Treaty | 50 | 14 | 5
Seven | Jung Kook Featuring Latto | 53 | 1 | 11
Nine Ball | Zach Bryan | | 54 | 1
El Jefe | Shakira X Fuerza Regida | | 55 | 1
All-American Bitch | Olivia Rodrigo | 36 | 13 | 3
White Horse | Chris Stapleton | 68 | 31 | 10
Mi Ex Tenia Razon | Karol G | 64 | 22 | 7
LaLa | Myke Towers | 69 | 43 | 12
500lbs | Lil Tecca | | 60 | 1
Tourniquet | Zach Bryan | 60 | 20 | 5
One More Time | Blink-182 | | 62 | 1
Strangers | Kenya Grace | 88 | 63 | 2
The Grudge | Olivia Rodrigo | 52 | 16 | 3
Un Preview | Bad Bunny | | 65 | 1
Pain, Sweet, Pain | Zach Bryan | | 66 | 1
Lose Control | Teddy Swims | 67 | 67 | 7
SkeeYee | Sexyy Red | 74 | 66 | 4
Everything I Love | Morgan Wallen | 77 | 14 | 31
Popular | The Weeknd, Playboi Carti & Madonna | 72 | 43 | 17
HG4 | Rod Wave | 51 | 51 | 2
El Amor de Su Vida | Grupo Frontera & Grupo Firme | 92 | 72 | 6
Truck Bed | HARDY | 82 | 55 | 15
Lil Boo Thang | Paul Russell | 99 | 74 | 2
Long Journey | Rod Wave | 39 | 39 | 2
My Love Mine All Mine | Mitski | | 76 | 1
Telekinesis | Travis Scott Featuring SZA & Future | 78 | 26 | 9
Tulum | Peso Pluma & Grupo Frontera | 76 | 43 | 13
Sabor Fresa | Fuerza Regida | 84 | 26 | 14
Spotless | Zach Bryan Featuring The Lumineers | 70 | 17 | 5
Girl In Mine | Parmalee | 95 | 81 | 9
Deli | Ice Spice | 81 | 41 | 10
Segun Quien | Maluma & Carin Leon | | 83 | 1
Lacy | Olivia Rodrigo | 59 | 23 | 3
Oh U Went | Young Thug Featuring Drake | 89 | 19 | 14
Nostalgia | Rod Wave & Wet | 40 | 40 | 2
Johnny Dang | That Mexican OT, Paul Wall & DRODi | 91 | 65 | 11
HVN On Earth | Lil Tecca & Kodak Black | | 88 | 1
Bipolar | Peso Pluma x Jasiel Nunez x Junior H | 90 | 60 | 3
In Your Love | Tyler Childers | 85 | 43 | 9
Crazy | Rod Wave | 48 | 48 | 2
Demons | Doja Cat | | 46 | 2
Making The Bed | Olivia Rodrigo | 62 | 19 | 3
Logical | Olivia Rodrigo | 63 | 20 | 3
East Side Of Sorrow | Zach Bryan | 75 | 18 | 5
Standing Room Only | Tim McGraw | | 61 | 4
Checkmate | Rod Wave | 55 | 55 | 2
Can’t Have Mine | Dylan Scott | | 98 | 1
On My Mama | Victoria Monet | | 98 | 2
Love Is Embarrassing | Olivia Rodrigo | 65 | 25 | 3