33. Web scraping
[status: mostly-complete-needs-polishing-and-proofreading]
33.1. Motivation, prerequisites, plan
The web is full of information, and we usually browse it visually with a web browser. But when we collect a scientific data set from the web we do not want a “human in the loop”: we want an automatic program to collect the data, so that our results are reproducible and our procedure is fast and automatic.
Although my focus here is mainly on scientific applications, web scraping can also be used to mirror a web site.
Prerequisites
The 10-hour “serious programming” course.
The “Data files and first plots” mini-course in Section 2.
You should install the program wget:
$ sudo apt install wget
Plan
Our plan is to find some interesting data sets on the web.
In our first approach, in Section 33.3, we will download them to our disk using the command line program wget and plot them with gnuplot. Then in Section 33.4 we will show how to retrieve data from within a Python program. Finally, in Section 33.5, we will scratch the surface of the amazing scientific data sets that can be found on the web.
We will try to look at both time history and image data. Time histories are data sets where we look at an interesting quantity as it changes in time.
Examples of time histories include temperature as a function of time (in fact, all sorts of weather and climate data) and stock market prices as a function of time.
Examples of image data include telescope images of the sky and satellite imagery of the earth and of the sun.
33.2. What does a web page look like underneath? (HTML)
Before introducing students to the staples of a web page, remember: not everyone knows what HTML is, and few people have actually seen it. So we introduce HTML (hypertext markup language) by example first, and then point out what “hypertext” and “markup” mean.
I type up a quick HTML page, and the students watch on the projector and type their own. The page I put up is a simple hello page at first; then I add a link.
<html>
  <head>
    <title>A simple web page</title>
  </head>
  <body>
    <h1>Mark's web page</h1>
    <p>This is Mark's web page</p>
    <p>Now a paragraph with some <i>text in italics</i>
      and some <b>text in boldface</b>
    </p>
  </body>
</html>
Save this to a file called, for example, myinfo.html in your home directory, and then view it by pointing a web browser to file:///home/MYLOGINNAME/myinfo.html (yes, there are three slashes in the file URL file:///...).
That simple web page lets me explain what I mean by markup: bits of text like <p> and <i> and <head> are not text in the document: they specify how the document should be rendered (for example <b> and <i> specify how the text should look, and <p> breaks the text into paragraphs). Some of the tags don’t affect the text at all, but tell us how the document should be understood (for example the metadata tags <html> and <title>).
Then let’s add a hyperlink: a link to the student’s school. My HTML page now looks like this:
<html>
  <head>
    <title>A simple web page</title>
  </head>
  <body>
    <h1>Mark's web page</h1>
    <p>This is Mark's web page</p>
    <p>Now a paragraph with some <i>text in italics</i>
      and some <b>text in boldface</b>
    </p>
    <p>Mark went to high school at
      <a href="http://liceoparini.gov.it/">Liceo Parini</a>
    </p>
  </body>
</html>
Then save and reload the page in your browser.
Here I’ve introduced the hyperlink. In HTML this is made up of an element called <a> (anchor), which has an attribute called href containing the URL of the hyperlink.
Now that we know what web pages look like, we can write programs that pick them apart. If we want to find the links in a web page we can use the Python string find() method to look for <a and then for </a>, and use the text in between the two.
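That brute-force approach can be sketched in a few lines. The sample_html string here is made up for illustration, and note that looking for the literal text <a would also match tags like <abbr>, so this is only a rough sketch of the idea:

```python
## A minimal sketch of finding hyperlinks with the string find()
## method.  The sample_html text is made up for illustration.
sample_html = ('<p>Mark went to '
               '<a href="http://liceoparini.gov.it/">Liceo Parini</a></p>')

def find_links(html):
    """Return each <a ...>...</a> chunk found in the html string."""
    links = []
    pos = 0
    while True:
        start = html.find('<a', pos)       # where the anchor opens
        if start == -1:
            break
        end = html.find('</a>', start)     # where it closes
        if end == -1:
            break
        links.append(html[start:end + len('</a>')])
        pos = end + len('</a>')            # keep looking past this link
    return links

print(find_links(sample_html))
```

This is exactly the kind of fragile parsing that Beautiful Soup (Section 33.6) does properly.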
33.3. Command line scraping with wget
In Section 8.1 we had our first glimpse
of the command wget
, a wonderful program which grabs a page from
the web and puts the result into a file on your disk. This type of
program is sometimes called a “web crawler” or “offline browser”.
wget can even follow links up to a certain depth and reproduce the web hierarchy on a local disk.
In areas with poor network connectivity, people can use wget during a brief moment of good networking: they download everything they need in a hurry, then point their browser at the data on their local disk.
33.3.1. First download with wget
Let us make a directory in which to work and start getting data.
$ mkdir scraping
$ cd scraping
$ wget https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv
We now have a file called drinks.csv. How do we explore it? I would first use simple file tools:
less drinks.csv
shows lines like this:
country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
Afghanistan,0,0,0,0.0
Albania,89,132,54,4.9
Algeria,25,0,14,0.7
Andorra,245,138,312,12.4
Angola,217,57,45,5.9
## ...
If you like to see data in a spreadsheet, you could open it with libreoffice or gnumeric:
libreoffice drinks.csv
33.3.2. Simple analysis of the drinks.csv file
Sometimes you can learn quite a bit about what’s in a file with simple shell tools, without using a plotting program or writing a data analysis program. I will show you some things you can do with one-line shell commands.
Looking at drinks.csv we see that the fourth column is the number of wine servings per capita drunk in that country. Let us use the command sort to order the file by wine consumption.
A quick look at the sort documentation with man sort shows us that the -t option makes the field separator a comma instead of white space. We also find that the -k option specifies which field to use as the sort key, and that -g sorts numerically (including floating point). Put these together and try running:
sort -t , -k 4 -g drinks.csv
This will show you all those countries in order of increasing wine consumption, rather than in alphabetical order. To see just the last 15 lines you can run:
sort -t , -k 4 -g drinks.csv | tail -15
This is a great opportunity to laugh at the confirmation of some stereotypes and the negation of others.
If you look at the last few lines you see that the French consume the most wine per capita, followed by the Portuguese.
If you sort by the 5th column you will see the overall use of alcohol; the 3rd column shows the use of spirits (hard liquor), while the 2nd column shows consumption of beer.
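The same ordering can be done in a few lines of Python with the list sort() method and a key function. Here is a sketch using a handful of rows copied from the drinks.csv listing above; a real program would read every line of the file from disk:

```python
## Sort some sample rows from drinks.csv by the wine_servings column
## (the 4th field, index 3), as "sort -t , -k 4 -g" does.  These
## rows are copied from the file listing shown earlier.
rows = [
    'Afghanistan,0,0,0,0.0',
    'Albania,89,132,54,4.9',
    'Algeria,25,0,14,0.7',
    'Andorra,245,138,312,12.4',
    'Angola,217,57,45,5.9',
]
## the key function splits each line at the commas and converts the
## wine servings field to a number so the sort is numeric
rows.sort(key=lambda line: float(line.split(',')[3]))
for line in rows:
    print(line)
```

Just as with the shell command, Andorra ends up at the bottom of this sample with 312 wine servings.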
33.3.3. Looking at birth data
FiveThirtyEight also offers a data set of US births from 2000 to 2014. Let us download it, convert its carriage returns to newlines with tr, and plot the births column (the 5th) with gnuplot:
$ wget https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv
$ tr '\r' '\n' < US_births_2000-2014_SSA.csv > births_2000-2014_SSA-newline.csv
$ gnuplot
gnuplot> set datafile separator ","
gnuplot> plot 'births_2000-2014_SSA-newline.csv' using 5 with lines
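The tr step can also be done in Python by reading the whole file and replacing the carriage returns yourself. This sketch demonstrates the idea on a small made-up file rather than the real births file:

```python
## Sketch of the "tr '\r' '\n'" step done in Python: read the whole
## file, replace carriage returns with newlines, write it back out.
def convert_cr_to_newline(infname, outfname):
    # newline='' disables Python's automatic newline translation, so
    # we see the raw '\r' characters the way tr would
    with open(infname, newline='') as f:
        text = f.read()
    with open(outfname, 'w', newline='') as f:
        f.write(text.replace('\r', '\n'))

## demonstrate it on a tiny made-up file
with open('sample_cr.csv', 'w', newline='') as f:
    f.write('year,births\r2000,100\r2001,120\r')
convert_cr_to_newline('sample_cr.csv', 'sample_newline.csv')
print(open('sample_newline.csv', newline='').read())
```

In Section 33.4.2 we will sidestep this issue differently, by splitting the downloaded text on whitespace.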
33.4. Scraping from a Python program
33.4.1. Brief interlude on string manipulation
Before we scrape from Python, let us play with the string split() method in the interpreter:
$ python3
>>> s = 'now is the time for all good folk to come to the aid of the party'
>>> s.split()
['now', 'is', 'the', 'time', 'for', 'all', 'good', 'folk', 'to', 'come', 'to', 'the', 'aid', 'of', 'the', 'party']
# now we've seen what that looks like, save it into a variable
>>> words = s.split()
>>> words
['now', 'is', 'the', 'time', 'for', 'all', 'good', 'folk', 'to', 'come', 'to', 'the', 'aid', 'of', 'the', 'party']
>>>
# now try to split where the separator is a comma
>>> csv_str = 'name,age,height'
>>> words = csv_str.split()
>>> words
['name,age,height']
# didn't work; try telling split() to use a comma
>>> words = csv_str.split(',')
>>> words
['name', 'age', 'height']
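Once split() has cut a line into strings, we usually want numbers, which is what int() inside a list comprehension gives us. The line below is made up, but follows the year,month,date_of_month,day_of_week,births format of the birth data file used in the next program:

```python
## a made-up line in the format of the birth data file:
## year,month,date_of_month,day_of_week,births
line = '2000,1,1,6,9000'
words = line.split(',')          # a list of five strings
values = [int(w) for w in words]  # the same five fields as integers
print(values)
```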
33.4.2. The birth data from Python
#! /usr/bin/env python3

import urllib.request

day_map = {1: 'mon', 2: 'tue', 3: 'wed', 4: 'thu', 5: 'fri',
           6: 'sat', 7: 'sun'}

def main():
    f = urllib.request.urlopen('https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv')
    ## this file has carriage returns instead of newlines, so
    ## f.readlines() won't work in all cases.  I read the whole
    ## file in, and then split it into lines
    entire_file = f.read()
    f.close()
    lines = entire_file.split()
    print('lines:', lines[:3])
    dataset = []
    for line in lines[1:]:
        # print('line:', line, str(line))
        line = line.decode('utf-8')
        words = line.split(',')
        # print(words)
        values = [int(w) for w in words]
        dataset.append(values)
    day_of_week_hist = process_dataset(dataset)
    print_histogram(day_of_week_hist)

def process_dataset(dataset):
    ## NOTE: the fields are:
    ## year,month,date_of_month,day_of_week,births
    print('dataset has %d lines' % len(dataset))
    ## now we form a histogram of births according to the day of the
    ## week
    day_of_week_hist = {}
    for i in range(1, 8):
        day_of_week_hist[i] = 0
    for row in dataset:
        day_of_week = row[3]
        n_births = row[4]
        day_of_week_hist[day_of_week] += n_births
    return day_of_week_hist

def print_histogram(hist):
    print(hist)
    keys = list(hist.keys())
    keys.sort()
    print('keys:', keys)
    for day in keys:
        print(day, day_map[day], hist[day])

main()
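To compare the days more easily, you could convert the histogram into percentages of all births. This sketch uses made-up counts in place of the real totals the program prints:

```python
## Convert a day-of-week histogram (shaped like the one built by
## process_dataset() above) into percentages.  These counts are
## made up for illustration.
day_map = {1: 'mon', 2: 'tue', 3: 'wed', 4: 'thu', 5: 'fri',
           6: 'sat', 7: 'sun'}
hist = {1: 160, 2: 180, 3: 175, 4: 178, 5: 177, 6: 70, 7: 60}
total = sum(hist.values())
for day in sorted(hist):
    pct = 100.0 * hist[day] / total
    print('%s: %5.1f%%' % (day_map[day], pct))
```

With the real data you would see the weekend days stand out from the weekdays.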
33.5. Finding neat scientific data sets
https://www.dataquest.io/blog/free-datasets-for-projects/ (they mention fivethirtyeight)
https://github.com/fivethirtyeight/data
33.5.1. Time histories
Temperature
Births
wget https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv
33.5.2. Images
NASA nebulae
GOES images of the sun
33.6. Beautiful Soup
Beautiful Soup is a powerful Python package that allows you to scrape web pages in a structured manner. Unlike the code we have seen so far, which does brute-force parsing of HTML text chunks in Python, Beautiful Soup is aware of the “document object model” (DOM).
Start by installing the Python package. You can probably install it with pip, or on Debian-based distributions you can run:
sudo apt install python3-bs4
Now enter the program billboard_hot_100_scraper_2023.py in Listing 33.6.1:
#! /usr/bin/env python3

"""This program was inspired by Jaimes Subroto, who had written a
program that worked with the 2018 billboard html format.  Billboard
changed its html format quite completely in 2023, so this is a
re-implementation that handles the new format.
"""

import urllib.request
from bs4 import BeautifulSoup as soup

def main():
    url = 'https://www.billboard.com/charts/hot-100'
    # url = 'https://web.archive.org/web/20180415100832/https://www.billboard.com/charts/hot-100/'
    # boiler plate stuff to load in an html page from its URL
    url_client = urllib.request.urlopen(url)
    page_html = url_client.read()
    url_client.close()
    # let us save it to a local html file, using utf-8 decoding so
    # that we turn the byte stream into simple text
    open('page_saved.html', 'w').write(page_html.decode('utf-8'))
    # boiler plate use of beautiful soup: use the html parser on the file
    page_soup = soup(page_html, "html.parser")
    # now for the part where you need to know the structure of the
    # html file.  by inspection I found that in 2023 they use <ul>
    # list elements whose class contains "o-chart-results-list-row",
    # so this is how you find those elements in beautiful soup:
    list_elements = page_soup.select('ul[class*=o-chart-results-list-row]')  # *= means contains
    # now that we have our list we are ready to read things in; we
    # also prepare the output csv file
    outfname = 'billboard_hot_100.csv'
    with open(outfname, 'w') as fp:
        headers = 'Song, Artist, Last Week, Peak Position, Weeks on Chart\n'
        fp.write(headers)
        # loop through each list element
        for element in list_elements:
            handle_single_row(element, fp)
    print(f'\nBillboard hot 100 table saved to {outfname}')

def handle_single_row(element, fp):
    all_list_items = element.find_all('li')
    title_and_artist = all_list_items[4]
    # separate out the title and artist: the title should be an
    # <h3> element, the artist a <span> element
    title = title_and_artist.find('h3').text.strip()
    artist = title_and_artist.find('span').text.strip()
    # now the rest of the columns
    last_week = all_list_items[7].text.strip()
    peak_pos = all_list_items[8].text.strip()
    weeks_on_chart = all_list_items[9].text.strip()
    # we have enough to write an entry in the csv file
    csv_line = f'"{title}", "{artist}", {last_week}, {peak_pos}, {weeks_on_chart}'
    print(csv_line)
    fp.write(csv_line + '\n')

if __name__ == '__main__':
    main()
Make the program executable and run it:
$ chmod +x billboard_hot_100_scraper_2023.py
$ ./billboard_hot_100_scraper_2023.py
The results can be seen in the CSV file billboard_hot_100.csv:
Song | Artist | Last Week | Peak Position | Weeks on Chart
---|---|---|---|---
A Bar Song (Tipsy) | Shaboozey | 1 | 1 | 22
I Had Some Help | Post Malone Featuring Morgan Wallen | 2 | 1 | 18
Espresso | Sabrina Carpenter | 3 | 3 | 22
Die With A Smile | Lady Gaga & Bruno Mars | 6 | 3 | 4
Birds Of A Feather | Billie Eilish | 7 | 5 | 17
Taste | Sabrina Carpenter | 5 | 2 | 3
Good Luck, Babe! | Chappell Roan | 8 | 6 | 23
Please Please Please | Sabrina Carpenter | 4 | 1 | 14
Lose Control | Teddy Swims | 9 | 1 | 57
Not Like Us | Kendrick Lamar | 10 | 1 | 19
Million Dollar Baby | Tommy Richman | 11 | 2 | 20
Too Sweet | Hozier | 12 | 1 | 25
Beautiful Things | Benson Boone | 13 | 2 | 34
Ain’t No Love In Oklahoma | Luke Combs | 14 | 13 | 17
Miles On It | Marshmello & Kane Brown | 17 | 15 | 19
Bed Chem | Sabrina Carpenter | 15 | 14 | 3
Lies Lies Lies | Morgan Wallen | 19 | 7 | 10
Hot To Go! | Chappell Roan | 18 | 16 | 15
Austin | Dasha | 21 | 18 | 27
Cowgirls | Morgan Wallen Featuring ERNEST | 16 | 12 | 39
The Emptiness Machine | Linkin Park | 21 | 1 |
Pink Skies | Zach Bryan | 20 | 6 | 16
I Am Not Okay | Jelly Roll | 23 | 23 | 13
Kehlani | Jordan Adetunji | 24 | 24 | 12
Saturn | SZA | 25 | 6 | 29
Like That | Future, Metro Boomin & Kendrick Lamar | 27 | 1 | 25
The Door | Teddy Swims | 36 | 27 | 15
28 | Zach Bryan | 33 | 14 | 10
Pour Me A Drink | Post Malone Featuring Blake Shelton | 26 | 12 | 12
Who | Jimin | 28 | 12 | 8
Good Graces | Sabrina Carpenter | 22 | 15 | 3
Slow It Down | Benson Boone | 34 | 32 | 25
I Can Do It With A Broken Heart | Taylor Swift | 32 | 3 | 21
TGIF | GloRilla | 35 | 28 | 12
Pink Pony Club | Chappell Roan | 30 | 26 | 13
Neva Play | Megan Thee Stallion & RM | 36 | 1 |
Guy For That | Post Malone Featuring Luke Combs | 40 | 17 | 7
Si Antes Te Hubiera Conocido | Karol G | 37 | 32 | 12
Stick Season | Noah Kahan | 39 | 9 | 50
Wanna Be | GloRilla & Megan Thee Stallion | 38 | 11 | 23
Big Dawgs | Hanumankind X Kalmi | 31 | 23 | 7
You Look Like You Love Me | Ella Langley Featuring Riley Green | 41 | 36 | 12
High Road | Koe Wetzel & Jessie Murph | 46 | 22 | 14
Stargazing | Myles Smith | 45 | 40 | 18
Wildflower | Billie Eilish | 49 | 17 | 17
360 | Charli xcx | 47 | 41 | 14
Sailor Song | Gigi Perez | 68 | 47 | 4
Houdini | Eminem | 44 | 2 | 15
Juno | Sabrina Carpenter | 29 | 22 | 3
Chevrolet | Dustin Lynch Featuring Jelly Roll | 56 | 50 | 13
Mamushi | Megan Thee Stallion Featuring Yuki Chiba | 52 | 36 | 11
Guess | Charli xcx Featuring Billie Eilish | 48 | 12 | 6
Red Wine Supernova | Chappell Roan | 51 | 41 | 15
One Of Wun | Gunna | 53 | 26 | 18
Help Me | Real Boston Richey | 58 | 55 | 8
I Love You, I’m Sorry | Gracie Abrams | 67 | 56 | 6
La Patrulla | Peso Pluma & Neton Vega | 70 | 57 | 8
Whiskey Whiskey | Moneybagg Yo Featuring Morgan Wallen | 55 | 21 | 13
Circadian Rhythm | Drake | 69 | 59 | 2
Coincidence | Sabrina Carpenter | 43 | 26 | 3
Losers | Post Malone Featuring Jelly Roll | 57 | 25 | 4
No Face | Drake | 60 | 60 | 2
Sharpest Tool | Sabrina Carpenter | 42 | 21 | 3
Gata Only | FloyyMenor X Cris Mj | 66 | 27 | 26
Lonely Road | mgk & Jelly Roll | 72 | 33 | 7
Nights Like This | The Kid LAROI | 64 | 47 | 12
Apple | Charli xcx | 65 | 51 | 8
The Boy Is Mine | Ariana Grande | 73 | 16 | 19
Think I’m In Love With You | Chris Stapleton | 75 | 49 | 19
Love You, Miss You, Mean It | Luke Bryan | 80 | 70 | 6
Wind Up Missin’ You | Tucker Wetmore | 78 | 63 | 24
Casual | Chappell Roan | 81 | 59 | 12
BAND4BAND | Central Cee & Lil Baby | 71 | 18 | 16
Am I Okay? | Megan Moroney | 85 | 74 | 5
Slim Pickins | Sabrina Carpenter | 50 | 27 | 3
Nel | Fuerza Regida | 82 | 73 | 7
Lunch | Billie Eilish | 79 | 5 | 17
Si No Quieres No | Luis R Conriquez x Neton Vega | 83 | 53 | 19
Belong Together | Mark Ambor | 86 | 74 | 19
It’s Up | Drake, Young Thug & 21 Savage | 74 | 28 | 5
Chihiro | Billie Eilish | 87 | 12 | 17
Beautiful As You | Thomas Rhett | 59 | 59 | 14
Don’t Smile | Sabrina Carpenter | 63 | 35 | 3
Dos Dias | Tito Double P & Peso Pluma | 84 | 1 |
Ruby Rosary | A$AP Rocky Featuring J. Cole | 85 | 1 |
Diet Pepsi | Addison Rae | 86 | 1 |
Dumb & Poetic | Sabrina Carpenter | 62 | 32 | 3
Crazy | LE SSERAFIM | 76 | 76 | 2
U My Everything | Sexyy Red & Drake | 88 | 44 | 16
Prove It | 21 Savage & Summer Walker | 97 | 43 | 10
Disco | Surf Curse | 91 | 1 |
Femininomenon | Chappell Roan | 89 | 66 | 8
Shake Dat Ass (Twerk Song) | BossMan DLow | 93 | 1 |
Nasty | Tinashe | 95 | 61 | 15
Baby I’m Back | The Kid LAROI | 95 | 1 |
Close To You | Gracie Abrams | 96 | 49 | 7
Residuals | Chris Brown | 97 | 2 |
Devil Is A Lie | Tommy Richman | 90 | 32 | 13
Parking Lot | Mustard & Travis Scott | 98 | 57 | 7
American Nights | Zach Bryan | 21 | 9 |