:tocdepth: 2

.. _chap-web-scraping:

==============
Web scraping
==============

[status: mostly-complete-needs-polishing-and-proofreading]

Motivation, prerequisites, plan
===============================

The web is full of information and we often browse it visually with a
browser.  But when we collect a scientific data set from the web we do
not want a "human in the loop": we want an automatic program to
collect that data, so that our results are reproducible and our
procedure is fast and automatic.  Although my focus here is mainly on
scientific applications, web scraping can also be used to mirror a web
site.

.. rubric:: Prerequisites

* The 10-hour "serious programming" course.

* The "Data files and first plots" mini-course in
  :numref:`chap-data-files-and-first-plots`.

* You should install the program wget:

  .. code:: console

     $ sudo apt install wget

.. rubric:: Plan

Our plan is to find some interesting data sets on the web.  In our
first approach, in :numref:`sec-command-line-scraping`, we will
download them to our disk using the command line program ``wget`` and
plot them with gnuplot.  Then in
:numref:`sec-scraping-from-a-python-program` we will show how you can
retrieve data from within your Python program.  Finally in
:numref:`sec-finding-neat-scientific-data-sets` we will scratch the
surface of the amazing scientific data sets that can be found on the
web.

We will try to look at both *time history* and *image* data.  Time
histories are data sets where we look at an interesting quantity as it
changes in time.  Examples of time histories include temperature as a
function of time (in fact, all sorts of weather and climate data) and
stock market prices as a function of time.  Examples of image data
include telescope images of the sky and satellite imagery of the earth
and of the sun.

.. _sec-what-does-a-web-page-look-like-underneath:

What does a web page look like underneath? (HTML)
=================================================

To introduce students to the staples of a web page, remember:

* Not everyone knows what HTML is.

* Few people have seen HTML.

So we introduce HTML (hypertext markup language) by example first, and
then point out what "hypertext" and "markup" mean.  I type up a quick
HTML page, and the students watch on the projector and type their own.
The page I put up is a simple hello page at first, then I add a link.

.. _listing-simple-web-page:

.. code-block:: html
   :caption: A simple web page.

   <html>
     <head>
       <title>A simple web page</title>
     </head>
     <body>
       <h1>Mark's web page</h1>
       <p>
         This is Mark's web page
       </p>
       <p>
         Now a paragraph with some text in <i>italics</i> and some
         text in <b>boldface</b>
       </p>
     </body>
   </html>
Save this to a file called, for example, :file:`myinfo.html` in your
home directory and then view it by pointing a web browser to
``file:///home/MYLOGINNAME/myinfo.html`` (yes, there are three slashes
in the file URL ``file:///...``).

That simple web page lets me explain what I mean by *markup*: bits of
text like ``<p>`` and ``<i>`` and ``<b>`` are not text in the
document: they specify how the document should be rendered (for
example ``<i>`` and ``<b>`` specify how the text should look, and
``<p>`` breaks the text into paragraphs).  Some of the tags don't
affect the text at all, but tell us how the document should be
understood (for example the *metadata* tags ``<head>`` and
``<title>``).

Then let's add a hyperlink: a link to the student's school.  My HTML
page now looks like:

.. _listing-simple-web-page-with-anchor:

.. code-block:: html
   :caption: A simple web page with an anchor (hyperlink) element in it.

   <html>
     <head>
       <title>A simple web page</title>
     </head>
     <body>
       <h1>Mark's web page</h1>
       <p>
         This is Mark's web page
       </p>
       <p>
         Now a paragraph with some text in <i>italics</i> and some
         text in <b>boldface</b>
       </p>
       <p>
         Mark went to high school at
         <a href="https://liceoparini.edu.it/">Liceo Parini</a>
       </p>
     </body>
   </html>

Then save and reload the page in your browser.  Here I've introduced
the *hyperlink*.  In HTML this is made up of an element called ``<a>``
(anchor), whose attribute ``href`` holds the URL of the hyperlink.

So as we write programs that pick apart a web page we now know what
web pages look like.  If we want to find the links in a web page we
can use the Python string ``find()`` method to look for ``<a href=``
and ``</a>`` and to use the text in between the two.
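To make that concrete, here is a minimal sketch (not a robust HTML
parser!) of the ``find()`` approach.  It assumes the HTML is already
in a Python string and that the anchor is written with double quotes
around the URL:

.. code-block:: python

   # link_finder.py - a sketch of brute-force link extraction with
   # str.find(); the sample string echoes the listing above.
   page = ('<p>Mark went to high school at '
           '<a href="https://liceoparini.edu.it/">Liceo Parini</a></p>')

   start = page.find('<a href="')             # start of the anchor tag
   if start != -1:
       url_start = start + len('<a href="')
       url_end = page.find('"', url_start)    # closing quote of the URL
       url = page[url_start:url_end]
       text_start = page.find('>', url_end) + 1
       text_end = page.find('</a>', text_start)
       text = page[text_start:text_end]
       print(url)     # the URL of the hyperlink
       print(text)    # the text in between <a ...> and </a>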
.. _sec-command-line-scraping:

Command line scraping with ``wget``
===================================

In :numref:`sec-population-data-from-the-web` we had our first glimpse
of the command ``wget``, a wonderful program which grabs a page from
the web and puts the result into a file on your disk.  This type of
program is sometimes called a "web crawler" or "offline browser".
wget can even follow links up to a certain depth and reproduce the web
hierarchy on a local disk.  In areas with poor network connectivity
people can use wget when there is a brief moment of good networking:
they download all they need in a hurry, then point their browser to
the data on their local disk.

First download with wget
------------------------

Let us make a directory in which to work and start getting data.

.. code:: console

   $ mkdir scraping
   $ cd scraping
   $ wget https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv

We now have a file called ``drinks.csv`` - how do we explore it?  I
would first use simple file tools: ``less drinks.csv`` shows lines
like this:

::

   country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
   Afghanistan,0,0,0,0.0
   Albania,89,132,54,4.9
   Algeria,25,0,14,0.7
   Andorra,245,138,312,12.4
   Angola,217,57,45,5.9
   ## ...

If you prefer to see data in a spreadsheet you can open the file with
LibreOffice or Gnumeric: ``libreoffice drinks.csv``

Simple analysis of the ``drinks.csv`` file
------------------------------------------

Sometimes you can learn quite a bit about what's in a file with simple
shell tools, without using a plotting program or writing a data
analysis program.  I will show you some things you can do with
one-line shell commands.

Looking at ``drinks.csv`` we see that the fourth column is the number
of wine servings per capita drunk in that country.  Let us use the
command ``sort`` to order the file by wine consumption.  A quick look
at the ``sort`` documentation with ``man sort`` shows us that the
``-t`` option can be used to separate fields with a comma instead of
white space.  We also find out that the ``-k`` option can be used to
specify a key and ``-g`` to sort numerically (including floating
point).  Putting these together, try running:

.. code-block:: console

   $ sort -t , -k 4 -g drinks.csv

This will show you all those countries in order of increasing wine
consumption, rather than in alphabetical order.  To see just the last
15 lines you can run:

.. code-block:: console

   $ sort -t , -k 4 -g drinks.csv | tail -15

This is a great opportunity to laugh at the confirmation of some
stereotypes and the negation of others.  If you look at the last few
lines you see that the French consume the most wine per capita,
followed by the Portuguese.  If you sort by the 5th column you will
see the overall use of alcohol; the 3rd column shows the use of
spirits (hard liquor), while the 2nd column shows consumption of beer.
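The same analysis can also be done in a few lines of Python.  Here is
a minimal sketch (assuming ``drinks.csv`` has been downloaded to the
current directory, as above) which uses Python's ``csv`` module
instead of the shell's ``sort``:

.. code-block:: python

   #!/usr/bin/env python3
   # wine_ranking.py - a sketch of the "sort by wine consumption"
   # analysis, done in Python instead of the shell.
   import csv

   with open('drinks.csv') as f:
       reader = csv.reader(f)
       header = next(reader)       # skip the line with the column names
       rows = list(reader)

   # column index 3 (the fourth column) is wine_servings;
   # convert it to int so that the sort is numeric
   rows.sort(key=lambda row: int(row[3]))

   # the last 15 rows are the heaviest wine drinkers
   for row in rows[-15:]:
       print(row[0], row[3])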
Looking at birth data
---------------------

fivethirtyeight also offers a data set of daily US births from 2000 to
2014.  The file's lines do not end with plain Unix newlines, so we use
``tr`` to convert the carriage return characters into newlines before
plotting the fifth column (the number of births) with gnuplot:

.. code:: console

   $ wget https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv
   $ tr '\r' '\n' < US_births_2000-2014_SSA.csv > births_2000-2014_SSA-newline.csv
   $ gnuplot
   gnuplot> set datafile separator ","
   gnuplot> plot 'births_2000-2014_SSA-newline.csv' using 5 with lines

.. _sec-scraping-from-a-python-program:

Scraping from a Python program
==============================

.. _sec-brief-interlude-on-string-manipulation:

Brief interlude on string manipulation
--------------------------------------

Before we write a scraping program, let us review how Python strings
can be split into fields:

.. code-block:: pycon

   $ python3
   >>> s = 'now is the time for all good folk to come to the aid of the party'
   >>> s.split()
   ['now', 'is', 'the', 'time', 'for', 'all', 'good', 'folk', 'to', 'come', 'to', 'the', 'aid', 'of', 'the', 'party']
   >>> # now we've seen what that looks like, save it into a variable
   >>> words = s.split()
   >>> words
   ['now', 'is', 'the', 'time', 'for', 'all', 'good', 'folk', 'to', 'come', 'to', 'the', 'aid', 'of', 'the', 'party']
   >>> # now try to split where the separator is a comma
   >>> csv_str = 'name,age,height'
   >>> words = csv_str.split()
   >>> words
   ['name,age,height']
   >>> # didn't work; tell split() to use a comma
   >>> words = csv_str.split(',')
   >>> words
   ['name', 'age', 'height']

.. _sec-the-birth-data-from-python:

The birth data from Python
--------------------------

.. literalinclude:: get-birth-data.py
   :language: python
   :caption: get-birth-data.py - A program which downloads birth data.

.. _sec-finding-neat-scientific-data-sets:

Finding neat scientific data sets
=================================

Some starting points for finding data sets on the web:

* https://www.dataquest.io/blog/free-datasets-for-projects/ (they
  mention fivethirtyeight)

* https://github.com/fivethirtyeight/data

Time histories
--------------

* Temperature

* Births:

  .. code-block:: console

     $ wget https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv

Images
------

* NASA nebulae

* GOES images of the sun

Beautiful Soup
==============

Beautiful Soup is a powerful Python package that allows you to scrape
web pages in a *structured* manner.  Unlike the code we have seen so
far, which does brute-force parsing of HTML text chunks in Python,
Beautiful Soup is aware of the "document object model" (DOM).

Start by installing the Python package.  You can probably install it
with pip, or on Debian-based distributions you can run:

.. code-block:: console

   $ sudo apt install python3-bs4

Now enter the program ``billboard_hot_100_scraper_2023.py`` shown in
:numref:`listing-billboard-hot-100-py`:

.. _listing-billboard-hot-100-py:

.. literalinclude:: billboard_hot_100_scraper_2023.py
   :language: python
   :caption: Download the Billboard Hot 100 list using Beautiful Soup.

Make it executable and run it:

.. code-block:: console

   $ chmod +x billboard_hot_100_scraper_2023.py
   $ ./billboard_hot_100_scraper_2023.py

The results can be seen in the CSV file ``billboard_hot_100.csv``:

.. csv-table:: Billboard Hot 100
   :file: billboard_hot_100.csv
   :widths: auto
   :header-rows: 1
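Finally, if you want to experiment with Beautiful Soup on its own
before dissecting the full program, here is a minimal sketch of the
core pattern.  It parses a small HTML string (an illustration, not the
Billboard page) rather than a downloaded page:

.. code-block:: python

   #!/usr/bin/env python3
   # soup_sketch.py - a minimal sketch of the core Beautiful Soup
   # pattern: parse the HTML, then query the document structure.
   from bs4 import BeautifulSoup

   html = """<html><body>
   <h1>Mark's web page</h1>
   <p>Mark went to high school at
   <a href="https://liceoparini.edu.it/">Liceo Parini</a></p>
   </body></html>"""

   soup = BeautifulSoup(html, 'html.parser')

   # instead of hunting for '<a href=' by hand with find(), we ask
   # the parsed document for all of its anchor elements:
   for anchor in soup.find_all('a'):
       print(anchor.get('href'), '->', anchor.get_text())

This is the structured counterpart of the ``find()`` sketch in
:numref:`sec-what-does-a-web-page-look-like-underneath`: Beautiful
Soup parses the whole document, so we never have to count characters
or worry about quoting styles.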