5. Favorite shell techniques

5.1. Motivation

You should develop your own collection of shell techniques.

There are so many things you can do by gluing simple UNIX commands together into pipelines, and experienced hackers have their own collection of favorites.

I will present some of mine in this chapter, but remember that you should use these as a starting point for developing your own set.

Jake spent a full hour with Alien, side by side. They began with her core programming tools, customizing Emacs and GDB, her editor and debugger, and then moved on to the operating system environment itself.

../_images/smith_breaking-and-entering_cvr_hi-res-600.jpg

Along the way he kept stopping and correcting her, half driving instructor, half drill sergeant.

“Do it right,” Jake ordered when Alien used the left arrow key to move to the beginning of a line of text. “Is that the most efficient way to do it?”

“No…,” she said, without yet knowing how to improve. “Is there a better way?”

“Control-A goes to the beginning of the line,” Jake said. “You don’t have to hold down the arrow key.”

“Okay.” Alien tried the command, and her cursor shot over. “Got it,” she said. And then they continued.

“Back a line,” Jake tested her afterward.

“Forward to line one hundred and twenty-nine.

“Switch windows.

“Save buffers.

“Merge files.

“Compile.

“Good.”

“You’re a slave driver in a poncho,” Alien joked. But she appreciated that Jake’s instructions, especially when taken together, changed the relationship between herself and what was happening on the screen.

The more he pushed her, the faster and more elegantly she moved, until the commands at hand felt hers – extensions of Alien’s body – to be used intuitively, as the challenge at hand demanded.

This was hacking of a kind, not in the sense of breaking into something, but of moving from outsider to insider, user to ace. She was surprised to feel a similar thrill, in its own way as energizing as scaling down an elevator shaft or the Great Dome.

Finally, Alien had the keyboard shortcuts down well enough to dance across anything onscreen in seconds.

Jake looked on approvingly.

“Now you’re ready to work,” he said.

—Jeremy Smith ‘Breaking and Entering: the Extraordinary Story of a Hacker Called “alien”’

5.2. What is the shell?

The shell is the command interpreter: a layer above the operating system which allows the user to type commands.

5.2.1. Redirection and pipes

The Bourne Shell /bin/sh was part of the original UNIX system and a wonderful invention, which allowed redirection and pipes. It also had control structures, so you can write extensive programs with the shell. I do not recommend doing so: I feel that shell scripts should be brief.

The classic example of new things that people can do with the shell is pipelines. Here’s an example.

Let us say you want to find all the words in English that have the string “per” in them. You can do this with:

grep per /usr/share/dict/words

Now you want to count how many words there are:

grep per /usr/share/dict/words | wc

(this last command takes the output of grep and uses it as input to the command “wc” which counts the lines, words and chars in its input – I get approximately 1424 words when I run it)

Now let’s say you want to find words that start with the string “per” as a prefix:

grep '^per' /usr/share/dict/words
grep '^per' /usr/share/dict/words | wc

(I get some 423 words)

Try the same with other strings that appear in many words, like “anti” and “super”.
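
If you want to check your understanding on something smaller than the system dictionary, here is a minimal sketch of the same two pipelines run on a tiny made-up word list. The six words and the temporary file are assumptions for illustration only. It uses wc -l, which prints just the line count rather than all three numbers.

```shell
# A tiny stand-in for /usr/share/dict/words (made-up sample, one word per line).
words_file=$(mktemp)
printf '%s\n' paper person super upper apple perch > "$words_file"

# Count the words that contain "per" anywhere; wc -l prints only the line count.
grep per "$words_file" | wc -l

# Count the words that start with "per"; the quotes keep the shell away from the ^.
grep '^per' "$words_file" | wc -l
```

On this sample the first pipeline counts 5 words and the second counts 2.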

5.2.2. The evolution of the shell, leading up to bash

Then came the C shell /bin/csh, which was a “wrong turn”: it tried to be more C-like in its script syntax, but the language spec was confusing and sometimes ill-defined for scripting.

The shell most commonly used today is probably the Bourne Again Shell /bin/bash. bash can run shell scripts written for the traditional /bin/sh, but it adds some very nice features that can make users very fast.

I like to demonstrate using emacs keys to manipulate the command line and search and edit my history. This is a real joy for me in bash.

5.3. A basic set of shell commands you should always know

All of these commands are designed to work well with pipelines. Here I put the “classics”: programs that came with the original UNIX distribution, or soon after. There are many much more modern commands I use in my everyday life, and you will see some of them mentioned in various examples below.

/bin/sh

The Bourne shell. You should limit your scripts strictly to using its syntax.

grep

Search for patterns in files.

sed

The “stream editor”: edits a stream of text in a pipeline.

awk

awk can do anything in the world. I usually just use it for parsing out columns of data in a stream.

cat

“Concatenates” one or more files.

wc

Counts the lines, words, and characters in a file.

less

A “file pager”: shows one screenful of text at a time, so you can read it.

head, tail

Show the first few or last few lines of a file.

man

Shows the “man page” for a command or a library function. This is a good resource to get a reference on the various command line options, and possibly on detailed semantics. On the other hand they are usually not written pedagogically.

5.4. A cookbook

To start building up your shell “bag of tricks” I will work through a series of examples that have come up for me at various times.

5.4.1. Getting tips on wordle

I mentioned the grep command earlier, and I also briefly mentioned regular expressions. These can be put together quite nicely to help with the popular word-guessing game wordle.

Note that you should not use this if you are trying to play the game; just use it to learn some shell technique. Picking one free/open-source wordle clone “hello wordl” at https://hellowordl.net/ we get something like this:

../_images/hello-wordl-screenshot.png

My first guess of leary gives:

../_images/hello-wordl-first-guess.png

My goal now is to see if I can use the grep command cleverly to help me make my next guess.

First of all, let us find all the 5-letter words:

grep '^.....$' /usr/share/dict/words

This has a lot of extraneous stuff; we can improve with:

grep '^[a-z][a-z][a-z][a-z][a-z]$' /usr/share/dict/words

which will avoid funny accents and proper nouns.

Now I can use the information from my first guess - I need ‘e’, ‘r’, and ‘y’, but not in the positions you see in the figure, since they are colored yellow instead of green. We also can exclude ‘l’ and ‘a’. Start by requiring e, r, y and forbidding l and a entirely:

grep '^[a-z][a-z][a-z][a-z][a-z]$' /usr/share/dict/words | grep e | grep r | grep y | grep -v l | grep -v a | wc

This gives 31 possible words, but we can do better:

We can also exclude each letter from the position where we know it does not belong. For the letter ‘e’, for example, we can use grep -v '.e...'. The command, with its output (12 words here), becomes:

grep '^[a-z][a-z][a-z][a-z][a-z]$' /usr/share/dict/words | grep e | grep r | grep y | grep -v l | grep -v a | grep -v '.e...' | grep -v '...r.' | grep -v '....y'
buyer
coyer
dryer
foyer
fryer
greys
hyper
preys
pyres
rhyme
shyer
wryer

These should all be equally likely, but avoiding the more obscure ones (wordle uses a restricted dictionary) let us pick “fryer”:

../_images/hello-wordl-second-guess.png

We now know that “yer” is the end of the word, so we can add a grep for that:

grep '^[a-z][a-z][a-z][a-z][a-z]$' /usr/share/dict/words | grep e | grep r | grep y | grep -v l | grep -v a | grep -v '.e...' | grep -v '...r.' | grep -v '....y' | grep '..yer' | grep -v 'r....' | grep -v f
buyer
coyer
dryer
shyer
wryer

Three seem possible: buyer, dryer, shyer (the others seem obscure). Trying “buyer”:

../_images/hello-wordl-third-guess.png

which in this case was the correct guess.
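
The long chain of greps can be wrapped up in functions so it is easier to re-run and adjust. What follows is only a sketch: the function names five_letter and candidates are made up for this example, the sample word list is invented, and the filters encode only what the first guess told us (‘e’, ‘r’ and ‘y’ present but misplaced, ‘l’ and ‘a’ absent). It also uses the interval \{5\} as a shorter way of writing [a-z] five times.

```shell
# Hypothetical helper: keep only plain lowercase five-letter words from file $1.
five_letter() {
    grep '^[a-z]\{5\}$' "$1"
}

# Hypothetical helper: apply what the first guess "leary" told us.
candidates() {
    five_letter "$1" |
        grep e | grep r | grep y |   # letters that must appear somewhere
        grep -v l | grep -v a |      # letters that must not appear at all
        grep -v '^.e' |              # e is not the second letter
        grep -v '^...r' |            # r is not the fourth letter
        grep -v 'y$'                 # y is not the last letter
}

# Made-up sample list, one word per line.
sample=$(mktemp)
printf '%s\n' buyer early Fryer every hyper query relay rhyme > "$sample"
candidates "$sample"                 # prints buyer, hyper, rhyme
```

To try it on the real dictionary, run candidates /usr/share/dict/words.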

5.4.2. View images with a fast and flexible directory-based image viewer

In the “Serious programming small courses” book I introduced the concept of photo collection management [FIXME: maybe I should use photo collection management tools to this book]. Here we want to simply have an agile program to navigate through some directories to find files.

The program I like to use for this is geeqie (formerly called gqview). So let us start by installing geeqie:

sudo apt install geeqie

(question: have you ever gotten sick of apt asking you to confirm that you want to do the installation? have you found out the command line option -y for apt yet?)

Now run geeqie on a directory where you have a bunch of images. Note that it does a very good job of looking at absolutely huge pictures.

I have configured some settings in geeqie: I turned on automatic zooming to fit the image to the window when I resize it, and I have it show a narrow left pane with thumbnails and a big central pane with the current image. You can also experiment with the keybindings: a lot can be done on the keyboard, and the menus show you how. A useful one is rapid rotation of the image with [ and ].

And if you don’t have a collection of images, let’s grab one! See the next section, on grabbing an entire directory of images from the web.

5.4.3. Grab an entire directory of images from the web

Task: download many of the NASA “astronomy picture of the day” (APOD) images. We will stick to the dark nebulae images with this command:

mkdir -p Pictures/nebulae
cd !$
wget -r -nd -np -nc -q -A jpg https://apod.nasa.gov/apod/dark_nebulae.html

This takes a while, but it is much quicker than starting from the entire APOD archive, which would run for a very long time before any .jpg files start arriving. To monitor progress, open another window and do

cd Pictures/nebulae
ls -lsat
geeqie . &

and view the images. Once you have enough you can hit control-c in the terminal with the wget command to kill it.

The options we used are as follows. -r tells wget to recursively download the links it finds, so it does not stop at that one web page but grabs a real chunk of the web site. -nd tells wget to not create the whole directory hierarchy, but to just put the pictures in the current directory. -np tells wget to not recurse into parent directories. -nc is a no-clobber option, telling wget to not overwrite a file that has already been grabbed. -q tells wget to run quietly, without printing progress messages. -A jpg tells wget to only grab .jpg files.
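
The same command can be spelled with wget’s long option names, which some people find easier to read back months later. This is a sketch wrapped in a function whose name, grab_jpgs, is made up for this example; it takes the starting URL as its only argument.

```shell
# Hypothetical wrapper: fetch every .jpg reachable from the page at URL $1
# into the current directory. Each long option matches one of the short
# ones used above: -r, -nd, -np, -nc, -q and -A jpg, in that order.
grab_jpgs() {
    wget --recursive \
         --no-directories \
         --no-parent \
         --no-clobber \
         --quiet \
         --accept jpg \
         "$1"
}
```

It would be called as grab_jpgs https://apod.nasa.gov/apod/dark_nebulae.html from inside the directory where the pictures should land.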

5.4.4. Grab an article that’s hiding

(Note: this specific example might not apply in the future, so you might have to find other “walled” articles to demonstrate the technique.)

This page did not load for me, complaining about an ad blocker. If you know wget then you can try one thing:

cd /tmp
wget https://missoulian.com/news/local/hacker-cybersecurity-ceo-shares-story-in-breaking-and-entering/article_bade060a-6b07-5be5-b2b7-fb73e446ab55.html --output-document /tmp/article.html

and then view it by opening the local file on disk with your browser by going to the url:

file:///tmp/article.html

or with a text browser like lynx:

lynx /tmp/article.html

or with a graphical browser like firefox:

firefox /tmp/article.html

You should also try links, and the w3m mode in emacs.

The topic of offline browsing is a fascinating one which we already touched upon in Section 5.4.3.

5.4.5. Prepare jpeg files for printing

Task: convert many jpg files to pdf, for example for printing. Make them fill the page, and keep the aspect ratio.

For this we use the program convert from the impressive ImageMagick suite of graphical tools. There seems to be no end to what convert can do.

Let us use the pictures we downloaded from the NASA astronomy picture of the day (APOD) site.

First we have to tell the convert command to not worry about safety issues: we are not serving these images on the web.

sudo sed -i 's/<policy domain="coder" rights="none" pattern="PDF" \/>/<policy domain="coder" rights="read|write" pattern="PDF" \/>/g' /etc/ImageMagick-6/policy.xml

Now we can run:

cd Pictures/nebulae/
for jpg_fname in *.jpg
do
    # strip the trailing .jpg and append .pdf
    pdf_fname="${jpg_fname%.jpg}.pdf"
    # quote the names so files with spaces in them survive
    convert "$jpg_fname" -resize 1240x1750 -compose Copy -gravity center -extent 1240x1750 -units PixelsPerInch -density 150 "$pdf_fname"
    echo "converted $jpg_fname to $pdf_fname"
done
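
One thing to watch for in loops like this is how the output name is derived. A plain sed 's/jpg/pdf/' replaces the first “jpg” it finds, which is not necessarily the extension. Here is a small sketch of the difference, using an invented file name and two made-up helper functions:

```shell
# Replace the first "jpg" anywhere in the name (what a plain sed substitution does).
name_with_sed() {
    printf '%s\n' "$1" | sed 's/jpg/pdf/'
}

# Strip a trailing ".jpg" with parameter expansion, then append ".pdf".
name_with_suffix() {
    printf '%s\n' "${1%.jpg}.pdf"
}

name_with_sed "jpg-test-nebula.jpg"      # prints pdf-test-nebula.jpg
name_with_suffix "jpg-test-nebula.jpg"   # prints jpg-test-nebula.pdf
```

The ${name%.jpg} form is standard /bin/sh syntax, so it does not need bash.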

You can view (and then print) the resulting pdf files with your favorite pdf viewer. For example:

evince *.pdf &

5.4.6. Aliases

Create a file called .bash_aliases in your home directory. Despite its name, the file can hold shell functions as well as aliases. Here are a few functions to start with:

# find files whose names contain a given string and list them with details
function fip()
{
    find . -iname "*${1}*" -ls;
}
# the same search, but print only the file names
function fil()
{
    find . -iname "*${1}*" -print;
}
# list recently modified files
function lst()
{
    ls -lashtg "${1:-.}" | head -13
}

5.4.7. Find out what happened to your disk space

du ~
du -h ~
du -sh ~
du ~ | sort -n
sudo du -x / > /tmp/du.out &
sort -n /tmp/du.out ## repeatedly while the data accumulates

There is also a graphical program called baobab which probes disk usage. I find the du/sort pipeline more useful, partly because you can then start throwing grep into it.
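
A minimal sketch of that du/sort pipeline wrapped in a function (the name biggest is made up here): it prints the N largest directories under a starting point, with sizes in kilobytes and the biggest last.

```shell
# Hypothetical helper: biggest DIR N
# Print the N largest directories under DIR (defaults: current directory, 10 lines).
# Errors from unreadable directories are thrown away.
biggest() {
    du -k "${1:-.}" 2>/dev/null | sort -n | tail -n "${2:-10}"
}
```

For example, biggest ~ 20 shows the twenty largest directories under your home directory, and biggest ~ 200 | grep -i pictures narrows the list to paths that mention pictures.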

5.4.8. Splitting audio files

Task: download an audiobook, convert it to mp3, and split it into well-named mp3 files, each of which lasts about 3 minutes.

Let us start with an audiobook that’s on youtube that is licensed in a manner clear enough for us to download it. Jules Verne’s Mysterious Island can be found at: https://www.youtube.com/watch?v=h_SYtFmypmc

Download it with:

yt-dlp --extract-audio --audio-format mp3 --audio-quality 0 'https://www.youtube.com/watch?v=h_SYtFmypmc'

The file we end up with is called something like The Mysterious Island Part 1 by Jules VERNE - AudioBook, Summary BAC, Biography-h_SYtFmypmc.mp3 which clearly will not do, so we rename it with

mv 'The Mysterious Island Part 1 by Jules VERNE - AudioBook, Summary BAC, Biography-h_SYtFmypmc.mp3' Jules-Verne_The-Mysterious-Island-Part-1.mp3

Note that in a classroom setting we should choose a shorter file that downloads more quickly, such as this: https://www.youtube.com/watch?v=EPhQAphrQe0

yt-dlp --extract-audio --audio-format mp3 --audio-quality 0 'https://www.youtube.com/watch?v=EPhQAphrQe0'

We rename it with

mv 'Ali Baba and the Forty Thieves - Audiobook-EPhQAphrQe0.mp3' Ali-Baba-and-the-Forty-Thieves.mp3

This file is about 54 minutes long, so tracks of about 3 minutes mean roughly 18 parts. We approximate that by splitting the file into 3-megabyte pieces. The procedure is:

cp Ali-Baba-and-the-Forty-Thieves.mp3 /tmp/
cd /tmp
mkdir ali-baba
cd ali-baba
split --suffix-length 3 --additional-suffix=.mp3 -d --bytes 3M ../Ali-Baba-and-the-Forty-Thieves.mp3 ali-baba-

If we list the directory we find that the 3-megabyte pieces actually give us 21 mp3 files:

$ ls -sh
total 63M
3.0M ali-baba-000.mp3  3.0M ali-baba-007.mp3  3.0M ali-baba-014.mp3
3.0M ali-baba-001.mp3  3.0M ali-baba-008.mp3  3.0M ali-baba-015.mp3
3.0M ali-baba-002.mp3  3.0M ali-baba-009.mp3  3.0M ali-baba-016.mp3
3.0M ali-baba-003.mp3  3.0M ali-baba-010.mp3  3.0M ali-baba-017.mp3
3.0M ali-baba-004.mp3  3.0M ali-baba-011.mp3  3.0M ali-baba-018.mp3
3.0M ali-baba-005.mp3  3.0M ali-baba-012.mp3  3.0M ali-baba-019.mp3
3.0M ali-baba-006.mp3  3.0M ali-baba-013.mp3  2.1M ali-baba-020.mp3
$

These are cleanly numbered in increasing order. They can be put on an mp3 player or a cell phone to be played while walking or in a car. If you lose your place you can find it easily in a 3-minute track, while it is harder to do so in a file that is an hour or ten hours long.

They can also be burned onto a CD:

sudo mp3burn -o "-v speed=2 dev=/dev/cdrom" ali-baba*.mp3
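
The 3M chunk size used with split above is a guess at what three minutes of audio weighs. If you want a specific number of parts, the size can be computed instead. Here is a sketch using shell arithmetic; the function name chunk_bytes is made up, and the byte count in the example is a hypothetical file size of exactly 63 megabytes, not a measurement of the real file.

```shell
# Hypothetical helper: chunk_bytes SIZE_BYTES TOTAL_MINUTES MINUTES_PER_PART
# Integer arithmetic: bytes per part = size * minutes_per_part / total_minutes.
chunk_bytes() {
    echo $(( $1 * $3 / $2 ))
}

chunk_bytes 66060288 54 3   # prints 3670016, which is 3.5 megabytes per part
```

The result can be handed to split as --bytes 3670016. This only lines up with the clock if the mp3 has a roughly constant bit rate.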

5.4.9. “Just use sed and awk”

I never became an expert at sed and awk, but I did learn a few simple patterns. These are just the simplest things you can do: there’s so much more, but you can quickly learn to remember these ones.

Use sed (the “stream editor”) to substitute bits of text as it goes by.

FIXME: put example here
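
A minimal sketch of the kind of substitution meant here, on made-up input: s/old/new/ replaces the first match on each line, and adding g replaces every match.

```shell
# Replace only the first "cat" on each line.
printf 'cat and cat\n' | sed 's/cat/dog/'     # prints: dog and cat

# Replace every "cat" on each line (the trailing g means "global").
printf 'cat and cat\n' | sed 's/cat/dog/g'    # prints: dog and dog
```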

I use awk (named after its authors, the legendary Bell Labs computer scientists Aho, Weinberger and Kernighan) to select columns in a stream of text.

FIXME: put example here
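
A minimal sketch of column selection with awk, on made-up input. By default awk splits each line on runs of whitespace and calls the pieces $1, $2, $3 and so on.

```shell
# Print the second column of each line.
printf 'alice 31 paris\nbob 27 rome\n' | awk '{print $2}'
# prints 31, then 27

# Print the third column followed by the first; the comma inserts a space.
printf 'alice 31 paris\nbob 27 rome\n' | awk '{print $3, $1}'
# prints "paris alice", then "rome bob"
```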

5.5. A smattering of regular expressions

The subject of regular expressions is a vast one. I am not an expert, but even as a non-expert I keep a few “up my sleeve” for use in the shell.

This section is incomplete, but it should start with matching start and end of a line with ^ and $. Then it should mention .* to match ranges of anything. Then include at least a couple of complex matches and a couple of replacements with the \1 type of mechanism. [FIXME: complete this section]
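
As a hedged starting point for the items listed above, here are a few small patterns on made-up input: ^ and $ anchoring a match, .* matching any run of characters, and \1 and \2 reusing what the parenthesised groups matched.

```shell
# ^ and $ anchor the pattern: only a line that is exactly "end" matches.
printf 'end\nbend\nends\n' | grep '^end$'                  # prints: end

# .* matches any run of characters: lines that start with a and end with z.
printf 'abcz\nabc\naz\n' | grep '^a.*z$'                   # prints: abcz, then az

# \( \) capture groups; \1 and \2 put them back in a different order.
printf 'Verne, Jules\n' | sed 's/^\(.*\), \(.*\)$/\2 \1/'  # prints: Jules Verne
```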

5.6. Longer pipelines

Here are a couple of examples of longer pipelines.

5.6.1. Asking questions about a text file

Sociologist and demographer Nancy Howell collected data on the noted Dobe !Kung tribe of the Kalahari desert, sometimes known as the “Bushmen”. The 538 blog has a collection of data sets, including her data on age, height, and weight for the !Kung.

Download the Howell file with data from the bushmen:

wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv

The top of the file looks like:

$ head Howell1.csv
"height";"weight";"age";"male"
151.765;47.8256065;63;1
139.7;36.4858065;63;0
136.525;31.864838;65;0
156.845;53.0419145;41;1
145.415;41.276872;51;0
163.83;62.992589;35;1
149.225;38.2434755;32;0
168.91;55.4799715;27;1
147.955;34.869885;19;0

Can I look at that file a bit better?

cat Howell1.csv | sed 's/;/    /g'
cat Howell1.csv | sed 's/;/    /g' | less
# How many lines?
cat Howell1.csv | wc -l
# How many people?
cat Howell1.csv | grep -v height | wc -l
# Who are the tallest 5 people?
cat Howell1.csv | grep -v height | sed 's/;/    /g' | sort -n -k 1 | tail -5
# Who are the oldest 5 people?
cat Howell1.csv | grep -v height | sed 's/;/    /g' | sort -n -k 3 | tail -5
# How many men?
cat Howell1.csv | grep '1$' | wc -l
# How many women?
cat Howell1.csv | grep '0$' | wc -l
# What is the average age?
cat Howell1.csv | grep -v height | sed 's/;/ /g' | awk '{sum+=$3} END {print "AVG =",sum/NR}'
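
awk can also split on the semicolons itself, which removes the need for the sed and grep steps. Here is a sketch of that approach, run on just the nine data rows shown above so the results are easy to check by hand. The function names avg_age and count_by_sex are made up for this example.

```shell
# The first lines of Howell1.csv as shown above, saved to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"height";"weight";"age";"male"
151.765;47.8256065;63;1
139.7;36.4858065;63;0
136.525;31.864838;65;0
156.845;53.0419145;41;1
145.415;41.276872;51;0
163.83;62.992589;35;1
149.225;38.2434755;32;0
168.91;55.4799715;27;1
147.955;34.869885;19;0
EOF

# Hypothetical helper: average of column 3 (age), skipping the header line.
avg_age() {
    awk -F';' 'NR > 1 { sum += $3; n++ } END { print "AVG =", sum / n }' "$1"
}

# Hypothetical helper: count rows by the value in column 4 (1 = male, 0 = female).
count_by_sex() {
    awk -F';' 'NR > 1 { count[$4]++ } END { print count[1], "men,", count[0], "women" }' "$1"
}

avg_age "$sample"        # prints: AVG = 44
count_by_sex "$sample"   # prints: 4 men, 5 women
```

Both functions take a file name, so avg_age Howell1.csv gives the figure for the full data set.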

5.6.2. Anatomy of a web scraping pipeline

wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump
wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | less
wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | grep '. ' | grep '^ *[0-9]'
wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | grep '. ' | grep '^ *[0-9]' | grep -v http
wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | grep '. ' | grep '^ *[0-9]' | grep -v http | grep -v file: | grep -v about:
wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | grep '. ' | grep '^ *[0-9]' | grep -v http | grep -v file: | grep -v about: | grep '[0-9]\.'
NAMES=`wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | grep '. ' | grep '^ *[0-9]' | grep -v http | grep -v file: | grep -v about: | grep '[0-9]\.'`
echo "$NAMES"
## now you can go to town on this
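
The filtering half of that pipeline can be tried without touching the network. The following is a sketch only: the function name numbered_names is made up, the sample text is invented in the general shape of a text dump with a numbered list and a numbered list of links, and a final sed step is added to strip the numbering so that only the names remain.

```shell
# Hypothetical helper: read a text dump on stdin, print just the listed names.
numbered_names() {
    grep '^ *[0-9]' |            # lines that start with a number
        grep -v http |           # drop the numbered list of links
        grep '[0-9]\.' |         # keep "12." style numbering
        sed 's/^ *[0-9]*\. *//'  # strip the number, leaving the name
}

# Made-up sample input (not real page output).
printf '%s\n' \
    'Top names' \
    '   1. Olivia' \
    '   2. Emma' \
    'References' \
    '   1. https://example.com/' |
    numbered_names               # prints Olivia, then Emma
```

In the real pipeline the function would take the place of the chain of greps after lynx -stdin --dump.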

5.7. X window system techniques

OK, this section doesn’t really fit in this chapter because it doesn’t really involve the shell much. In here I should mention things like the DISPLAY variable, ssh, remote execution, and running programs from the command line.

5.8. Curl and wget recipes

curl and wget are “offline web browsers”: commands that grab stuff from a web site unattended. You just give the command, go away for a while, and come back to find your data downloaded. No mouse clicking.

Although they do similar stuff, the design of command line options is different. wget is set up to make easy single commands that mirror part of a web site hierarchy. curl is set up so that curl URL prints that URL to standard output, which makes it very good for shell pipelines.

5.8.1. Geographical data and JSON format

From https://www.tecmint.com/find-linux-server-geographic-location/ I got a cute way of putting together curl and jq.

What is JSON? Short for JavaScript Object Notation, JSON has taken the world of web programming by storm. The idea is to convert a chunk of data from any language into its JavaScript representation, pass it around the web, and then unpack it back into another language. Python has JSON libraries, as do most other languages now.

A bit of JSON representing the geographical address of my computer’s IP address, for example, is:

{
  "status": "success",
  "data": {
    "ipv4": "76.18.79.104",
    "continent_name": "North America",
    "country_name": "United States",
    "subdivision_1_name": "New Mexico",
    "subdivision_2_name": null,
    "city_name": "Albuquerque",
    "latitude": "35.14040",
    "longitude": "-106.48770"
  }
}

Note

This is not actually correct: my address is in Santa Fe, but my current internet service provider runs traffic through Albuquerque, so the automated ways of identifying the IP address’s geography don’t work.

The trick here is to look at three separate instructions. The first two use web services:

curl -s https://ipinfo.io/ip

will get your IP address. The next:

curl -s https://ipvigilante.com/YOUR_IP_ADDRESS

will make a guess as to where you are located.

Let’s start by putting these two together:

~ $ curl -s https://ipinfo.io/ip
76.18.79.104
~ $ curl -s https://ipvigilante.com/76.18.79.104
{"status":"success","data":{"ipv4":"76.18.79.104","continent_name":"North America","country_name":"United States","subdivision_1_name":"New Mexico","subdivision_2_name":null,"city_name":"Albuquerque","latitude":"35.14040","longitude":"-106.48770"}}~ $
## put those two together
~ $ curl -s https://ipvigilante.com/`curl -s https://ipinfo.io/ip`
{"status":"success","data":{"ipv4":"76.18.79.104","continent_name":"North America","country_name":"United States","subdivision_1_name":"New Mexico","subdivision_2_name":null,"city_name":"Albuquerque","latitude":"35.14040","longitude":"-106.48770"}}

Notice how the result of the ipvigilante.com queries is JSON, but it all runs together on one line and is hard to read. There is a program, jq, which is a JSON filter that can do a few tricks rather easily. If you just run it with no arguments it pretty-prints the JSON:

~ $ curl -s https://ipvigilante.com/`curl -s https://ipinfo.io/ip` | jq
{
  "status": "success",
  "data": {
    "ipv4": "76.18.79.104",
    "continent_name": "North America",
    "country_name": "United States",
    "subdivision_1_name": "New Mexico",
    "subdivision_2_name": null,
    "city_name": "Albuquerque",
    "latitude": "35.14040",
    "longitude": "-106.48770"
  }
}

Finally, we might want to print just the basic geographical data, not that whole list, so here are options to jq that print what we want. Note that I’m showing here a different way of substituting a command’s output than the backtick (`): the $(command args...) approach.

~ $ curl -s https://ipvigilante.com/$(curl -s https://ipinfo.io/ip) | jq '.data.latitude, .data.longitude, .data.city_name, .data.country_name'
"35.14040"
"-106.48770"
"Albuquerque"
"United States"

5.8.2. Pipeline with find and grep

I would like to run grep to find strings in a whole collection of files, let us say all the files under my home directory (recursively) that end in ‘.py’. For example, you might want to find all the python programs you have in which

Since you might not have a bunch of .txt files, here are some recipes to download a bunch of text files:

$ rsync -avm --include '*/' --include '*.txt' --exclude '*' --del ftp.ibiblio.org::gutenberg $HOME/gutenberg

This will create a huge number of files in your ~/gutenberg directory. You will need to interrupt it with control-c at some point. Mine is still running, and the command:

$ find ~/gutenberg/ -name '*.txt' | wc

tells me I have 8740.
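
The section title promises a pipeline with find and grep, so here is a minimal sketch of one, under the assumption that the goal is to list the .txt files that contain a given string. The function name files_containing is made up, and the example runs on a small throwaway directory rather than on ~/gutenberg.

```shell
# Hypothetical helper: files_containing DIR STRING
# List the .txt files under DIR whose contents include STRING.
# find hands the matching file names to grep; grep -l prints only the names.
files_containing() {
    find "$1" -name '*.txt' -exec grep -l -- "$2" {} +
}

# Try it on a temporary directory with two small files.
demo_dir=$(mktemp -d)
printf 'Call me Ishmael.\n' > "$demo_dir/moby.txt"
printf 'It was a dark night.\n' > "$demo_dir/other.txt"
files_containing "$demo_dir" Ishmael    # prints the path of moby.txt only
```

On the downloaded collection this would be run as files_containing ~/gutenberg Ishmael, and the output can be piped into wc -l to count the matching files.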