5. Favorite shell techniques
5.1. Motivation
You should develop your own collection of shell techniques.
There are so many things you can do by gluing simple UNIX commands together into pipelines, and experienced hackers have their own collection of favorites.
I will present some of mine in this chapter, but remember that you should use these as a starting point for developing your own set.
Jake spent a full hour with Alien, side by side. They began with her core programming tools, customizing Emacs and GDB, her editor and debugger, and then moved on to the operating system environment itself.
Along the way he kept stopping and correcting her, half driving instructor, half drill sergeant.
“Do it right,” Jake ordered when Alien used the left arrow key to move to the beginning of a line of text. “Is that the most efficient way to do it?”
“No…,” she said, without yet knowing how to improve. “Is there a better way?”
“Control-A goes to the beginning of the line,” Jake said. “You don’t have to hold down the arrow key.”
“Okay.” Alien tried the command, and her cursor shot over. “Got it,” she said. And then they continued.
“Back a line,” Jake tested her afterward.
“Forward to line one hundred and twenty-nine.
“Switch windows.
“Save buffers.
“Merge files.
“Compile.
“Good.”
“You’re a slave driver in a poncho,” Alien joked. But she appreciated that Jake’s instructions, especially when taken together, changed the relationship between herself and what was happening on the screen.
The more he pushed her, the faster and more elegantly she moved, until the commands at hand felt hers – extensions of Alien’s body – to be used intuitively, as the challenge at hand demanded.
This was hacking of a kind, not in the sense of breaking into something, but of moving from outsider to insider, user to ace. She was surprised to feel a similar thrill, in its own way as energizing as scaling down an elevator shaft or the Great Dome.
Finally, Alien had the keyboard shortcuts down well enough to dance across anything onscreen in seconds.
Jake looked on approvingly.
“Now you’re ready to work,” he said.
—Jeremy Smith, ‘Breaking and Entering: The Extraordinary Story of a Hacker Called “Alien”’
5.2. What is the shell?
The shell is the command interpreter: a layer above the operating system which allows the user to type commands.
5.2.1. Redirection and pipes
The Bourne Shell /bin/sh was part of the original UNIX system and a wonderful invention, which allowed redirection and pipes. It also had control structures, so you can write extensive programs with the shell. I do not recommend doing so: I feel that shell scripts should be brief.
The classic example of the new things that people can do with the shell is the pipeline. Here is one.
Let us say you want to find all the words in English that have the string “per” in them. You can do this with:
grep per /usr/share/dict/words
Now you want to count how many words there are:
grep per /usr/share/dict/words | wc
(this last command takes the output of grep and uses it as input to the command “wc”, which counts the lines, words, and characters in its input; since each matching word is on its own line, the line count is the number of words – I get approximately 1424 when I run it)
Now let’s say you want to find words that start with the string “per” as a prefix:
grep ^per /usr/share/dict/words
grep ^per /usr/share/dict/words | wc
(I get some 423 words)
Try the same with other strings that appear in many words, like “anti” and “super”.
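As a small variation, grep can do the counting itself: the -c option prints just the number of matching lines, which here is the number of matching words, so these give the same counts as the wc pipelines above:
grep -c per /usr/share/dict/words
grep -c ^per /usr/share/dict/words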
5.2.2. The evolution of the shell, leading up to bash
Then came the C shell /bin/csh, which was a “wrong turn”: it tried to be more C-like in its script syntax, but the language spec was confusing and sometimes ill-defined for scripting.
The shell most commonly used today is probably the Bourne Again Shell /bin/bash. bash can run shell scripts from the traditional /bin/sh, but it adds some very nice features that can make users very fast.
I like to demonstrate using emacs keys to manipulate the command line and search and edit my history. This is a real joy for me in bash.
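For example, here are a few of the standard emacs-style readline bindings that work at the bash prompt (and in many other programs that use the readline library):
- Control-A and Control-E move to the beginning and end of the line.
- Control-K kills (cuts) from the cursor to the end of the line, and Control-Y yanks it back.
- Alt-B and Alt-F move backward and forward one word.
- Control-R incrementally searches backward through your command history; press it again to jump to older matches.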
5.3. A basic set of shell commands you should always know
All of these commands are designed to work well with pipelines. Here I put the “classics”: programs that came with the original UNIX distribution, or soon after. There are many much more modern commands I use in my everyday life, and you will see some of them mentioned in various examples below.
- /bin/sh
The Bourne shell. You should limit your scripts strictly to using its syntax.
- grep
Search for patterns in files.
- sed
The “stream editor”: edits a stream of text in a pipeline.
- awk
awk can do anything in the world. I usually just use it for parsing out columns of data in a stream.
- cat
“Concatenates” one or more files to standard output.
- wc
Counts the lines, words, and characters in a file.
- less
A “file pager”: shows one screenful of text at a time, so you can read it.
- head, tail
Show the first few or last few lines of a file.
- wget
“Offline browser” - grabs URLs in various ways.
- man
Shows the “man page” for a command or a library function. This is a good resource to get a reference on the various command line options, and possibly on detailed semantics. On the other hand they are usually not written pedagogically.
5.4. A cookbook
To start building up your shell “bag of tricks” I will work through a series of examples that have come up for me at various times.
5.4.1. Getting tips on wordle
I mentioned the grep command earlier, and I also briefly mentioned regular expressions. These can be put together quite nicely to help with the popular word-guessing game wordle.
Note that you should not use this if you are trying to play the game; just use it to learn some shell technique. Picking one free/open-source wordle clone “hello wordl” at https://hellowordl.net/ we get something like this:
My first guess of “leary” gives the result shown in the figure. My goal now is to see if I can use the grep command cleverly to help me make my next guess.
First of all, let us find all the 5-letter words:
grep '^.....$' /usr/share/dict/words
This has a lot of extraneous stuff; we can improve with:
grep '^[a-z][a-z][a-z][a-z][a-z]$' /usr/share/dict/words
which will avoid funny accents and proper nouns.
Now I can use the information from my first guess - I need ‘e’, ‘r’, and ‘y’, but not in the positions you see in the figure, since they are colored yellow instead of green. We also can exclude ‘l’ and ‘a’. Start by requiring e, r, y and forbidding l and a entirely:
grep '^[a-z][a-z][a-z][a-z][a-z]$' /usr/share/dict/words | grep e | grep r | grep y | grep -v l | grep -v a | wc
This gives 31 possible words, but we can do better: we can exclude those letters at the positions we know they are not in. For the letter ‘e’, for example, we can use grep -v '.e...', and the command (whose output is listed below) becomes:
grep '^[a-z][a-z][a-z][a-z][a-z]$' /usr/share/dict/words | grep e | grep r | grep y | grep -v l | grep -v a | grep -v '.e...' | grep -v '...r.' | grep -v '....y'
buyer
coyer
dryer
foyer
fryer
greys
hyper
preys
pyres
rhyme
shyer
wryer
These should all be equally likely, but since wordle uses a restricted dictionary we avoid the more obscure ones and pick “fryer”:
We now know that “yer” is the end of the word, so we can add a grep for that:
grep '^[a-z][a-z][a-z][a-z][a-z]$' /usr/share/dict/words | grep e | grep r | grep y | grep -v l | grep -v a | grep -v '.e...' | grep -v '...r.' | grep -v '....y' | grep '..yer' | grep -v 'r....' | grep -v f
buyer
coyer
dryer
shyer
wryer
Three seem possible: buyer, dryer, shyer (the others seem obscure). We try “buyer”, which in this case was the correct guess.
5.4.2. View images with a fast and flexible directory-based image viewer
In the “Serious programming small courses” book I introduced the concept of photo collection management [FIXME: maybe I should bring the photo collection management tools into this book]. Here we simply want an agile program to navigate through some directories and look at the files we find.
The program I like to use for this is geeqie (formerly called gqview). So let us start by installing geeqie:
sudo apt install geeqie
(Question: have you ever gotten sick of apt asking you to confirm that you want to do the installation? Have you found the command-line option -y for apt yet?)
Now run geeqie on a directory where you have a bunch of images. Note that it does a very good job of looking at absolutely huge pictures.
I have configured some settings in geeqie: I turned on automatic zooming to fit the image to the window when I resize it, and I have it show a narrow left pane with thumbnails and a big central pane with the current image. You can also experiment with the keybindings: a lot can be done on the keyboard, and the menus show you how. A useful one is rapid rotation of the image with [ and ].
And if you don’t have a collection of images, let’s grab one! See the upcoming section on grabbing images from Wikimedia Commons.
5.4.3. Grab an entire directory of images from the web
Task: download many of the NASA “astronomy picture of the day” (APOD) images. We will stick to the dark nebulae images with these commands:
mkdir -p Pictures/nebulae
cd !$    # "!$" is bash history expansion for the last argument of the previous command
wget -r -nd -np -nc -q -A jpg https://apod.nasa.gov/apod/dark_nebulae.html
This takes a while (downloading the entire APOD archive would take a very long time), and it may be a while before any .jpg files start arriving. To monitor progress open another window and do
cd Pictures/nebulae
ls -lsat
geeqie . &
and view the images. Once you have enough you can hit control-c
in the terminal with the wget command to kill it.
The options we used are:
- -r tells wget to recursively download links that are found, thus not stopping with that one web page, but rather grabbing a real chunk of that web site.
- -np tells wget to not recurse into parent directories.
- -nc is a noclobber option, telling wget to not overwrite a file that has already been grabbed.
- -nd tells wget to not create the whole directory hierarchy, but rather to just put the pictures in the current directory.
- -A jpg tells wget to only grab .jpg files.
- -q tells wget to run quietly, without printing progress for every file.
5.4.4. Grab an article that’s hiding
(Note: this specific example might not apply in the future, so you might have to find other “walled” articles to demonstrate the technique.)
Let us look at an article on the web: the Missoulian piece about the hacker Alien, whose URL appears in the wget command below. The page did not load for me, complaining about an ad blocker. If you know wget then you can try one thing:
cd /tmp
wget https://missoulian.com/news/local/hacker-cybersecurity-ceo-shares-story-in-breaking-and-entering/article_bade060a-6b07-5be5-b2b7-fb73e446ab55.html --output-document /tmp/article.html
You will see that there is now a file called /tmp/article.html. You can then view it by opening the local file on disk with your browser, going to the URL:
file:///tmp/article.html
or with a text browser like lynx:
lynx /tmp/article.html
or with a graphical browser like firefox:
firefox /tmp/article.html
You should also try links, and the w3m mode in emacs.
The topic of offline browsing is a fascinating one which we already touched upon in Section 5.4.3.
5.4.5. Prepare jpeg files for printing
Task: convert many jpg files to pdf, for example for printing. Make them fill the page, and keep the aspect ratio.
For this we use the program convert from the impressive ImageMagick suite of graphical tools. There seems to be no end to what convert can do.
Let us use the pictures we downloaded from the NASA astronomy picture of the day (APOD) site.
First we have to tell ImageMagick to not worry about safety issues: we are not serving these images on the web. (A quick footnote: there is a security concern here, which is why the ImageMagick team disabled generating PDF files by default; this is unlikely to cause you any problems if you don't make these files public.) Editing the policy file requires root privileges:
sudo sed -i 's/<policy domain="coder" rights="none" pattern="PDF" \/>/<policy domain="coder" rights="read|write" pattern="PDF" \/>/g' /etc/ImageMagick-6/policy.xml
Now we can run:
cd Pictures/nebulae/
for jpg_fname in *.jpg
do
    pdf_fname=`echo $jpg_fname | sed 's/jpg/pdf/'`
    convert $jpg_fname -resize 1240x1750 -compose Copy -gravity center -extent 1240x1750 -units PixelsPerInch -density 150 $pdf_fname
    echo "converted $jpg_fname to $pdf_fname"
done
You can view (and then print) the resulting pdf files with your favorite pdf viewer. For example:
evince *.pdf &
5.4.6. Aliases
Create a file called .bash_aliases in your home directory. Here are a couple of functions to start with:
# find files whose names contain a given string and show a long listing
function fip()
{
find . -iname \*${1}\* -ls;
}
# the same, but just print the paths
function fil()
{
find . -iname \*${1}\* -print;
}
# list recently modified files
function lst()
{
ls -lashtg ${1:-.} | head -13
}
Note that the .bash_aliases file is not invoked by default when you log in, so you should put this command early on in your .bashrc file (which is invoked for every interactive shell):
. $HOME/.bash_aliases
You can now type, for example:
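## "report" and ~/Downloads are just arbitrary examples here
fip report          # long listing of files under . whose names contain "report"
fil report          # the same, but just print the paths
lst ~/Downloads     # the dozen most recently modified entries in ~/Downloads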
5.4.7. Discover what happened to your disk space
du ~                           ## disk usage of every directory under your home
du -h ~                        ## the same, with human-readable sizes (K, M, G)
du -sh ~                       ## just the grand total for your home directory
du ~ | sort -n                 ## sort directories by size, biggest last
sudo du -x / > /tmp/du.out &   ## scan the whole root filesystem in the background
sort -n /tmp/du.out            ## repeatedly while the data accumulates
There is also a graphical program called baobab which probes disk usage. I find the du/sort pipeline more useful, partly because you can then start throwing grep into it.
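For example, here is a small sketch, reusing the /tmp/du.out scan from above, that asks which of the large directories live under a Pictures directory:
sort -n /tmp/du.out | grep Pictures | tail -5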
5.4.8. Splitting audio files
Task: download an audiobook, convert it to mp3, and split it into well-named mp3 files, each of which lasts about 3 minutes.
Let us start with an audiobook that’s on youtube that is licensed in a manner clear enough for us to download it. Jules Verne’s Mysterious Island can be found at: https://www.youtube.com/watch?v=h_SYtFmypmc
Download it with:
yt-dlp --extract-audio --audio-format mp3 --audio-quality 0 'https://www.youtube.com/watch?v=h_SYtFmypmc'
The file we end up with is called something like ‘The Mysterious Island Part 1 by Jules VERNE - AudioBook, Summary BAC, Biography-h_SYtFmypmc.mp3’, which clearly will not do, so we rename it with
mv 'The Mysterious Island Part 1 by Jules VERNE - AudioBook, Summary BAC, Biography-h_SYtFmypmc.mp3' Jules-Verne_The-Mysterious-Island-Part-1.mp3
Note that in a classroom setting we should choose a shorter file that downloads more quickly, such as this: https://www.youtube.com/watch?v=EPhQAphrQe0
yt-dlp --extract-audio --audio-format mp3 --audio-quality 0 'https://www.youtube.com/watch?v=EPhQAphrQe0'
We rename it with
mv 'Ali Baba and the Forty Thieves - Audiobook-EPhQAphrQe0.mp3' Ali-Baba-and-the-Forty-Thieves.mp3
This file is about 54 minutes long; splitting it into 3 MB pieces gives chunks of roughly three minutes each. The procedure is:
cp Ali-Baba-and-the-Forty-Thieves.mp3 /tmp/
cd /tmp
mkdir ali-baba
cd ali-baba
split --suffix-length 3 --additional-suffix=.mp3 -d --bytes 3M ../Ali-Baba-and-the-Forty-Thieves.mp3 ali-baba-
If we list the directory we now find that there are 21 mp3 files:
$ ls -sh
total 63M
3.0M ali-baba-000.mp3 3.0M ali-baba-007.mp3 3.0M ali-baba-014.mp3
3.0M ali-baba-001.mp3 3.0M ali-baba-008.mp3 3.0M ali-baba-015.mp3
3.0M ali-baba-002.mp3 3.0M ali-baba-009.mp3 3.0M ali-baba-016.mp3
3.0M ali-baba-003.mp3 3.0M ali-baba-010.mp3 3.0M ali-baba-017.mp3
3.0M ali-baba-004.mp3 3.0M ali-baba-011.mp3 3.0M ali-baba-018.mp3
3.0M ali-baba-005.mp3 3.0M ali-baba-012.mp3 3.0M ali-baba-019.mp3
3.0M ali-baba-006.mp3 3.0M ali-baba-013.mp3 2.1M ali-baba-020.mp3
$
These are cleanly numbered in increasing order. They can be put on an mp3 player or a cell phone to be played while walking or in a car. If you lose your place you can find it easily within a 3-minute track, while it is much harder to do so in a file that is an hour or ten hours long.
They can also be burned onto a CD:
sudo mp3burn -o "-v speed=2 dev=/dev/cdrom" ali-baba*.mp3
5.4.9. “Just use sed and awk”
I never became an expert at sed and awk, but I did learn a few simple patterns. These are just the simplest things you can do (there is much more), but they are quick to learn and remember.
Use sed (the “stream editor”) to substitute bits of text as it goes by.
For example, a small sketch (notes.txt here stands for any text file you have at hand):
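## replace every occurrence of "colour" with "color" in the stream
cat notes.txt | sed 's/colour/color/g'
## the same substitution, but editing the file in place
sed -i 's/colour/color/g' notes.txt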
I use awk (named after its authors, the legendary Bell Labs computer scientists Aho, Weinberger, and Kernighan) to select columns in a stream of text.
For example, to pull single columns out of another command's output:
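## print just the owner column (column 3) of a long directory listing
ls -l | awk '{print $3}'
## print the size and name columns
ls -l | awk '{print $5, $9}'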
5.5. A smattering of regular expressions
The subject of regular expressions is a vast one. I am not an expert, but even as a non-expert I keep a few “up my sleeve” for use in the shell.
This section is incomplete, but it should start with matching the start and end of a line with ^ and $. Then it should mention .* to match runs of anything. Then it should include at least a couple of complex matches and a couple of replacements with the \1 type of mechanism. [FIXME: complete this section]
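In the meantime, here are small sketches of the pieces just mentioned:
## ^ and $ anchor the start and end of a line; .* matches any run of characters:
## find words that start with "per" and end with "ing"
grep '^per.*ing$' /usr/share/dict/words
## \(...\) captures part of the match and \1, \2 refer back to it in the replacement:
## turn "lastname, firstname" into "firstname lastname"
echo 'Kernighan, Brian' | sed 's/\(.*\), \(.*\)/\2 \1/'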
5.6. Longer pipelines
Here are a couple of examples of longer pipelines.
5.6.1. Asking questions about a text file
Sociologist and demographer Nancy Howell collected data on the noted Dobe !Kung tribe of the Kalahari desert, sometimes known as the “Bushmen”. The 538 blog has a collection of data sets, including her age, height, and weight data for the !Kung.
Download the Howell file with data from the bushmen:
wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
The top of the file looks like:
$ head Howell1.csv
"height";"weight";"age";"male"
151.765;47.8256065;63;1
139.7;36.4858065;63;0
136.525;31.864838;65;0
156.845;53.0419145;41;1
145.415;41.276872;51;0
163.83;62.992589;35;1
149.225;38.2434755;32;0
168.91;55.4799715;27;1
147.955;34.869885;19;0
Can I look at that file a bit better?
cat Howell1.csv | sed 's/;/ /g'
cat Howell1.csv | sed 's/;/ /g' | less
# How many lines?
cat Howell1.csv | wc -l
# How many people?
cat Howell1.csv | grep -v height | wc -l
# Who are the tallest 5 people?
cat Howell1.csv | grep -v height | sed 's/;/ /g' | sort -n -k 1 | tail -5
# Who are the oldest 5 people?
cat Howell1.csv | grep -v height | sed 's/;/ /g' | sort -n -k 3 | tail -5
# How many men?
cat Howell1.csv | grep '1$' | wc -l
# How many women?
cat Howell1.csv | grep '0$' | wc -l
# What is the average age?
cat Howell1.csv | grep -v height | sed 's/;/ /g' | awk '{sum+=$3} END {print "AVG =",sum/NR}'
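As one more variation on the same pipeline (a sketch), awk can also filter rows, so we can ask for the average height of just the men:
cat Howell1.csv | grep -v height | sed 's/;/ /g' | awk '$4 == 1 {sum+=$1; n++} END {print "AVG male height =", sum/n}'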
5.6.2. Anatomy of a web scraping pipeline
The idea is to build the pipeline one stage at a time, looking at the output after each refinement: dump the page as text, keep only the numbered lines, then filter out the link references, until only the ranked names remain.
wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump
wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | less
wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | grep '. ' | grep '^ *[0-9]'
wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | grep '. ' | grep '^ *[0-9]' | grep -v http
wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | grep '. ' | grep '^ *[0-9]' | grep -v http | grep -v file: | grep -v about:
wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | grep '. ' | grep '^ *[0-9]' | grep -v http | grep -v file: | grep -v about: | grep '[0-9]\.'
NAMES=`wget -O - https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 | lynx -stdin --dump | grep '. ' | grep '^ *[0-9]' | grep -v http | grep -v file: | grep -v about: | grep '[0-9]\.'`
echo $NAMES
## now you can go to town on this
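For example, a small sketch of “going to town” (the exact results depend on how the page is laid out when you grab it; note the double quotes around $NAMES, which preserve the line breaks):
echo "$NAMES" | wc -l     ## how many ranked names did we scrape?
echo "$NAMES" | head      ## peek at the first few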
5.7. X window system techniques
OK, this section doesn't really fit in this chapter because it doesn't really involve the shell much. In here I should mention things like the DISPLAY environment variable, ssh, remote execution, and running programs from the command line.
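For example, a minimal sketch of remote execution over X (remote.example.com is a placeholder for a machine you can actually log in to):
ssh -X remote.example.com    ## log in with X forwarding; ssh sets DISPLAY on the remote side
geeqie ~/Pictures &          ## run a graphical program there; the window appears on your local screen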
5.8. Curl and wget recipes
curl and wget are “offline web browsers”: commands which grab stuff from a web site unattended. You just give the command, go away for a while, and come back to find your data downloaded. No mouse clicking.
Although they do similar stuff, the design of command line options is
different. wget is set up to make easy single commands that mirror
part of a web site hierarchy. curl is set up so that curl URL
prints that URL to standard output, which makes it very good for shell
pipelines.
5.8.1. Geographical data and JSON format
From https://www.tecmint.com/find-linux-server-geographic-location/ I got a cute way of putting together curl and jq.
What is JSON? Short for JavaScript Object Notation, JSON has taken the world of web programming by storm. The idea is to convert a chunk of data from any language into its javascript representation, pass it around the web, and then unpack it back into another language. Python has JSON libraries, as do most other languages now.
A bit of JSON representing the geographical address of my computer’s IP address, for example, is:
{
"status": "success",
"data": {
"ipv4": "76.18.79.104",
"continent_name": "North America",
"country_name": "United States",
"subdivision_1_name": "New Mexico",
"subdivision_2_name": null,
"city_name": "Albuquerque",
"latitude": "35.14040",
"longitude": "-106.48770"
}
}
Note
This is not actually correct: my address is in Santa Fe, but my current internet service provider runs traffic through Albuquerque, so the automated ways of identifying an IP address's geography are not always accurate.
The trick here is to look at three separate instructions. The first two use web services:
curl -s https://ipinfo.io/ip
will get your IP address. The next:
curl -s https://ipvigilante.com/YOUR_IP_ADDRESS
will make a guess as to where you are located.
Let’s start by putting these two together:
~ $ curl -s https://ipinfo.io/ip
76.18.79.104
~ $ curl -s https://ipvigilante.com/76.18.79.104
{"status":"success","data":{"ipv4":"76.18.79.104","continent_name":"North America","country_name":"United States","subdivision_1_name":"New Mexico","subdivision_2_name":null,"city_name":"Albuquerque","latitude":"35.14040","longitude":"-106.48770"}}~ $
## put those two together
~ $ curl -s https://ipvigilante.com/`curl -s https://ipinfo.io/ip`
{"status":"success","data":{"ipv4":"76.18.79.104","continent_name":"North America","country_name":"United States","subdivision_1_name":"New Mexico","subdivision_2_name":null,"city_name":"Albuquerque","latitude":"35.14040","longitude":"-106.48770"}}
Notice how the result of the ipvigilante.com query is JSON, but it all runs on one line and is hard to read. There is a program jq, which is a JSON filter that can do a few tricks rather easily. If you just run it with no arguments it pretty-prints the JSON code:
~ $ curl -s https://ipvigilante.com/`curl -s https://ipinfo.io/ip` | jq
{
"status": "success",
"data": {
"ipv4": "76.18.79.104",
"continent_name": "North America",
"country_name": "United States",
"subdivision_1_name": "New Mexico",
"subdivision_2_name": null,
"city_name": "Albuquerque",
"latitude": "35.14040",
"longitude": "-106.48770"
}
}
Finally: we might want to just print basic geographical data, not that whole list, so here are options to jq to print what we want. Note that I'm showing here a different way than the backtick ` to substitute a command's output: the $(command args...) approach.
~ $ curl -s https://ipvigilante.com/$(curl -s https://ipinfo.io/ip) | jq '.data.latitude, .data.longitude, .data.city_name, .data.country_name'
"35.14040"
"-106.48770"
"Albuquerque"
"United States"
5.8.2. Pipeline with find and grep
I would like to run grep to find strings in a whole collection of files, let us say all the files under my home directory (recursively) that end in ‘.py’. For example, you might want to find all the python programs you have in which you use a particular function or module. The same trick works on any pile of text files, and since you might not have a big one handy, here is a recipe to download a bunch of text files:
$ rsync -avm --include '*/' --include '*.txt' --exclude '*' --del ftp.ibiblio.org::gutenberg $HOME/gutenberg
This will create a huge number of files in your ~/gutenberg directory. You will need to interrupt it with control-c at some point. Mine is still running, and the command:
$ find ~/gutenberg/ -name '*.txt' | wc
tells me I have 8740 so far.
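And here, finally, is a find-and-grep pipeline of the kind the section title promises (a sketch; "whale" and "import json" are arbitrary search strings):
## which of the Gutenberg text files mention whales?
find ~/gutenberg/ -name '*.txt' -print0 | xargs -0 grep -l whale
## which of my python programs import the json module?
find ~ -name '*.py' -print0 | xargs -0 grep -l 'import json'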