13. Text manipulation

bb|[^b]{2}

13.1. Review

You’ve come a long way already. You know how to use the terminal to start and stop processes. You’re aware of how to operate with files, list, move, copy and view them. You can redirect command output to files. You know how to create chains of commands, feeding the output of one as the input of the other.

13.1.1. Test questions

  1. Give two ways how to view the content of a text file.
  2. How can you link two commands together, so that the second uses the first one as input? How does this relate to the input-output model?
  3. How can you automatize a series of commands?
  4. Name a way to get rid of unwanted command output.
  5. List all files in your home directory ending with ‘.doc’

13.1.2. Introduction

One of the advantages of using a text-based environment is that it’s simple. Much simpler than graphical output. Maybe not for the human in front of the machine, but for the computer. There’s a reason why text-based systems evolved before graphical ones! Text means less data and more structure - it’s very hard for the computer to organize an image and almost impossible to make any sense of its content. With text, this is more easy. The computer still cannot understand the meaning of the text (how you feel when saying “I love you”) but it can process it very efficiently. On an image, you cannot tell the computer “Look for all cups”. But it will easily find all occurences of the word “cups” in any file on your disk.

13.2. Searching

In fact, we have some tools which do exactly that.

grep 'ssh' /etc/services

The grep command searches for a string (ssh) in a file (/etc/services) and prints you all matching lines. The quotes are optional but a good practice. You’ll need them in any case if you’re search text includes whitespaces.

_images/grep_services_ssh.png

This example was very simple, trivial even. Quite handy but we can do more. Much more. Very often you’ll use the following form of the command.

cat /etc/services | grep 'ssh'

Still not extremely interesting in this case but let’s be clear about the implication: You can easily and quickly search for text in any command’s output. Which is way more fast and comfortable than using less or scanning through the output yourself. You need the PID of your bash? No problem:

ps -ef | grep bash

But read on to see why bash is such a popular tool.

13.2.1. Exercises

  1. Give two commands for which searching their output can be useful.
  2. What’s the difference between cat /etc/services | grep 'ssh' and grep 'ssh' /etc/services? Consult the man page of grep to find out.
  3. How can you make grep search for SSH but not ssh (note the case!)?

13.3. Text manipulation tools

In fact, there’s several commands which focus on text processing and text manipulation. They all provide some very basic functionality. In itself, the program won’t do much. But combined with input/outupt redirection and pipelining, they can be very powerful.

Command Description
echo Print some text
cat Output a file
cut Remove sections from a line
tr Change or delete characters
tee Write to the standard output as well as a file
uniq Remove repeated lines
sort Sort text
diff Show the difference between two files. Or two directories.
wc Word count. Counts also lines, bytes and more.

Just for fun, maybe you can appreciate the sequence below.

echo "Hello world" | cut -d ' ' -f 1 | tee /tmp/teaout | tr -c -d '\ne[:upper:]' | cat -n | tr -d ' \t' | tr 1H By | cat - /tmp/teaout | sort -r | cat -n

The man pages of these commands explains the exact meaning of the various options. Some of the tools are mainly useful in scripts (cut, tr, tee) where you cannot (or don’t want to) process a program’s output yourself. Others are very handy to manipulate text files (uniq, sort), also ones not related to commands or program output. Especially sort is quite good to scan through unsorted program output more quickly.

The diff tool reaches very deep into the linux community and is used in many other places (patches, git). Any system changes with time. With diff you get a quick overview over what changed. You can either have a side-by-side view (-y) of the whole document or show the differences (plus context, -C NUM). And you can also see the difference between all files of two directories (-r) or just the summary of what files are different (-q -r).

13.3.1. Exercises

  1. Show the disk usage (du -d 1 -h ~) of your home directory. Use a pipe and sort to sort the output by the size, with the largest items on the bottom.
  2. Write a command which tells you how many files there are in the current directory.
  3. Show the contents of /proc/meminfo on the terminal.
  4. Print the text “Hello world” to the terminal.
  5. How can you control if the character case is taken into account when sorting?

13.4. Regular expressions

What makes grep so powerful? So far you’ve only given a simple keyword to search for. But this section is called Regular expression (or regex for short), so what is this and how is it related to grep?

Consider an example you’ve seen before:

ls /dev/[sh]d[a-z]

The device name follows a certain scheme. The first two letters are either sd or hd. Then comes one additional letter (which identifies the device number). We don’t know yet what files will be there so we want to show any following this scheme. So we formalize the scheme as follows:

  1. Either s or h
  2. d
  3. Any (lowercase) letter from the latin alphabet (a-z)

What’s in between the brackets ([, ]) gives you a choice over characters. You can either define single characters ([ahxY4]) or ranges ([0-9], [a-k]). But you can go further. Let’s talk quantities:

ls /dev/[sh]d[a-z][0-9]*

Now the last part demands any number (0-9). You know the selection with the brackets, but the star (*) was added. Does this look familiar? You’ve seen this one a couple of times and actually know it to be the wildcard. It’s like saying that the number [0-9] may occur an arbitrary number of times. So /dev/sda0, /dev/sda00, /dev/sda012 are all valid. But the wildcard also includes zero occurrences, so /dev/sda is listed also. When you put a wildcard (or another quantity specification) it is applied to the character or character choice given before. Like the range [0-9] in the example.

But in fact, with ls you don’t use regular expressions, they copied just the range and wildcard. Regex goes way further. It’s a language to specify a search term where you know some structure but not everything. It basically goes like this: You put a character (or a list thereof) first, then its quantity. The character is either a letter or a list, as seen above. The quantity can be the wildcard (*, zero or more occurences), a plus (+, at least one occurence), a question mark (?, zero or one occurence) or a specific number of occurrences in curly braces ({4}). If the quantity is one, you don’t put anything.

Note

The wildcard in ls means ‘any characters, an arbitrary number of times’. In regular expression, this is a bit different. There, the same character * means ‘an arbitrary number of times’, but says nothing about the character. You have to put the character in front, like so: a* (‘a’ repeated an arbitrary number of times). So you can write ls *.jpg but not ls | grep '*.jpg. The second command could for example be ls | grep 'jpg'.

hello
h[a-z]llo
hel{2}o
hel?lo
hel+o
help*lo
h[el]*lo

All of the examples above match “hello”, but some of them also match other words. In the last example, the range choice and the quantity was combined, leading to much more interesting words.

Regex Words matched
hello hello
h[a-z]llo hallo, hbllo, hcllo, ...
hel{2}o hello
hel?lo helo, hello
hel+o helo, hello, helllo, hellllo, ...
help*lo hello, helplo, helpplo, helppplo, ...
h[el]*lo hlo, helo, hllo, heelo, hello, hllllllo, hleellellellelelo, ...

Note

In grep, you have to escape some characters. Meaning you have to put a backslash \ in front. This affects the plus sign and the curly braces. So write +, ?, {, } (grep) instead of +, ?, {, } (regex). [1]

Let’s continue a little further. Often you don’t know a single character or searching words where a single character changes. Then, you can just put a dot (.) instead of that character. It basically means “a single arbitrary character”. And I write character on purpose, because it’s not limited to a letter but can also be a digit or a whitespace.

he.lo
h..lo
.ell.
h.*llo

In the last example, the quantifier was again combined with the dot. Meaning the word has to start with ‘h’ and end with ‘llo’, but there can be anything in between.

Regex Words matched
he.lo healo, heblo, heclo, hello, he lo, he?lo, he3lo, ...
h..lo ha3lo, hello, h !lo, h7_lo, ...
.ell. hello, Mello, ?ellX, 4ellB, ...
h.*llo hello, hfoobarllo, hXYZ123llo, ...

OK, so that was a very short introduction into regex. You can even do more and it can become pretty complicated. None the less, I encourage everyone to experiment with the regex and learn it thoroughly. For example by reading the regex-tutorial. It’s an extremely powerful tool and well supported by many terminal applications. For example, you can use the same syntax in less! After you’ve studied the tutorial, also keep the regex-cheat-sheet close.

13.4.1. Exercises

  1. Write down some of the words which would match the following regexes. Make sure your words clarify what the regex does. Explain it to a collegue.

    1. [abc]
    2. The [sun] shines evey day
    3. [xunil]
    4. Fo+ba+r
    5. Fo{2}ba{2}r
    6. Fo?ba?r
    7. L.n.x
    8. L.*n.*x
    9. [0-9] x [1-9]+ - [6238]?3 = 5[89]*3
  2. Write a regular expression which maches all of the listed words. The regex may match more but try to find one which is as close as possible.

    1. Hallo, Hello, Hullo
    2. Halo, Hallo, Halllo, Hallllo, Halllllo
    3. Hallo, Hello, Hullo, Halle, Hull
    4. 99 bottles of beer, 98 bottles of beer, 97 bottles of beer
    5. 90 bottles of beer, 80 bottles of beer, 70 bottles of beer
    6. 99 bottles of beer, 73 bottles of beer, 53 bottles of beer
    7. 1, 12, 123, 1234, 12345, 123456, 1234567
    8. /dev/sda, /dev/sdb, /dev/sdc, /dev/sda1, /dev/sdc8, /dev/sdaX99
    9. The, sun, shines, every, day
  3. Decide if the expressions below are valid regular expressions. If so, explain what it does to a collegue.

    1. foobar
    2. xun+il
    3. *
    4. .
    5. x??
    6. *.jpg

13.5. Summary

Looking for the needle in the haystack isn’t a problem for you anymore - as long as you’re looking for text. You’ve accepted the superiority of the terminal over mouse-driven programs. Using grep comes natural to you. Sorting and manipulating text helps you scanning through files quickly and finding crutial information more easily. You know how to state a search term more generally than just a simple word. You can mix these techniques with everyday commands.

13.5.1. Exercises

  1. What’s the meaning of the dot (.) in regex?
  2. Is it possible to search specific text in any command’s output? How?
  3. Can you always list all of the words, a regular expression can possibly match?
  4. How can you use grep to search for some text in more than one file?
  5. How can you use grep to search for some text in multiple directories?
  6. Why is text search such a powerful tool? Give three situations.

13.6. Cheatsheet

  • grep KEYWORD FILE
[1]Escaping is a general concept in computers, found in many other places as well. It means that special interpretation is removed from a character, if escaped. For example, bash uses the whitespace to find out what arguments you passed to a command: In ls -a -h there’s two arguments, -a and -h, seperated by whitespace. The same goes for ls foo bar, which lists the two files foo and bar (if they exist). Now, what if your file is actually named “foo bar”, so a single file containing a whitespace in its name? Bash doesn’t understand what you mean, gets confused and shows you some error messages. With escaping, you can remove this special meaning (seperation of arguments) from the whitespace character. So, you write ls foo bar instead. It is then interpreted not as an argument boundary but as an actual whitespace character in in the text. We call \ the escaping character. If you want to try this example, a graphical file browser can help you with creating such files. Note that this is a huge issue in many situations, especially programming. If a program doesn’t check its text input for proper escaping, you can make this program interpret the text in a different way than intended. And, for example, run your own code. Read about SQL injection if you want to see an example of this on MySQL databases. Surprisingly, many homepages are vulnerable for this kind of attack and can have their database deleted by an attacker.