Chapter Eight: Filters and Piping

If you've ever learned a foreign language, you know that the most common approach is to start by building your vocabulary (almost always including the names of the months, for some reason), and then you learn about sentence construction rules. The UNIX command line is a lot like a language. Now you've learned a lot of UNIX words, so it's time to learn how to put them together as sentences using file redirection, filters, and pipes.

Commands to be added to your vocabulary this hour include wc, sort, nl, and uniq. You also learn about the -n flag to the cat command, which forces cat to add line numbers, and how you can use that to help find information within files.

Goals for This Hour

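In this hour, you learn about file redirection, counting words and lines with wc, removing extraneous lines with uniq, sorting information with sort, and numbering lines with cat -n and nl.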

This hour begins by focusing on one aspect of constructing powerful custom commands in UNIX by using file redirection. The introduction of some filters, programs that are intended to be used as part of command pipes, follows. Next you learn another aspect of creating your own UNIX commands using pipelines.

Task 8.1: The Secrets of File Redirection

So far, all the commands you've learned while teaching yourself UNIX have required you to enter information at the command line, and all have produced output on the screen. But, as Gershwin wrote in Porgy and Bess, "it ain't necessarily so." In fact, one of the most powerful features of UNIX is that the input can come from a file as easily as it can come from the keyboard, and the output can be saved to a file as easily as it can be displayed on your screen.

The secret is file redirection, the special commands in UNIX that instruct the computer to read from a file, write to a file, or even append information to an existing file. Each of these acts can be accomplished by placing a file-redirection command in a regular command line: < redirects input, > redirects output, and >> redirects output and appends the information to the existing file. A mnemonic for remembering which is which is to remember that, just as in English, UNIX works from left to right, so a character that points to the left (<) changes the input, whereas a character that points right (>) changes the output.

  1. Log in to your account and create an empty file using the touch command:

    % touch testme

  2. First, use this empty file to learn how to redirect output. Use ls to list the files in your directory, saving them all to the newly created file:

    % ls -l testme
    -rw-rw-r-- 1 taylor 0 Nov 15 09:11 testme

    % ls -l > testme
    % ls -l testme
    -rw-rw-r-- 1 taylor 120 Nov 15 09:12 testme

    Notice that when you redirected the output, nothing was displayed on the screen; there was no visual confirmation that it worked. But it did, as you can see by the increased size of the new file.

  3. Instead of giving cat or more the filename as an argument, try feeding the file to cat with file redirection:

    % cat < testme
    total 127
    drwx------  2 taylor        512 Nov  6 14:20 Archives/
    drwx------  3 taylor        512 Nov 16 21:55 InfoWorld/
    drwx------  2 taylor       1024 Nov 19 14:14 Mail/
    drwx------  2 taylor        512 Oct  6 09:36 News/
    drwx------  3 taylor        512 Nov 11 10:48 OWL/
    drwx------  2 taylor        512 Oct 13 10:45 bin/
    -rw-rw----  1 taylor      57683 Nov 20 20:10 bitnet.lists.Z
    -rw-rw----  1 taylor      46195 Nov 20 06:19 drop.text.hqx
    -rw-rw----  1 taylor      12556 Nov 16 09:49 keylime.pie
    drwx------  2 taylor        512 Oct 13 10:45 src/
    drwxrwx---  2 taylor        512 Nov  8 22:20 temp/
    -rw-rw----  1 taylor          0 Nov 20 20:21 testme
    

    The results are the same as if you had used the ls command, but the output file is saved, too. You now can easily print the file or go back to it later to compare the way it looks with the way your files look in the future.

  4. Use the ls command to add some further information at the bottom of the testme file, by using >>, the append double-arrow notation:

    % ls -FC >> testme

    Recall that the -C flag to ls forces the system to list output in multicolumn mode. Try redirecting the output of ls -F to a file to see what happens without the -C flag.

  5. It's time for a real-life example. You've finished learning UNIX, and your colleagues now consider you an expert. One afternoon, Shala tells you she has a file in her directory, but she isn't sure what it is. She wants to know what it is, but she can't figure out how to get to it. You try the file command, and UNIX tells you the file is data. You are a bit puzzled. But then you remember file redirection:

    % cat -v < mystery.file > visible.mystery.file

    This command has cat -v take its input from the file mystery.file and save its output in visible.mystery.file. All the nonprinting characters are transformed, and Shala can poke through the file at her leisure.

    Find a file on your system that file reports as a data file, and try using the redirection commands to create a version with all characters printable through the use of cat -v.

There is an infinite number of ways that you can combine the various forms of file redirection to create custom commands and to process files in various ways. This hour has really just scratched the surface. Next, you learn about some popular UNIX filters and how they can be combined with file redirection to create new versions of existing files. Also, study the example about Shala's file, which shows the basic steps in all UNIX file-redirection operations: Specify the input to the command, specify the command, and specify where the output should go.
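The three redirection symbols can be tried together in one short session. What follows is a minimal sketch, assuming a Bourne-style shell such as sh (the $( ) notation does not work at the C shell's % prompt) and a scratch directory made with mktemp; the filenames notes.txt and copy.txt are invented for the illustration.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# > creates the output file (or overwrites it if it already exists).
echo "first line" > notes.txt

# >> appends to the end of the existing file.
echo "second line" >> notes.txt

# Input, command, output: < supplies the input and > captures the result.
cat < notes.txt > copy.txt

# Using > a second time replaces the old contents rather than adding to them.
echo "only line" > notes.txt

cat copy.txt
cat notes.txt
```

Notice that copy.txt keeps both lines, whereas notes.txt ends up with just one, because the final > started the file over from scratch.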

Task 8.2: Counting Words and Lines Using wc

Writers generally talk about the length of their work in terms of number of words, rather than number of pages. In fact, most magazines and newspapers are laid out according to formulas based on multiplying an average-length word by the number of words in an article.

These people are obsessed with counting the words in their articles, but how do they do it? You can bet they don't count each word themselves. If they're using UNIX, they simply use the UNIX wc program, which computes a word count for the file. It also can indicate the number of characters (which ls -l indicates, too) and the number of lines in the file.

  1. Start by counting the lines, words, and characters in the testme file you created earlier in this hour:

    % wc testme
           4      12     121
           
    % wc < testme
           4      12     121
    
    % cat testme | wc
          4      12     121
    

    All three of these commands offer the same result (which probably seems a bit cryptic now). Why do you need to have three ways of doing the same thing? Later, you learn why this is so helpful. For now, stick to using the first form of the command.

    The output is three numbers, which reveal how many lines, words, and characters, respectively, are in the file. You can see that there are 4 lines, 12 words, and 121 characters in testme.

  2. You can have wc list any one of these counts, or a combination of two, by using different command flags: -w counts words, -c counts characters, and -l counts lines:

    % wc -w testme
       12 testme
       
    % wc -l testme
       4 testme
    
    % wc -wl testme
          12       4 testme
    
    % wc -lw testme
           4      12 testme
    

  3. Now the fun begins. Here's an easy way to find out how many files you have in your home directory:

    % ls | wc -l
    37

    The ls command lists each file, one per line (ls does this on its own when its output goes to a pipe rather than to the screen, as long as you don't use the -C flag). The output of that command is fed to wc, which counts the number of lines it's fed. The result is that you can find out how many files you have (37) in your home directory.

  4. How about a quick gauge of how many users are on the system?

    % who | wc -l
    12

  5. How many accounts are on your computer?

    % cat /etc/passwd | wc -l
    3877

The wc command is a great example of how the simplest of commands, when combined in a sophisticated pipeline, can be very powerful.
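One reason it helps to have several ways to hand data to wc is that the redirected and piped forms never see a filename, so they print only the number. Here is a rough sketch of that idea, assuming a Bourne-style shell such as sh rather than the C shell; the file sample.txt and its contents are made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a small three-line sample file.
printf 'one two three\nfour five\nsix\n' > sample.txt

# Three ways to give wc the same data. Only the first form knows the
# filename, so only it prints the name after the count.
wc -l sample.txt
wc -l < sample.txt
cat sample.txt | wc -l

# Capture the individual counts. tr removes the padding that some
# versions of wc put in front of the number.
lines=$(wc -l < sample.txt | tr -d ' ')
words=$(wc -w < sample.txt | tr -d ' ')
chars=$(wc -c < sample.txt | tr -d ' ')
echo "$lines lines, $words words, $chars characters"
```

Because the redirected form prints a bare number, it is the easier one to save in a variable or pass along to another command.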

Task 8.3: Removing Extraneous Lines Using uniq

Sometimes when you're looking at a file, you'll notice that there are many duplicate entries, either blank lines or, perhaps, lines of repeated information. To clean up these files and shrink their size at the same time, you can use the uniq command, which lists each unique line in the file.

Well, it sort of lists each unique line in the file. What uniq really does is compare each line it reads with the previous line. If the lines are the same, uniq does not list the second line. You can use flags with uniq to get more specific results: -u lists only lines that are not repeated, -d lists only lines that are repeated (the exact opposite of -u), and -c adds a count of how many times each line occurred.

  1. If you use uniq on a file that doesn't have any common lines, uniq has no effect.

    % uniq testme
    Archives/               OWL/                    keylime.pie
    InfoWorld/              bin/                    src/
    Mail/                   bitnet.mailing-lists.Z  temp/
    News/                   drop.text.hqx           testme
    

  2. A trick using the cat command is that cat lists the contents of each file sequentially, even if you specify the same file over and over again, so you can easily build a file with lots of lines:

    % cat testme testme testme > newtest

    Examine newtest to verify that it contains three copies of testme, one after the other. (Try using wc.)

  3. Now you have a file with duplicate lines. Will uniq realize this file has duplicate lines? Use wc to find out:

    % wc newtest
       12   36   363
    
    % uniq newtest | wc
       12   36   363
    

    They're the same. Remember, the uniq command removes duplicate lines only if they're adjacent.

  4. Create a file that has duplicate lines:

    % tail -1 testme > lastline
    
    % cat lastline lastline lastline lastline > newtest2
    
    % cat newtest2
    News/                   drop.text.hqx           testme
    News/                   drop.text.hqx           testme
    News/                   drop.text.hqx           testme
    News/                   drop.text.hqx           testme
    

    Now you can see what uniq does:

    % uniq newtest2
    News/                   drop.text.hqx           testme

  5. Obtain a count of the number of occurrences of each line in the file. The -c flag does that job:

    % uniq -c newtest2
       4 News/                   drop.text.hqx           testme

    This shows that this line occurs four times in the file. Every line in the output gets a count, so a line that is not repeated is prefaced with 1.

  6. You also can see what the -d and -u flags do, and how they have exactly opposite actions:

    % uniq -d newtest2
    News/                   drop.text.hqx           testme
    
    % uniq -u newtest2
    
    %
    

    Why did the -u flag list no output? The answer is that the -u flag tells uniq to list only those lines that are not repeated in the file. Because the only line in the file is repeated four times, there's nothing to display.

Given this example, you probably think uniq is of marginal value, but you will find that it's not uncommon for files to have many blank lines scattered willy-nilly throughout the text. The uniq command is a fast, easy, and powerful way to clean up such files.
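To see the blank-line cleanup in action, here is a small sketch, assuming a Bourne-style shell such as sh; the file gappy.txt and its contents are invented for the purpose.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A sample file with runs of blank lines and one repeated line of text.
printf 'alpha\n\n\n\nbeta\nbeta\n\n\ngamma\n' > gappy.txt

# uniq collapses each run of identical adjacent lines, blank or not,
# down to a single line.
uniq gappy.txt > squeezed.txt
cat squeezed.txt

before=$(wc -l < gappy.txt | tr -d ' ')
after=$(wc -l < squeezed.txt | tr -d ' ')
echo "lines before: $before, after: $after"

# -d lists one copy of each repeated run; -u lists the lines that stand alone.
repeated=$(uniq -d gappy.txt | wc -l | tr -d ' ')
single=$(uniq -u gappy.txt | wc -l | tr -d ' ')
echo "repeated runs: $repeated, unrepeated lines: $single"
```

The two runs of blank lines are not adjacent to each other, so each run survives as one blank line of its own; uniq squeezes the gaps rather than removing them entirely.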

Task 8.4: Sorting Information in a File Using sort

Whereas wc is useful at the end of a pipeline of commands, uniq is a filter, a program that is really designed to be tucked in the middle of a pipeline. Filters, of course, can be placed anywhere in a line, anywhere that enables them to help direct UNIX to do what you want it to do. The common characteristic of all UNIX filters is that they can read input from standard input, process it in some manner, and list the results in standard output. With file redirection, standard input and output also can be files. To do this, you can either specify the filenames to the command (usually input only) or use the file-redirection symbols you learned earlier in this hour (<, >, and >>).


***Begin Just a Minute***

Standard input and standard output are two very common expressions in UNIX. When a program is run, the default location for receiving input is called standard input. The default location for output is standard output. If you are running UNIX from a terminal, standard input and output are your terminal.

There is a third I/O location, standard error. By default, it goes to the same place as standard output, but you can redirect standard error to a different location than standard output. You learn more about I/O redirection later in the book.

***End Just a Minute***
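As a quick preview of the difference between standard output and standard error, here is a hedged sketch. It assumes a Bourne-style shell such as sh, where 2> redirects standard error; the C shell behind the % prompt spells this differently. The filenames are invented, and no-such-file is deliberately missing.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "hello" > present.txt

# cat sends the contents of present.txt to standard output and its
# complaint about the missing file to standard error.
# > catches the first stream, 2> catches the second.
status=0
cat present.txt no-such-file > out.txt 2> err.txt || status=$?

echo "exit status: $status"
echo "standard output captured: $(cat out.txt)"
echo "standard error captured:  $(cat err.txt)"
```

The exact wording of the error message varies from system to system, but it should land in err.txt while the file's contents land in out.txt.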

One of the most useful filters is sort, a program that reads information and sorts it alphabetically. You can customize the behavior of this program, like all UNIX programs, to ignore the case of words (for example, to sort Big between apple and cat, rather than before - most sorts put all uppercase letters before the lowercase letters), and to reverse the order of a sort (z to a). The program sort also enables you to sort lists of numbers.

Few flags are available for sort, but they are powerful, as shown in Table 8.1.

Flag    Function
-b      Ignore leading blanks.
-d      Sort in dictionary order (only letters, digits, and blanks are significant).
-f      Fold uppercase into lowercase; that is, ignore the case of words.
-n      Sort in numerical order.
-r      Reverse order of the sort.

Table 8.1. Flags for the sort command.
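A few of these flags can be compared side by side. The following is a minimal sketch, assuming a Bourne-style shell such as sh; the files words.txt and numbers.txt are invented, and LC_ALL=C is set only so that the ordering does not depend on your system's language settings.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Use the traditional character ordering so results are the same everywhere.
LC_ALL=C
export LC_ALL

printf 'cat\nBig\napple\n' > words.txt
printf '10\n9\n100\n' > numbers.txt

# Default: uppercase sorts ahead of lowercase (Big, apple, cat).
plain=$(sort words.txt | tr '\n' ' ')

# -f folds case, so Big lands between apple and cat.
folded=$(sort -f words.txt | tr '\n' ' ')

# Without -n, digits are compared as text (10, 100, 9).
as_text=$(sort numbers.txt | tr '\n' ' ')

# -n compares by value; adding -r puts the largest first (100, 10, 9).
by_value=$(sort -nr numbers.txt | tr '\n' ' ')

echo "default:    $plain"
echo "with -f:    $folded"
echo "as text:    $as_text"
echo "with -nr:   $by_value"
```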

  1. By default, the ls command sorts the files in a directory in a case-sensitive manner. It first lists those files that begin with uppercase letters and then those that begin with lowercase letters:

    % ls -1F
    Archives/
    InfoWorld/
    Mail/
    News/
    OWL/
    bin/
    bitnet.mailing-lists.Z
    drop.text.hqx
    keylime.pie
    src/
    temp/
    testme
    


    ***Begin Just a Minute***

    To force ls to list output one file per line, you can use the -1 flag (that's the number one, not a lowercase L).

    ***End Just a Minute***

    To sort filenames alphabetically regardless of case, you can use sort -f:

    % ls -1 | sort -f
    Archives/
    bin/
    bitnet.mailing-lists.Z
    drop.text.hqx
    InfoWorld/
    keylime.pie
    Mail/
    News/
    OWL/
    src/
    temp/
    testme
    

  2. How about sorting the lines of a file? You can use the testme file you created earlier:

    % sort < testme
    Archives/               OWL/                    keylime.pie
    InfoWorld/              bin/                    src/
    Mail/                   bitnet.mailing-lists.Z  temp/
    News/                   drop.text.hqx           testme
    

  3. Here's a real-life UNIX example. Of the files in your home directory, which are the largest? The ls -s command indicates the size of each file, in blocks, and sort -n sorts numerically:

    % ls -s | sort -n
    total 127
       1 Archives/
       1 InfoWorld/
       1 Mail/
       1 News/
       1 OWL/
       1 bin/
       1 src/
       1 temp/
       1 testme
      13 keylime.pie
      46 drop.text.hqx
      64 bitnet.mailing-lists.Z
    

    It would be more convenient if the largest files were listed first in the output. That's where the -r flag to reverse the sort order can be useful:

    % ls -s | sort -nr
      64 bitnet.mailing-lists.Z
      46 drop.text.hqx
      13 keylime.pie
       1 testme
       1 temp/
       1 src/
       1 bin/
       1 OWL/
       1 News/
       1 Mail/
       1 InfoWorld/
       1 Archives/
    total 127
    

  4. One more refinement is available to you. Instead of listing all the files, use the head command, and specify that you want to see only the top five entries:

    % ls -s | sort -nr | head -5
      64 bitnet.mailing-lists.Z
      46 drop.text.hqx
      13 keylime.pie
       1 testme
       1 temp/
    

    That's a powerful and complex UNIX command, yet it is composed of simple and easy-to-understand components.

Like many of the filters, sort isn't too exciting by itself. As you explore UNIX further and learn more about how to combine these simple commands to build sophisticated instructions, you will begin to see their true value.
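One combination worth trying pairs sort with uniq. Because uniq compares only adjacent lines, sorting first brings identical lines together so that uniq can find them. This is a rough sketch, assuming a Bourne-style shell such as sh; the file fruit.txt is made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
LC_ALL=C
export LC_ALL

# The same few lines scattered through a file, never twice in a row.
printf 'pear\napple\npear\nfig\napple\npear\n' > fruit.txt

# uniq alone finds no adjacent duplicates, so nothing is removed.
unsorted=$(uniq fruit.txt | wc -l | tr -d ' ')

# Sorting first puts identical lines next to each other.
distinct=$(sort fruit.txt | uniq | wc -l | tr -d ' ')
echo "uniq alone leaves $unsorted lines; sort | uniq leaves $distinct"

# Count each distinct line, then rank by count, most frequent first.
sort fruit.txt | uniq -c | sort -nr

# Keep just the top-ranked line.
top=$(sort fruit.txt | uniq -c | sort -nr | head -1)
echo "most frequent: $top"
```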

Task 8.5: Numbering Lines in Files Using cat -n and nl

It often can be helpful to have a line number listed next to each line of a file. It's quite simple to do with the cat program by specifying the -n flag to number lines in the file displayed.

On many UNIX systems, there's a considerably better command for numbering lines in a file and for many other tasks. The command nl, for number lines, is an AT&T System V command. A system that doesn't have the nl command will complain nl: command not found. If you have this result, experiment with cat -n instead.

  1. Because one of my own systems did not have the nl command, I moved to one that had the nl command for this example. I quickly rebuilt the testme file:

    % ls -l > testme

    To see line numbers now, cat -n will work fine:

    % cat -n testme
         1  total 60
         2  -rw-r--r--  1 taylor   1861 Jun  2  1992 Global.Software
         3  -rw-------  1 taylor  22194 Oct  1  1992 Interactive.Unix
         4  drwx------  4 taylor   4096 Nov 13 11:09 Mail/
         5  drwxr-xr-x  2 taylor   4096 Nov 13 11:09 News/
         6  drwxr-xr-x  2 taylor   4096 Nov 13 11:09 Src/
         7  drwxr-xr-x  2 taylor   4096 Nov 13 11:09 bin/
         8  -rw-r--r--  1 taylor  12445 Sep 17 14:56 history.usenet.Z
         9  -rw-r--r--  1 taylor      0 Nov 20 18:16 testme
    

  2. The alternative, which does exactly the same thing here, is to try nl without any flags:

    % nl testme
         1  total 60
         2  -rw-r--r--  1 taylor   1861 Jun  2  1992 Global.Software
         3  -rw-------  1 taylor  22194 Oct  1  1992 Interactive.Unix
         4  drwx------  4 taylor   4096 Nov 13 11:09 Mail/
         5  drwxr-xr-x  2 taylor   4096 Nov 13 11:09 News/
         6  drwxr-xr-x  2 taylor   4096 Nov 13 11:09 Src/
         7  drwxr-xr-x  2 taylor   4096 Nov 13 11:09 bin/
         8  -rw-r--r--  1 taylor  12445 Sep 17 14:56 history.usenet.Z
         9  -rw-r--r--  1 taylor      0 Nov 20 18:16 testme
    

  3. Notice that both commands can also number lines fed to them via a command pipeline:

    % ls -CF | cat -n
         1  Global.Software     News/           history.usenet.Z
         2  Interactive.Unix    Src/            testme
         3  Mail/               bin/            
    % ls -CF | nl
         1  Global.Software     News/           history.usenet.Z
         2  Interactive.Unix    Src/            testme
         3  Mail/               bin/
    

Like many other UNIX tools, nl and its doppelganger cat -n aren't very thrilling by themselves. As additional members in the set of powerful UNIX tools, however, they can prove tremendously helpful in certain situations. As you soon will see, nl also has some powerful options that can make it a bit more fun.
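One situation where line numbers help is keeping track of where a line came from after other commands have rearranged the output. Here is a small sketch of the idea, assuming a Bourne-style shell such as sh; the file greek.txt is invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'alpha\nbeta\ngamma\ndelta\n' > greek.txt

# Number every line, then keep only the last one: the number shown is
# that line's position in the file.
cat -n greek.txt | tail -1

# The numbers travel with the lines through the rest of the pipeline, so
# after sort -nr reverses the order, each line still shows where it started.
cat -n greek.txt | sort -nr

# tr -s squeezes the padding so the captured text is easy to compare.
last=$(cat -n greek.txt | tail -1 | tr -s ' \t' ' ')
reversed_first=$(cat -n greek.txt | sort -nr | head -1 | tr -s ' \t' ' ')
echo "last line:$last"
echo "first line after reversing:$reversed_first"
```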

Task 8.6: Cool nl Tricks and Capabilities

A program that prefaces each line with a line number isn't much of an addition to the UNIX command toolbox, so the person who wrote the nl program added some further capabilities. With different command flags, nl can either number all lines (by default it numbers only lines that are not blank) or skip line numbering (which means it's an additional way to display the contents of a file). The best option, though, is that nl can selectively number just those lines that contain a specified pattern.


***Begin Just a Minute***

If you don't have the nl command on your system, I'm afraid you're out of luck in this section. Later in the book, you learn other ways to accomplish these tasks. For now, though, if you don't have nl, skip to the next hour and start to learn about the grep command.

***End Just a Minute***

The command flag format for nl is a bit more esoteric than you've seen up to this point. The different approaches to numbering lines with nl are all modifications of the -b flag (for body numbering options). The four flags are -ba, which numbers all lines; -bt, which numbers printable text only; -bn, which results in no numbering; and -bp pattern, for numbering lines that contain the specified pattern.

One final option is to insert a different separator between the line number and the line by telling nl to use -s, the separator flag.

  1. To begin, I'll use a command that you haven't seen before to add a few blank lines to the testme file. The echo command simply writes back to the screen anything specified. Try echo hello.

    % rm testme
    % ls -CF > testme
    % echo "" >> testme
    % echo "" >> testme
    % ls -CF >> testme
    % cat testme
    Global.Software         News/               history.usenet.Z
    Interactive.Unix        Src/                testme
    Mail/                   bin/
    
    
    Global.Software         News/               history.usenet.Z
    Interactive.Unix        Src/                testme
    Mail/                   bin/
    


    ***Begin Just a Minute***

    Parts of UNIX are rather poorly designed, as you have already learned. For example, if you use the echo command without arguments, you get no output. However, if you add an empty argument (a set of quotation marks with nothing between them), echo outputs a blank line. It doesn't make much sense, but it works.

    ***End Just a Minute***

  2. Now watch what happens when nl uses its default settings to number the lines in testme:

    % nl testme
         1  Global.Software     News/             history.usenet.Z
         2  Interactive.Unix    Src/              testme
         3  Mail/               bin/
    
    
         4  Global.Software     News/             history.usenet.Z
         5  Interactive.Unix    Src/              testme
         6  Mail/               bin/
    

    You can accomplish the same thing by specifying nl -bt testme. Try this to verify that your system gives the same results.

  3. It's time to use one of the new two-letter command options to number the lines, including the blank lines:

    % nl -ba testme
         1  Global.Software     News/             history.usenet.Z
         2  Interactive.Unix    Src/              testme
         3  Mail/               bin/
         4
         5
         6  Global.Software     News/             history.usenet.Z
         7  Interactive.Unix    Src/              testme
         8  Mail/               bin/
    

  4. If you glance at the contents of my testme file, you can see that two lines contain the word history. To have nl number just those lines, try the -bp pattern-matching option:

    % nl -bphistory testme
         1  Global.Software     News/             history.usenet.Z
            Interactive.Unix Src/                  testme
            Mail/                bin/
    
    
         2  Global.Software     News/             history.usenet.Z
            Interactive.Unix Src/                  testme
            Mail/                bin/
    

    Notice that numbering the two lines has caused the rest of the lines to fall out of alignment on the display.

  5. This is when the -s, or separator, option comes in handy:

    % nl -bphistory -s: testme
         1:Global.Software      News/             history.usenet.Z
           Interactive.Unix Src/                  testme
           Mail/                bin/
    
         2:Global.Software      News/              history.usenet.Z
           Interactive.Unix Src/                   testme
           Mail/                bin/
    

    In this case, I specified that instead of using a tab, which is the default separator between the number and line, nl should use a colon. As you can see, the output now lines up again.

    Just about anything can be specified as the separator, as sensible or weird as it might be:

    % nl -s', line is: ' testme
         1, line is: Global.Software        News/        history.usenet.Z
         2, line is: Interactive.Unix       Src/         testme
         3, line is: Mail/                  bin/
         4, line is: Global.Software        News/        history.usenet.Z
         5, line is: Interactive.Unix       Src/         testme
         6, line is: Mail/                  bin/
    

    Notice the use of single quotation marks (') in this example. I want to include spaces as part of my separator, so I need to ensure that the program knows this. If I didn't use the quotation marks, nl would use a comma as the separator and then tell me that it couldn't open a file called line or is:.

The nl command demonstrates that there are plenty of variations on simple commands. When you read earlier that you would learn how to number lines in a file, did you think that this many subtleties were involved?
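The numbering styles are easiest to compare on one small file. The following is a minimal sketch, assuming a Bourne-style shell such as sh and a system that has nl; the file sample.txt and its contents are invented.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Four lines, one of them blank, two of them containing the word history.
printf 'history one\nplain\n\nhistory two\n' > sample.txt

# Default (the same as -bt): the blank line is shown but not numbered.
nl sample.txt

# -ba numbers every line, blank or not.
nl -ba sample.txt

# -bp numbers only lines that match the pattern; -s sets the separator.
nl -bphistory -s: sample.txt

# Capture the final line of each variant to compare the numbers it received.
default_last=$(nl sample.txt | tail -1 | tr -s ' \t' ' ')
all_last=$(nl -ba sample.txt | tail -1 | tr -s ' \t' ' ')
pattern_last=$(nl -bphistory -s: sample.txt | tail -1 | tr -s ' \t' ' ')
echo "default:$default_last"
echo "-ba:    $all_last"
echo "-bp:    $pattern_last"
```

The same final line is numbered 3, 4, or 2 depending on which lines nl was told to count.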

Summary

You have learned quite a bit in this hour and are continuing down the road to UNIX expertise. You learned about file redirection and saw a number of examples; you can't go wrong by spending time studying these closely. The concept of using filters and building complex commands by combining simple commands with pipes has been more fully demonstrated here, too. This higher level of UNIX command language is what makes UNIX so powerful and easy to mold.

This hour hasn't skimped on commands, either. It introduced wc for counting lines, words, and characters in a file (or more than one file: try wc * in your home directory). You also learned to use the uniq and sort commands. You learned about using nl for numbering lines in a file - in a variety of ways - and cat -n as an alternative "poor person's" line-numbering strategy. You also were introduced to the echo command.

By the way, the echo command also can tell you about specific environment variables, just as env or printenv does. Try echo $HOME or echo $PATH to see what happens, and compare the output with printenv HOME and printenv PATH.
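The comparison between echo and printenv can be sketched as follows, assuming a Bourne-style shell such as sh and a system that provides printenv. To keep the result predictable, the sketch sets up its own variable, DEMO_VAR, which is a made-up name; you can substitute HOME or PATH when you try it yourself.

```shell
# Create a variable and export it so that other programs can see it.
DEMO_VAR="hello from the environment"
export DEMO_VAR

# With echo, the shell substitutes the value before echo even runs,
# so the name needs a dollar sign.
from_echo=$(echo "$DEMO_VAR")

# printenv looks the name up in the environment itself, so it takes
# the bare name with no dollar sign.
from_printenv=$(printenv DEMO_VAR)

echo "echo says:     $from_echo"
echo "printenv says: $from_printenv"
```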

Workshop

The Workshop summarizes the key terms you learned and poses some questions about the topics presented in this chapter. It also provides you with a preview of what you will learn in the next hour.

Key Terms

file redirection
Most UNIX programs expect to read their input from the user (that is, standard input) and write their output to the screen (standard output). By use of file redirection, however, input can come from a previously created file, and output can be saved to a file instead of being displayed on the screen.

filter
Filters are a particular type of UNIX program that expects to work either with file redirection or as part of a pipeline. These programs read input from standard input, write output to standard output, and often don't have any starting arguments.

standard input
UNIX programs always default to reading information from the user by reading the keyboard and watching what's typed. With file redirection, input can come from a file, and with pipelines, input can be the result of a previous UNIX command.

standard error
This is where programs send their error messages. By default, it goes to the same place as standard output, but you can redirect standard error to a different location than standard output.

standard output
When processing information, UNIX programs default to displaying the output on the screen itself, also known as standard output. With file redirection, output can easily be saved to a file; with pipelines, output can be sent to other programs.

Questions

  1. The placement of file-redirection characters is important to ensure that the command works correctly. Which of the following do you think will work, and why?

    < file wc       wc file <       wc < file      
    cat file | wc       cat < file | wc       wc | cat      

    Now try them and see if you're correct.

  2. The wc command can be used for lots of different tasks. Try to imagine a few that would be interesting and helpful to learn (for example, How many users are on the system right now?). Try them on your system.

  3. Does the file size listed by wc -c always agree with the file size listed by the ls command? With the size indicated by ls -s? If there is any difference, why?

  4. What do you think would happen if you tried to sort a list of words by pretending they're all numbers? Try it with the command ls -1 | sort -n to see what happens. Experiment with the variations.

  5. Do you spell your filenames correctly? Use spell to find out.

Preview of the Next Hour

The next hour introduces wildcards and regular expressions, and tools to use those powerful concepts. You learn how these commands can help you extract data from even the most unwieldy files.

You learn one of the secret UNIX commands for those really in the know, the secret-society, pattern-matching program grep. Better yet, you learn how it got its weird and confusing name! You also learn about the tee command and the curious-but-helpful << file-redirection command.


© 1997 Intuitive Systems & Sequoia Consulting
