I originally created and posted this November 22, 2004.
Here are some examples of using the utilities found on Unix (available on some other platforms also) for manipulating the text in files. awk and perl both allow writing full programs, but I primarily use both as short one-liners, which allows them to be piped to/from other Unix programs. Each of these programs has capabilities that make it better suited than the others in some situations, which I have attempted to demonstrate below. I don't claim any of these to be original to me; references are at the bottom of the page.
I have collected this information over the course of several years, during which time I have used Sun Solaris and various flavors of Linux. Note that the versions of these tools included with Solaris don't entirely match the GNU versions, so some of what you see below may need to be tinkered with to make it work.
The philosophy of Unix utilities is to develop a tool that is very good at doing a specific thing. The output of a tool can be sent to another tool via the pipe (i.e., the | character) as shown in several examples below. So, one program’s output becomes the next program’s input.
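For example, the classic word-frequency pipeline chains several of the tools covered below, each doing one small job and handing its output to the next (somefile.txt is a placeholder):

```shell
# Count word frequencies in a file, one small tool per step.
tr -cs '[:alpha:]' '\n' < somefile.txt |  # split into one word per line
tr '[:upper:]' '[:lower:]' |              # normalize case
sort |                                    # group identical words together
uniq -c |                                 # count each group
sort -rn |                                # most frequent first
head -5                                   # show the top five
```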
Tools covered: awk, cat, csplit, cut, find, fmt, fold, grep, head, join, nl, paste, perl, sdiff, sed, sort, split, tail, uniq, wc
sed, awk, and perl
awk — good for working with files that contain information in columns.
- Display only the first three columns of the file SOMEFILE, using tabs to separate the results:
awk '{print $1 "\t\t" $2 "\t" $3}' SOMEFILE
- Display the first and fifth columns of the password file with a tab between them:
awk -F: '{print $1 "\t" $5}' /etc/passwd
-F: changes the column delimiter from whitespace (the default) to a colon (:)
- Display the second column of the file using double colons as the field separator:
awk -v 'FS=::' '{print $2}' ratings.dat
- Replace the first column with "ORACLE" in SOMEFILE:
awk '{$1 = "ORACLE"; print }' SOMEFILE
- Print the last field of every input line:
awk '{ print $NF }' SOMEFILE
- Print the first 50 characters of each line. If a line has fewer than 50 characters, it is padded with spaces:
awk '{ printf("%-50.50s\n", $0) }' SOMEFILE
- Sum the values in column 1:
awk 'BEGIN{total=0;} {total += $1;} END{print "total is ", total}' SOMEFILE
- Sum the values in columns 1, 2, and 4 in order to calculate precision and recall:
awk -F ',' 'BEGIN{TP=0; FP=0; FN=0} {TP += $1; FP += $2; FN += $4} END{print "precision is ", TP/(FP+TP); print "recall is ", TP/(FN+TP)}' prec-recall-2states.txt
- Sum each row:
awk '{sum=0; for(i=1; i<=NF; i++){sum+=$i}; print sum}' SOMEFILE
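A convenient way to experiment with awk's field handling is to feed it a single line with echo; the sample line below just mimics an /etc/passwd entry:

```shell
# Quick sanity check of -F splitting: $1 is the login, $5 the comment field
echo "root:x:0:0:Super User:/root:/bin/sh" | awk -F: '{print $1 "\t" $5}'
# prints: root	Super User
```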
sed — from the man page:
Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed’s ability to filter text in a pipeline which particularly distinguishes it from other types of editors.
- Double space infile and send the output to outfile:
sed G < infile > outfile
I use the input/output redirection notation shown above. In many, if not all, cases it is fine to leave out the less-than sign, e.g., sed G infile > outfile
- Double space a file which already has blank lines in it. The output file should contain no more than one blank line between lines of text:
sed '/^$/d;G' < infile > outfile
- Triple space a file:
sed 'G;G' < infile > outfile
- Undo double-spacing (assumes even-numbered lines are always blank):
sed 'n;d' < infile > outfile
- Insert a blank line above every line which matches regex ("regex" represents a regular expression):
sed '/regex/{x;p;x;}' < infile > outfile
- Print the line immediately before regex, but not the line containing regex:
sed -n '/regex/{g;1!p;};h' < infile > outfile
- Print the line immediately after regex, but not the line containing regex:
sed -n '/regex/{n;p;}' < infile > outfile
- Insert a blank line below every line which matches regex:
sed '/regex/G' < infile > outfile
- Insert a blank line above and below every line which matches regex:
sed '/regex/{x;p;x;G;}' < infile > outfile
- Convert DOS newlines (CR/LF) to Unix format:
sed 's/^M$//' < infile > outfile   # in bash/tcsh, to get ^M press Ctrl-V then Ctrl-M
- Print only those lines matching the regular expression (similar to grep):
sed -n '/some_word/p' infile
sed '/some_word/!d' infile
- Print those lines that do not match the regular expression (similar to grep -v):
sed -n '/regex/!p' infile
sed '/regex/d' infile
- Skip the first two lines (start at line 3) and then alternate between printing 5 lines and skipping 3 for the entire file:
sed -n '3,${p;n;p;n;p;n;p;n;p;n;n;n;}' < infile > outfile
Notice that there are five p's in the sequence, representing the five lines to print. The three lines to skip between each set of printed lines are represented by the n;n;n; at the end of the sequence.
- Delete trailing whitespace (spaces, tabs) from the end of each line:
sed 's/[ \t]*$//' < infile > outfile
- Substitute (find and replace) foo with bar on each line:
sed 's/foo/bar/' < infile > outfile    # replaces only the 1st instance on a line
sed 's/foo/bar/4' < infile > outfile   # replaces only the 4th instance on a line
sed 's/foo/bar/g' < infile > outfile   # replaces ALL instances on a line
- Replace each occurrence of the hexadecimal character 92 with an apostrophe:
sed "s/\x92/'/g" < old_file.txt > new_file.txt
- Print the section of the file between two regular expressions (inclusive):
sed -n '/regex1/,/regex2/p' < old_file.txt > new_file.txt
- Combine the line containing REGEX with the line that follows it:
sed -e 'N' -e 's/REGEX\n/REGEX/' < old_file.txt > new_file.txt
perl — can do anything sed and awk can do, but not always as easily as shown in the examples above.
- Replace OLDSTRING with NEWSTRING in the file(s) in FILELIST [e.g., file1 file2 or *.txt]:
perl -pi.bak -e 's/OLDSTRING/NEWSTRING/g' FILELIST
The options used are:
- -e — allows a one-line script to be run from the command line
- -i — files are edited in place. In the example above, the .bak extension will be placed on the original files
- -p — causes the script to be placed in a while loop that iterates over the filename arguments
- The full perl program to do the same as the one-liner (without creating backup copies) is:
#!/usr/bin/perl
# perl-example.pl
while (<>)
{
s/OLDSTRING/NEWSTRING/g;
print;
}
run using ./perl-example.pl FILELIST
- Remove the carriage returns that DOS text files require from files on the Unix system:
perl -pi.bak -e 's/\r$//g' FILELIST
Assorted Utilities
Some of the examples below use the following files:

file1:
Tom 123 Main
Dick 4787 West
Harry 98 North
Sue 1035 Cooper

file2:
Tom programmer
Dick lawyer
Harry artist
ga.txt:
The Gettysburg Address
Gettysburg, Pennsylvania
November 19, 1863
Four score and seven years ago our fathers brought forth on this continent,
a new nation, conceived in Liberty, and dedicated to the proposition that
all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that field,
as a final resting place for those who here gave their lives that that nation
might live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we can not dedicate -- we can not consecrate -- we
can not hallow -- this ground. The brave men, living and dead, who struggled
here, have consecrated it, far above our poor power to add or detract. The
world will little note, nor long remember what we say here, but it can never
forget what they did here. It is for us the living, rather, to be dedicated
here to the unfinished work which they who fought here have thus far so
nobly advanced. It is rather for us to be here dedicated to the great task
remaining before us -- that from these honored dead we take increased devotion
to that cause for which they gave the last full measure of devotion -- that we
here highly resolve that these dead shall not have died in vain -- that this
nation, under God, shall have a new birth of freedom -- and that government
of the people, by the people, for the people, shall not perish from the earth.
Source: The Collected Works of Abraham Lincoln, Vol. VII, edited by Roy
P. Basler.
In the examples using these files, the percent sign (%) at the beginning of the line represents the command prompt. Comments of what is happening follow the pound sign (#).
grep — prints the lines of a file that match a search string (string can be a regular expression)
grep -i string some_file               # print the lines containing string regardless of case
grep -v string some_file               # print the lines that don't contain string
grep -E "string1|string2" some_file    # print the lines that contain string1 or string2
find — find has many parameters for restricting what it finds; here I only demonstrate how to use it to recursively search from the current location for files containing the_word.
find . -type f -print | xargs grep the_word 2>/dev/null
find . -type f -exec grep 'the_word' {} \; -print
In the first example, results of the find command are piped to grep; xargs is used to pass the filenames one at a time to grep. The value of STDERR (the errors) is eliminated by using 2>/dev/null. The second example shows how to grep each filename by using a command-line option of find.
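The xargs form above breaks on filenames containing spaces. A sketch of a safer variant, assuming GNU (or BSD) find and xargs, which support NUL-delimited names (the stock Solaris find of that era lacks -print0):

```shell
# -print0 emits NUL-terminated filenames; xargs -0 reads them back,
# so names with spaces or newlines survive the trip through the pipe.
find . -type f -print0 | xargs -0 grep -l the_word 2>/dev/null
```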
Operations on entire files
cat — concatenate files and print on the standard output
% cat -E file2 # display file2, showing $ at end of each line
Tom programmer$
Dick lawyer$
Harry artist$
cat -v somefile # display somefile, showing nonprinting characters using ^ and M- notation, except for LFD and TAB
cat -e somefile # display somefile, combining the effects of -v and -E
nl — Number lines of files
% nl file1
1 Tom 123 Main
2 Dick 4787 West
3 Harry 98 North
4 Sue 1035 Cooper
wc — print the number of bytes, words, and lines in files
% wc -l file1 # print number of lines
4 file1
% wc -w file1 # print number of words
12 file1
% wc -m file1 # print number of characters
60 file1
% wc file1 # print number of lines, words, and characters
4 12 60 file1
Alter the format of a file
fmt — Reformat each paragraph of a file
% fmt -w 50 ga.txt # reformat to 50 characters per line
The Gettysburg Address Gettysburg, Pennsylvania
November 19, 1863
Four score and seven years ago our fathers
brought forth on this continent, a new nation,
conceived in Liberty, and dedicated to the
proposition that all men are created equal.
Now we are engaged in a great civil war, testing
whether that nation, or any nation so conceived
and so dedicated, can long endure. We are met on
a great battle-field of that war. We have come
to dedicate a portion of that field, as a final
resting place for those who here gave their lives
that that nation might live. It is altogether
fitting and proper that we should do this.
But, in a larger sense, we can not dedicate --
we can not consecrate -- we can not hallow --
this ground. The brave men, living and dead, who
struggled here, have consecrated it, far above
our poor power to add or detract. The world will
little note, nor long remember what we say here,
but it can never forget what they did here. It is
for us the living, rather, to be dedicated here
to the unfinished work which they who fought here
have thus far so nobly advanced. It is rather
for us to be here dedicated to the great task
remaining before us -- that from these honored
dead we take increased devotion to that cause for
which they gave the last full measure of devotion
-- that we here highly resolve that these dead
shall not have died in vain -- that this nation,
under God, shall have a new birth of freedom --
and that government of the people, by the people,
for the people, shall not perish from the earth.
Source: The Collected Works of Abraham Lincoln,
Vol. VII, edited by Roy P. Basler.
fold — wrap each input line to fit in specified width
% fold -w 50 ga.txt
The Gettysburg Address
Gettysburg, Pennsylvania
November 19, 1863
Four score and seven years ago our fathers brought
forth on this continent,
a new nation, conceived in Liberty, and dedicated
to the proposition that
all men are created equal.
Now we are engaged in a great civil war, testing w
hether that nation, or any
nation so conceived and so dedicated, can long end
ure. We are met on a great
battle-field of that war. We have come to dedicate
a portion of that field,
as a final resting place for those who here gave t
heir lives that that nation
might live. It is altogether fitting and proper th
at we should do this.
But, in a larger sense, we can not dedicate -- we
can not consecrate -- we
can not hallow -- this ground. The brave men, livi
ng and dead, who struggled
here, have consecrated it, far above our poor powe
r to add or detract. The
world will little note, nor long remember what we
say here, but it can never
forget what they did here. It is for us the living
, rather, to be dedicated
here to the unfinished work which they who fought
here have thus far so
nobly advanced. It is rather for us to be here ded
icated to the great task
remaining before us -- that from these honored dea
d we take increased devotion
to that cause for which they gave the last full me
asure of devotion -- that we
here highly resolve that these dead shall not have
died in vain -- that this
nation, under God, shall have a new birth of freed
om -- and that government
of the people, by the people, for the people, shal
l not perish from the earth.
Source: The Collected Works of Abraham Lincoln, Vo
l. VII, edited by Roy
P. Basler.
Output parts of files
head — Output the first part of files
% head -2 file1 # print the first two lines
Tom 123 Main
Dick 4787 West
tail — Output the last part of files
% tail -2 file1 # display the last 2 lines
Harry 98 North
Sue 1035 Cooper
split — Split a file into pieces (default is 1000 lines each)
split somefile # create files of the form xaa, xab, and so on
split -l 500 somefile # each new file will be at most 500 lines long
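Since split names its pieces in lexical order (xaa, xab, ...), a round trip back to the original is just a cat of the pieces; a small sketch:

```shell
split -l 500 somefile          # produces xaa, xab, xac, ...
cat x?? > somefile.rebuilt     # concatenate the pieces back in order
cmp somefile somefile.rebuilt  # verify the round trip; no output means identical
```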
csplit — split a file into sections determined by context lines
csplit bigfile /The End/+4            # break at the line that is 4 lines below "The End"
csplit -k bigfile /The End/+1 "{99}"  # break at the line below each occurrence of "The End", up to 99 times; -k keeps the pieces if fewer matches are found
Operate on fields within a line
cut — print selected parts of lines from each file
% cut -c1-10 file2 # cut characters 1 through 10 from file2
Tom progra
Dick lawye
Harry arti
% cut -d " " -f2 file1 # cut the second column (-f2); use a space as the delimiter (-d " ")
123
4787
98
1035
ls *.txt | cut -c1-3 | xargs mkdir # create directories with the names of the first three letters of each .txt file
paste — merge lines of files, separated by tabs. The columns of the input files are placed side-by-side with each other.
% paste file1 file2
Tom 123 Main Tom programmer
Dick 4787 West Dick lawyer
Harry 98 North Harry artist
Sue 1035 Cooper
join — join lines of two files on a common field (files should be sorted by common field)
% join -a 2 -a 1 -o 1.1,1.2,2.2 -e " " file1 file2
Tom 123 programmer
Dick 4787 lawyer
Harry 98 artist
Sue 1035
join -a 2 -a 1 -o 1.1,1.2,2.2 -e " " -1 1 -2 3 file1 file2
-a list unpairable lines in file1 and file2
-o display fields 1 and 2 of file1 and field 2 of file2
-e replace any empty output fields with blanks
-1 join on field 1 of file1
-2 join on field 3 of file2
sdiff — print differences between files
-s suppress identical lines
Operate on sorted files
sort — sort lines of text files
% sort +1 file1 # sort on the second column (the count starts at zero)
Sue 1035 Cooper
Tom 123 Main
Dick 4787 West
Harry 98 North
% sort -n +1 file1 # perform a numeric sort (-n) by the second column
Harry 98 North
Tom 123 Main
Sue 1035 Cooper
Dick 4787 West
use lensort to sort by line length
use chunksort to sort paragraphs separated by a blank line
uniq — displays unique lines from a sorted file
cat SOMEFILE | sort | uniq # this could be done more simply with: sort SOMEFILE | uniq
uniq -c filename   # prefix lines by the number of occurrences
uniq -d filename   # only print the lines that are not unique
uniq -D filename   # print all duplicate lines
uniq -i filename   # ignore differences in case when comparing
uniq -s N filename # skip the first N characters when comparing
uniq -u filename   # only print unique lines
To perform these operations on multiple files, it is often helpful to create a simple shell script to operate on the appropriate files.
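As a sketch of such a script (the *.txt pattern and the duplicate-counting pipeline are placeholders for whatever operation you need):

```shell
#!/bin/sh
# For each matching file, report how many duplicated lines it contains.
for f in *.txt; do
    [ -f "$f" ] || continue              # skip if the pattern matched nothing
    dups=$(sort "$f" | uniq -d | wc -l)  # count groups of repeated lines
    echo "$f: $dups duplicated lines"
done
```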
Assorted Examples that Combine Tools
These examples don’t necessarily rely on the sample files given above.
- Find all files beginning in the current directory and sum the number of lines in them:
find . -exec wc -l {} \; | awk '{total = total+$1; print total " " $1 " " $2}'
- Print the 4th, 3rd, and 2nd columns of SOMEFILE (in that order), and sort on the last column (the 2nd column of the original file):
cat SOMEFILE | awk '{ print $4 " " $3 " " $2 }' | sort +2
- Print the total size of all files:
find . -type f -name "*.*" -ls | awk 'BEGIN{ FILECNT = 0; T_SIZE = 0;} { T_SIZE += $7; FILECNT++} END{print "Total Files:", FILECNT, "Total Size:", T_SIZE, "Average Size:", T_SIZE / FILECNT;}'
- List all files with a size less than 100 bytes:
ls -l | awk '{if ($5 < 100) {print $5 " " $8}}'
Here $5 represents the column of file sizes produced by ls -l.
- Delete all files with a size less than 100 bytes:
ls -l | awk '{if ($5 < 100) {print $8}}' | xargs -i -t rm \{}
- If the number in the second column is less than 1000, prefix it with a zero:
awk '{if ($2 < 1000) {print $1 " 0" $2 " " $3} else {print $1 " " $2 " " $3}}' < dvd-titles2.sh > dvd-titles3.sh
- Combine file1 and file2 and show TAB characters as ^I:
% paste file1 file2 | cat -T
Tom 123 Main^ITom programmer
Dick 4787 West^IDick lawyer
Harry 98 North^IHarry artist
Sue 1035 Cooper^I
- Sort ratings.dat on column 2 and subsort on column 0 using : as the delimiter, redirecting the output to ratings-sorted.dat:
sort -t : -n +2 +0 ratings.dat > ratings-sorted.dat
- Cut the first and third columns of movies-ratings.dat, using : as the delimiter, and count the unique lines:
cut -d : -f 1,3 movies-ratings.dat | uniq -c
- In a file where each line begins with 'File' followed by one or more digits followed by '=', e.g., 'File23=', find the duplicates:
awk -F = '{print $2}' untitled.pls | sort | uniq -c | sort
- Find all files from the current location with filenames of at least 50 characters:
find . -exec basename {} \; | sed -n '/^.\{50\}/p'
- A file of closed captions needs to be cleaned up. Search for the blank lines and remove them as well as the two lines that follow each blank line. This works by not printing everything from the blank line (/^$/) to the line with the colons (/:/). Since the first section to clean up doesn't have a blank line above it, begin on the 3rd line of the file.
% head -7 0273-mary_shelleys_frankenstein.cc
1
00:00:30,063 --> 00:00:33,066
[ Woman ] "I BUSIED MYSELF TO THINK OF A STORY...

2
00:00:33,066 --> 00:00:37,570
"WHICH WOULD SPEAK TO THE MYSTERIOUS FEARS OF OUR NATURE...
%
% sed -n '3,${/^$/,/:/!p}' < 0273-mary_shelleys_frankenstein.cc > 0273-mary_shelleys_frankenstein.cc.clean
%
% head -3 0273-mary_shelleys_frankenstein.cc.clean
[ Woman ] "I BUSIED MYSELF TO THINK OF A STORY...
"WHICH WOULD SPEAK TO THE MYSTERIOUS FEARS OF OUR NATURE...
"AND AWAKEN...
- Search for lines containing ::0038:: or ::0148:: or ::0187::, use sed to replace the :: field delimiters with a %, and then perform a numerical sort on the second column. Note that egrep is equivalent to grep -E.
$ egrep "::0038::|::0148::|::0187::" ratings.dat | sed 's/::/%/g' | sort -t % +1 -n > match-ratings.txt
- Determine the disk usage of each subdirectory of the current directory, sort in descending order, and format for readability:
$ du -s * | sort -n -r | awk '{printf("%8.0fKB %s\n", $1, $2)}'
29223820KB bob
23038660KB tom
19999376KB sue
11010288KB andy
- For columns 3-6125, find those columns that have some value other than '0,' and count the number of occurrences:
#!/bin/sh
for col in $(seq 3 6125); do
echo "column $col"
awk '{print $'$col'}' allshots2nd10minutes.shots | grep -vc "0,"
done
- Print column 51 followed by the line number for that value, sorted by the values from column 51:
$ awk '{print $51 "\t" FNR}' allshots2nd5-10thIframes-sparse.shots | sort
- Extract the 6th column from all but the last line of somefile:
$ head -n -1 somefile | awk '{print $6}'
- Print all but the first column of somefile:
$ awk -f remove_first_column.awk somefile
where the file remove_first_column.awk consists of the following:
# remove_first_column.awk
BEGIN {
ORS=""
}
{
for (i = 2; i <= NF; i++)
if (i == NF)
print $i "\n"
else
print $i " "
}
- The first line of file1 contains header information, which we don't want. file2 lacks the column headers and therefore contains one less line than file1. Extract all but the first line of file1 and combine it with the columns of file2 to create file3, with the vertical bar (|) as the delimiter between the columns of each:
$ tail -n +2 file1 | paste -d '|' - file2 > file3
- Delete the lines up to and including the line matching the regular expression (REGEX):
$ sed '1,/REGEX/d;' somefile.txt
- Delete the lines up to, but not including, the line matching the regular expression (REGEX):
$ sed -e '/REGEX/p' -e '1,/REGEX/d;' somefile.txt
- Delete all newlines (this turns the entire document into a single line):
$ tr -d '\n' < somefile.txt
- Combine groups of nonblank lines into a single line, where each group is separated by a single blank line. This works by first changing each blank line to XXXXX; second, each newline is replaced by a space; third, each XXXXX is replaced with a newline in order to separate the original groups into lines:
$ cat somefile.txt
this is the
first section of
the file

this is the
second section of
the file

this is the
third section of
the file
$ sed 's/^$/XXXXX/' somefile.txt | tr '\n' ' ' | sed 's/XXXXX/\n/g' | sed 's/^ //'
this is the first section of the file
this is the second section of the file
this is the third section of the file
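awk can do the same grouping in one step using its paragraph mode; a sketch:

```shell
# With RS="" awk treats each blank-line-separated group as one record,
# and FS="\n" makes each line of the group a field. Assigning $1=$1
# rebuilds the record with OFS (a space), so each group prints as one line.
awk 'BEGIN{RS=""; FS="\n"; OFS=" "} {$1=$1; print}' somefile.txt
```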
- Remove non-alphabetic characters and convert uppercase to lowercase:
$ tr -cs "[:alpha:]" " " < somefile.txt | tr "[:upper:]" "[:lower:]"
References
- GNU core utilities
- Using the GNU text utilities
- awk one-liners
- The GNU Awk User’s Guide
- Awk: Dynamic Variables
- How to Use Awk (Hartigan)
- sed one-liners
- sed scripts
- Sed – An Introduction
- Perl one-liners
- Perl regular expressions
- Unix Power Tools, 2nd Ed., O’Reilly
- Linux Cookbook, 2nd Ed., No Starch Press
- Unix in a Nutshell, 3rd Ed., O’Reilly
- John & Ed’s Miscellaneous Unix Tips
- Classic Shell Scripting, O’Reilly — great overview of the Unix philosophy of combining small tools that are each very good at a specific thing