This was a class I gave on Alumni Day at NCSSM in October of 2023. Much of the audience consisted of graduates of the 20-year class of 2003, along with a motley assortment of other brigands. I was requested to do this by Jenna Ingersoll '03.
What is a UNIX Filter?
Every UNIX command has three parts.
- name: Every command has a name. This is an imperative sentence that tells the operating system to do something.
- options: Commands can have zero or more options that modify their action.
- arguments: These are things the command acts on. Some commands require no arguments, some do.
A UNIX filter is a special type of command which accepts a file as input, processes that file, and then makes a file as output, which by default goes to the screen. Filters traverse a file line by line.
[Diagram: the INPUT FILE goes into the filter, the entrails process the input file, and the OUTPUT FILE comes out.]
These tools allow you to dig for information in a file.
- cat: put a file to the screen
- grep: search a file for a string and print out all lines containing that string
- uniq: get rid of duplicate neighbors
- sort: put lines of file in order
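For readers who think in Python (which this class uses later on), here is a rough sketch of what each of these four filters does to a list of lines. The function names and the sample lines are made up for illustration, and this grep only looks for a plain string; the real tools do a good deal more.

```python
def cat(lines):
    """Pass every line through unchanged."""
    return list(lines)

def grep(needle, lines):
    """Keep only the lines that contain the string `needle`."""
    return [line for line in lines if needle in line]

def uniq(lines):
    """Drop a line when it equals the line just before it."""
    result = []
    for line in lines:
        if not result or line != result[-1]:
            result.append(line)
    return result

def sort(lines):
    """Put the lines in order."""
    return sorted(lines)

# Hypothetical sample data standing in for the lines of a file.
sample = ["cow", "cow", "goat", "cow", "ocelot"]

print(cat(sample))
print(grep("co", sample))
print(uniq(sample))        # only neighboring duplicates disappear
print(uniq(sort(sample)))  # sorting first makes all duplicates neighbors
```

Notice that uniq alone leaves the third cow in place because it is not next to the other two. Sorting first is the usual way around that.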
The filter cat does not filter; it just spews the entire file to the screen. Note: the unix> is just the system prompt.
unix> cat chars.txt a b c d e f g h i j k l m n n o p q r s t u v w x y z A B C D E F G H I J K L M N N P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 , . ? ! @ # $ % ^ & * ( ) = ] [
Here we will see an option at work. The -n option puts line numbers in the output.
unix> cat -n chars.txt 1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h 9 i 10 j 11 k 12 l 13 m 14 n 15 n 16 o 17 p 18 q 19 r 20 s 21 t 22 u 23 v 24 w 25 x 26 y 27 z 28 29 A 30 B 31 C 32 D 33 E 34 F 35 G 36 H 37 I 38 J 39 K 40 L 41 M 42 N 43 N 44 P 45 Q 46 R 47 S 48 T 49 U 50 V 51 W 52 X 53 Y 54 Z 55 0 56 1 57 2 58 3 59 4 60 5 61 6 62 7 63 8 64 9 65 , 66 . 67 ? 68 ! 69 @ 70 # 71 $ 72 % 73 ^ 74 & 75 * 76 ( 77 ) 78 = 79 ] 80 [
The filter grep needs two arguments. The first is a search string. The second is a file. This filter will filter in all lines containing the search string and ignore the rest.
The file scrabble.txt is a Scrabble dictionary. Let's print out all lines containing the string COW.
unix> grep COW scrabble.txt BECOWARD BECOWARDED BECOWARDING BECOWARDS COW COWAGE COWAGES COWARD COWARDICE COWARDICES COWARDLINESS COWARDLINESSES COWARDLY COWARDS COWBANE COWBANES COWBELL COWBELLS COWBERRIES COWBERRY COWBIND COWBINDS COWBIRD COWBIRDS COWBOY COWBOYED COWBOYING COWBOYS COWCATCHER COWCATCHERS COWED COWEDLY COWER COWERED COWERING COWERS COWFISH COWFISHES COWFLAP COWFLAPS COWFLOP COWFLOPS COWGIRL . . . PICOWAVED PICOWAVES PICOWAVING SCOW SCOWDER SCOWDERED SCOWDERING SCOWDERS SCOWED SCOWING SCOWL SCOWLED SCOWLER SCOWLERS SCOWLING SCOWLINGLY SCOWLS SCOWS STUCCOWORK STUCCOWORKS
What is a character class?
This is a character wildcard. We will learn about some basic character classes. Most characters, with the exception of some magic characters, are wildcards representing only themselves. F'rinstance, the character a is just the letter a.
You can have a range character class. Let's make one. We will use the -E option for grep, which turns on extended regular expressions; bracketed character classes like these also work without it.
unix> grep -E '[a-f]' chars.txt
a
b
c
d
e
f
Now let's make a list character class. You just put the characters you want inside of [ ... ].
unix> grep -E '[arqs]' chars.txt
a
q
r
s
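The same two kinds of character class can be tried in Python, whose re module uses the same bracket notation. This is only a sketch: the list of lines is a made-up stand-in for a few lines of chars.txt, and the helper name matching is invented here.

```python
import re

# A made-up stand-in for a few lines of chars.txt.
lines = ["a", "f", "g", "q", "r", "s", "A", "7"]

def matching(pattern, lines):
    """Return the lines in which `pattern` matches somewhere, as grep does."""
    return [line for line in lines if re.search(pattern, line)]

print(matching("[a-f]", lines))   # range character class
print(matching("[arqs]", lines))  # list character class
```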
Now look at this mystery.
unix> grep -E '[Z-a]' chars.txt
a
Z
^
]
[
What is happening here? First, a bit of computer history. In the Fred and Barney days, all code was written using only upper-case letters. Lower case letters were added to the character set later.
Another interesting fact is that the characters you see (even in plain text) on the screen are illusions. Every character has a numerical value. English characters are stored in a single byte. For example, the letter a is 01100001. The letter A is 01000001. This encoding is called ASCII. There is also a Wikipedia article on this encoding.
Character ranges are, in my parlance, asciicographical; to wit, characters are ordered by their numerical (byte) values.
Python allows us to see this correspondence. The ord function, given a one-character string, will tell you the numerical value for that character. The chr function goes the other way. Here I invoke Python for this demonstration.
Python 3.11.5 (main, Sep 29 2023, 18:17:13) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ord("a")
97
>>> ord("]")
93
>>> ord("Z")
90
>>> chr(91)
'['
>>> chr(92)
'\\'
Mystery solved.
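To see every character that the range [Z-a] covers, we can walk the numerical values from Z up to a. Here is a short sketch using only ord, chr, and format; the variable names are made up for the occasion.

```python
# Every character whose numerical value lies between Z and a, inclusive.
in_range = [chr(n) for n in range(ord("Z"), ord("a") + 1)]
print(in_range)

# The eight bits that store a and A.
bits = {ch: format(ord(ch), "08b") for ch in "aA"}
print(bits)
```

Of the eight characters in that range, chars.txt contains a, Z, ^, ], and [, which are exactly the lines grep printed.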
Regexes
A regular expression (regex) is something you build with character classes. Its most basic operation is juxtaposition, which means "and then immediately". A character class is just a one-character regex.
Let us introduce the character class . (a period). This character class represents any single character except for a newline. We will demonstrate some simple regexes now.
Let's Cheat on an NYT Crossword!
We have a few tasty situations you might find in a crossword. Let us introduce two items into regular expressions. The character ^ means "beginning of line," and the character $ means "end of line."
Our dictionary file scrabble.txt has one word to a line. We will use the -i option on grep to keep it case-insensitive.
Puzzle 1:
cat on the spot: --el-t
We have
- word starting
- a blank (one character)
- a blank
- an e
- an l
- a blank
- a t
- word ending
all in immediate succession. Now let's build our regex.
- word starting: ^
- a blank (one character): ^.
- a blank: ^..
- an e: ^..e
- an l: ^..el
- a blank: ^..el.
- a t: ^..el.t
- word ending: ^..el.t$
A hunting we will go!
unix> grep -iE '^..el.t$' scrabble.txt
EYELET
OCELOT
OMELET
Our cat on the spot has spots: OCELOT.
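The step-by-step construction above is mechanical, so it can be sketched in a few lines of Python. The function name crossword_regex and the short word list are made up for illustration; the real search ran over scrabble.txt.

```python
import re

def crossword_regex(pattern):
    """Turn a crossword pattern such as '--el-t' into an anchored regex.

    Each '-' is a blank and becomes '.'; every other character stands
    for itself.
    """
    body = "".join("." if ch == "-" else re.escape(ch) for ch in pattern)
    return "^" + body + "$"

# A made-up handful of words standing in for the dictionary.
words = ["EYELET", "OCELOT", "OMELET", "PELLET", "COW", "SCOWLED"]

regex = crossword_regex("--el-t")
print(regex)

# re.IGNORECASE plays the role of grep's -i option.
hits = [w for w in words if re.search(regex, w, re.IGNORECASE)]
print(hits)
```

PELLET has the right length and the right last letter but fails because its third and fourth letters are LL, not EL.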
Puzzle 2:
8 lb baby: --ed--h-m---
unix> grep -iE '^..ed..h.m...$' scrabble.txt
SLEDGEHAMMER
A regular sledgehammer is 16 lb. The 8 lb version is a baby sledge.
Puzzle 3:
butter: g--t
unix> grep -iE '^g..t$' scrabble.txt GAIT GAST GELT GENT GEST GHAT GIFT GILT GIRT GIST GLUT GNAT GOAT GOUT GRAT GRIT GROT GUST
This looks mysterious until you remember what goats do: they butt heads. The goat is a butter!
Puzzle 4:
amative preparation: --il--e
unix> grep -iE '^..il..e$' scrabble.txt ANILINE AXILLAE EPILATE FAILURE GHILLIE PHILTRE SOILAGE SOILURE UTILISE UTILIZE
PHILTRE is the love potion.
Puzzle 5:
nose gutter: p---t--m
unix> grep -iE '^p...t..m$' scrabble.txt PHANTASM PHILTRUM PLASTRUM PLECTRUM
PHILTRUM is the word.
Let's Cheat at Keyword in the WaPo!
Here is the game we solved.
Here is what we did. We figured out all letters we could put in the blanks.
-at: bcefghlmopqrstv
wer-: e
pa-try: lnst
t-me: aio
ph-to: oy
c-eep: hr
Now our regex, with one list character class per blank: ^[bcfghlmoprstv]e[lnst][aio][oy][hr]$
unix> grep -iE '^[bcfghlmoprstv]e[lnst][aio][oy][hr]$' scrabble.txt SENIOR
Boom.
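Gluing the letter lists together into a regex is another mechanical job. Here is a minimal sketch in Python, using the same letter lists as the grep command above; the function name keyword_regex and the short word list are invented for illustration.

```python
import re

# Candidate letters for each position of the keyword, in order.
# The second position has only one candidate, so it is written literally.
candidates = ["bcfghlmoprstv", "e", "lnst", "aio", "oy", "hr"]

def keyword_regex(candidates):
    """Build one list character class per position, anchored at both ends."""
    parts = [c if len(c) == 1 else "[" + c + "]" for c in candidates]
    return "^" + "".join(parts) + "$"

regex = keyword_regex(candidates)
print(regex)

# A made-up handful of words standing in for scrabble.txt.
words = ["SENIOR", "SENATE", "JUNIOR", "SENSOR"]
hits = [w for w in words if re.search(regex, w, re.IGNORECASE)]
print(hits)
```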
Hunting in the Dictionary
Here we find all words with at least three Zs in the Scrabble dictionary. The * after the . means "zero or more of." So we look for a Z followed by some characters (possibly none), another Z followed by more characters (possibly none), then a final Z.
unix> grep -iE 'z.*z.*z' scrabble.txt BEZAZZ BEZAZZES PAZAZZ PAZAZZES PIZAZZ PIZAZZES PIZAZZY PIZZAZ PIZZAZES PIZZAZZ PIZZAZZES PIZZAZZY RAZZAMATAZZ RAZZAMATAZZES RAZZMATAZZ RAZZMATAZZES ZIZZLE ZIZZLED ZIZZLES ZIZZLING ZYZZYVA ZYZZYVAS ZZZ
We also did this in the bigger file hughJass.txt.
unix> grep 'z.*z.*z' hughJass.txt benzeneazobenzene bezazz bezazzes drizzle-drozzle fuzzy-guzzy fuzzy-wuzzy mezzo-mezzo pazazz pazazzes pizazz pizazzes pizazzy pizzazz pizzazzes razzle-dazzle razzmatazz zizz zizzle zizzled zizzles zizzling zyzzyva zyzzyvas
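The same pattern works unchanged in Python. In this sketch the word list is made up, and the second print is a sanity check: for a single line of text, matching z.*z.*z is the same as containing at least three z's.

```python
import re

# A made-up handful of words standing in for the dictionary files.
words = ["pizza", "pizzazz", "zizzle", "razzle-dazzle", "zebra", "zzz"]

three_z = re.compile("z.*z.*z")
by_regex = [w for w in words if three_z.search(w)]
print(by_regex)

# Counting the letter directly gives the same answer.
by_count = [w for w in words if w.count("z") >= 3]
print(by_count)
```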
Let's Cheat at Wordle!
We began with ADIEU. Wordle told us we had an A in the right place and the E was in the wrong place. If you have a character class such as [ABC], the character class [^ABC] matches any single character except A, B, or C.
So our hint gives us the regex ^A..[^E].$
unix> grep -iE '^A..[^E].$' scrabble.txt AALII AARGH ABACA ABACI ABACK ABAFT ABAKA ABAMP ABASE ABASH ABATE ABAYA ABBAS ABBOT ABEAM ABELE ABETS ABHOR ABIDE ABMHO ABODE ABOHM ABOIL ABOMA ABOON ABORT . . . AWOKE AWOLS AXELS AXIAL AXILE AXILS AXING AXIOM AXION AXITE AXMAN AXONE AXONS AYAHS AYINS AZANS AZIDE AZIDO AZINE AZLON AZOIC AZOLE AZONS AZOTE AZOTH AZUKI AZURE
There is a lot of crap we don't want. For instance, AZURE has U. Wordle ruled that out. And AXMAN has no E, so it's a dud.
We can filter for items with an E using grep E. We can exclude the duds using grep -v [DIU]. We can connect these with a pipe (|). Videte et Spectate!
unix> grep -iE '^A..[^E].$' scrabble.txt | grep -v [DIU] | grep E ABATE ABEAM ABELE ABOVE ACETA AGAPE AGATE AGAVE AGAZE AGENE AGENT AGONE AKELA AKENE ALANE ALATE ALEPH ALGAE ALONE AMAZE AMBLE AMEBA AMENT AMOLE AMPLE ANELE ANENT ANGLE ANKLE ANOLE ANTAE APACE APEAK APPLE ATONE AWAKE AWOKE AXONE AZOLE AZOTE
Still a lot. Vanna White reminds us of the magic of RSTLNE. We eliminated R and S and O. So we revise our query as follows, this time also ruling out an E in the last position:
unix> grep -iE '^A..[^E][^E]$' scrabble.txt | grep -v [ORSDIU] |grep E ABEAM ACETA AGENT AKELA ALEPH AMEBA AMENT ANENT APEAK
The result was AGENT.
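A pipe hands the output of one filter to the next, so the three-stage query above can be imitated in Python with three list comprehensions in a row. This is only a sketch under the assumption of a tiny made-up word list; the real search ran over scrabble.txt.

```python
import re

# A made-up handful of five-letter words standing in for scrabble.txt.
words = ["AGENT", "AZURE", "AXMAN", "ABATE", "ALONE", "AMENT", "ABORT"]

# Stage 1: A comes first, and E is in neither the fourth nor fifth position.
stage1 = [w for w in words if re.search("^A..[^E][^E]$", w)]
# Stage 2: like grep -v, drop words containing an eliminated letter.
stage2 = [w for w in stage1 if not re.search("[ORSDIU]", w)]
# Stage 3: like grep E, keep only words that contain an E somewhere.
stage3 = [w for w in stage2 if re.search("E", w)]

print(stage1)
print(stage2)
print(stage3)
```

Each stage only ever sees what the previous stage let through, which is what makes small filters so easy to combine.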
Where do I get this tool?
Go to Getting BASH to find out. This tool is on all Macs, and you can install the UNIX subsystem on Windoze machines.