What is a UNIX Filter?
Every UNIX command has three parts.
- name: Every command has a name. A command reads like an imperative sentence that tells the operating system to do something.
- options: Commands can have zero or more options that modify their action.
- arguments: These are the things the command acts on. Some commands require arguments; others do not.
A UNIX filter is a special type of command which accepts a file as input, processes that file, and then makes a file as output, which by default goes to the screen. Filters traverse a file line by line.
[Figure: an input file goes into the filter, the filter's "entrails" process the input file, and an output file comes out.]
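The line-by-line behavior described above can be sketched in Python. This is only an illustration, not how any UNIX filter is actually implemented. The function name `run_filter` and the sample lines are made up for the demonstration, and in-memory text streams stand in for the input file and the screen.

```python
import io


def run_filter(infile, outfile, keep):
    """Read infile one line at a time and copy the lines that pass keep()."""
    for line in infile:
        # Strip the newline before testing, but write the line unchanged.
        if keep(line.rstrip("\n")):
            outfile.write(line)


# Hypothetical sample data: a three-line "file" and a stand-in for the screen.
source = io.StringIO("apple\nbanana\ncherry\n")
screen = io.StringIO()

# Keep only the lines that contain the string "an".
run_filter(source, screen, lambda text: "an" in text)
print(screen.getvalue(), end="")
```

The filter never needs the whole file at once; it looks at one line, decides, and moves on.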
These tools allow you to dig for information in a file.
- cat: put a file to the screen
- grep: search a file for a string and print out all lines containing that string
- uniq: get rid of duplicate neighbors
- sort: put the lines of a file in order
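To see what "duplicate neighbors" means, here is a rough Python sketch of what uniq and sort do to a list of lines. The helper name `uniq_neighbors` and the sample list are hypothetical, and Python's `sorted` orders strings by character value, which may differ from what sort does under your system's language settings.

```python
from itertools import groupby


def uniq_neighbors(lines):
    """Collapse each run of identical adjacent lines into a single line."""
    return [key for key, _group in groupby(lines)]


# Hypothetical sample: "n" appears twice in a row, then once more later.
lines = ["n", "n", "o", "n", "a"]

print(uniq_neighbors(lines))          # only the adjacent pair collapses
print(sorted(lines))                  # lines in order
print(uniq_neighbors(sorted(lines)))  # sorting first removes every duplicate
```

Only neighbors are compared, which is why sorting first is the usual way to remove all duplicates.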
cat does not filter; it just spews the entire file to the screen. Note: the unix> is just the prompt.
unix> cat chars.txt
a
b
c
d
e
f
g
h
i
j
k
l
m
n
n
o
p
q
r
s
t
u
v
w
x
y
z

A
B
C
D
E
F
G
H
I
J
K
L
M
N
N
P
Q
R
S
T
U
V
W
X
Y
Z
0
1
2
3
4
5
6
7
8
9
,
.
?
!
@
#
$
%
^
&
*
(
)
=
]
[
Here we will see an option at work. The -n option puts line numbers in the output file.
unix> cat -n chars.txt
     1  a
     2  b
     3  c
     4  d
     5  e
     6  f
     7  g
     8  h
     9  i
    10  j
    11  k
    12  l
    13  m
    14  n
    15  n
    16  o
    17  p
    18  q
    19  r
    20  s
    21  t
    22  u
    23  v
    24  w
    25  x
    26  y
    27  z
    28
    29  A
    30  B
    31  C
    32  D
    33  E
    34  F
    35  G
    36  H
    37  I
    38  J
    39  K
    40  L
    41  M
    42  N
    43  N
    44  P
    45  Q
    46  R
    47  S
    48  T
    49  U
    50  V
    51  W
    52  X
    53  Y
    54  Z
    55  0
    56  1
    57  2
    58  3
    59  4
    60  5
    61  6
    62  7
    63  8
    64  9
    65  ,
    66  .
    67  ?
    68  !
    69  @
    70  #
    71  $
    72  %
    73  ^
    74  &
    75  *
    76  (
    77  )
    78  =
    79  ]
    80  [
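The numbering option can be imitated in a few lines of Python. This is a sketch only: the function name `number_lines` is made up, and the six-column width followed by a tab is an assumption about how the numbers are laid out, not something the output above guarantees.

```python
def number_lines(lines):
    """Prefix every line, blank ones included, with its 1-based line number."""
    return [f"{n:6d}\t{line}" for n, line in enumerate(lines, start=1)]


# Hypothetical sample with an empty line in the middle.
sample = ["a", "b", "", "A"]

numbered = number_lines(sample)
for row in numbered:
    print(row)
```

Notice that the empty line still gets a number, just as line 28 does in the output above.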
grep needs two arguments. The first is a search string. The second is a file. (When grep reads its input from a pipe, the file is left out.) This filter will keep all lines containing the search string, and ignore the rest.
The file scrabble.txt is a scrabble dictionary. Let's print out all lines containing the string COW.
unix> grep COW scrabble.txt
BECOWARD
BECOWARDED
BECOWARDING
BECOWARDS
COW
COWAGE
COWAGES
COWARD
COWARDICE
COWARDICES
COWARDLINESS
COWARDLINESSES
COWARDLY
COWARDS
COWBANE
COWBANES
COWBELL
COWBELLS
COWBERRIES
COWBERRY
COWBIND
COWBINDS
COWBIRD
COWBIRDS
COWBOY
COWBOYED
COWBOYING
COWBOYS
COWCATCHER
COWCATCHERS
COWED
COWEDLY
COWER
COWERED
COWERING
COWERS
COWFISH
COWFISHES
COWFLAP
COWFLAPS
COWFLOP
COWFLOPS
COWGIRL
.
.
.
PICOWAVED
PICOWAVES
PICOWAVING
SCOW
SCOWDER
SCOWDERED
SCOWDERING
SCOWDERS
SCOWED
SCOWING
SCOWL
SCOWLED
SCOWLER
SCOWLERS
SCOWLING
SCOWLINGLY
SCOWLS
SCOWS
STUCCOWORK
STUCCOWORKS
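For a plain search string like COW, what grep does amounts to a substring test on each line. Here is a minimal Python sketch of that idea. The function name `grep_plain` and the short word list are hypothetical, and the sketch ignores everything grep does with magic characters.

```python
def grep_plain(needle, lines):
    """Keep the lines that contain needle anywhere inside them."""
    return [line for line in lines if needle in line]


# Hypothetical sample; the real dictionary is far longer.
words = ["BECOWARD", "COW", "CROW", "SCOWL", "WOC"]

hits = grep_plain("COW", words)
print(hits)
```

CROW and WOC are dropped because the three letters must appear together and in order.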
What is a character class?
A character class is a character wildcard: it stands for any one character from some set. We will learn about some basic
character classes. Most characters, with the exception of some magic
characters, are wildcards representing only themselves. F'rinstance,
a is just the letter a.
You can have a range character class. Let's make one. To use a
character class as a search item, we use the -E option on grep.
unix> grep -E '[a-f]' chars.txt
a
b
c
d
e
f
Now let's make a list character class.
You just put the characters you want inside of
[ ... ].
unix> grep -E '[arqs]' chars.txt
a
q
r
s
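Python's `re` module accepts the same two kinds of character class, so we can try them away from the terminal. This is a sketch under the assumption that these simple classes behave the same way in both tools; the list `chars` is a made-up sample, not the contents of chars.txt.

```python
import re

# Hypothetical sample of one-character lines.
chars = list("abcdefgqrsz") + ["A", "Q", "7"]

# A range class: any one character from a through f.
in_range = [c for c in chars if re.search(r"[a-f]", c)]

# A list class: any one of the characters a, r, q, or s.
in_list = [c for c in chars if re.search(r"[arqs]", c)]

print(in_range)
print(in_list)
```

The order of the characters inside a list class does not matter; the hits come out in the order the lines appear in the input.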
Now look at this mystery.
unix> grep -E '[Z-a]' chars.txt
a
Z
^
]
[
What is happening here? First, a bit of computer history. In the Fred and Barney days, all code was written using only upper-case letters. Lower case letters were added to the character set later.
Another interesting fact is that the characters
you see (even in plain text) on the screen are illusions.
Every character has a numerical value. English characters
are stored in a single byte. For example, the letter
a is 01100001. The letter A is
01000001. This encoding is called ASCII. There
is also a
Wikipedia article on this encoding.
Character ranges are, in my parlance, asciicographical; to wit, characters are ordered by their numerical (byte) values.
Python allows us to see this correspondence. The
ord function, given a one-character
string, will tell you the numerical value for
that character. The chr function goes
the other way. Here I invoke the Python interpreter
for this demonstration.
Python 3.11.5 (main, Sep 29 2023, 18:17:13) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ord("a")
97
>>> ord("]")
93
>>> ord("Z")
90
>>> chr(91)
'['
>>> chr(92)
'\\'
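With ord and chr in hand, the [Z-a] mystery can be worked out directly. The sketch below lists every character whose value lies from Z through a, then checks which of the punctuation characters in chars.txt fall in that range. The variable names are made up for the illustration.

```python
# Every character whose numerical value runs from ord("Z") to ord("a").
between = [chr(n) for n in range(ord("Z"), ord("a") + 1)]
print(between)

# The punctuation characters from chars.txt, in file order.
punctuation = ", . ? ! @ # $ % ^ & * ( ) = ] [".split()

# Which of them does the range [Z-a] pick up?
caught = [c for c in punctuation if c in between]
print(caught)
```

The range covers values 90 through 97, so besides Z and a it takes in the six characters that sit between the upper-case and lower-case letters. Three of those six appear in chars.txt.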
A regular expression (regex) is something you build with character classes. Its most basic operation is juxtaposition, which means "and then immediately". For example, the regex COW means C, and then immediately O, and then immediately W. A character class is just a one-character regex.
Let us introduce the character class . (a period).
This character class represents any single character
except for a newline. We will demonstrate some
simple regexes now.
Let's Cheat on an NYT Crossword!
We have a few tasty situations you might find in a crossword. Let us introduce two items into regular expressions. The character
^ means "beginning of line," and the character
$ means "end of line."
Our dictionary file scrabble.txt
has one word to a line. We will use the
-i option on grep
to keep it case-insensitive.
Puzzle 1: cat on the spot:
- word starting
- a blank (one character)
- a blank
- the letter E
- the letter L
- a blank
- the letter T
- word ending
all in immediate succession. Now let's build our regex.
- word starting: ^
- a blank (one character): .
- a blank: .
- the letter E: e
- the letter L: l
- a blank: .
- the letter T: t
- word ending: $
A hunting we will go!
unix> grep -iE '^..el.t$' scrabble.txt
EYELET
OCELOT
OMELET
Our cat on the spot has spots: OCELOT.
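The job done by ^ and $ is easy to miss, so here is a Python sketch that runs the same pattern with and without them. It assumes Python's `re` module treats this simple pattern the way grep -iE does, with `re.IGNORECASE` standing in for the -i option. The word list is a small made-up sample.

```python
import re

# Hypothetical sample; the real dictionary is far longer.
words = ["EYELET", "OCELOT", "OMELET", "OMELETS", "CAMELOT", "BELT"]

# Anchored: the whole line must be exactly six characters fitting the pattern.
anchored = re.compile(r"^..el.t$", re.IGNORECASE)
# Unanchored: the pattern may sit anywhere inside the line.
loose = re.compile(r"..el.t", re.IGNORECASE)

exact_hits = [w for w in words if anchored.search(w)]
loose_hits = [w for w in words if loose.search(w)]

print(exact_hits)
print(loose_hits)
```

Without the anchors, longer words that merely contain a fitting stretch slip through, which is no help when the crossword entry has exactly six squares.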
Puzzle 2: 8 lb baby:
unix> grep -iE '^..ed..h.m...$' scrabble.txt
SLEDGEHAMMER
A regular sledgehammer is 16 lb. The 8 lb version is a baby sledge.
Puzzle 3: butter:
unix> grep -iE '^g..t$' scrabble.txt
GAIT
GAST
GELT
GENT
GEST
GHAT
GIFT
GILT
GIRT
GIST
GLUT
GNAT
GOAT
GOUT
GRAT
GRIT
GROT
GUST
This looks mysterious until you remember what goats do: they butt heads. The goat is a butter!
Puzzle 4: amative preparation:
unix> grep -iE '^..il..e$' scrabble.txt
ANILINE
AXILLAE
EPILATE
FAILURE
GHILLIE
PHILTRE
SOILAGE
SOILURE
UTILISE
UTILIZE
PHILTRE is the love potion.
Puzzle 5: nose gutter:
unix> grep -iE '^p...t..m$' scrabble.txt
PHANTASM
PHILTRUM
PLASTRUM
PLECTRUM
PHILTRUM is the word.
Let's Cheat at Keyword in the WaPo!
Here is the game we solved.
Here is what we did. We figured out all letters we could put in the blanks.
Now our regex:
unix> grep -iE '^[bcfghlmoprstv]e[lnst][aio][oy][hr]$' scrabble.txt
SENIOR
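A regex like this one can be assembled mechanically from the letters allowed in each blank. The Python sketch below does that; the list name `candidates` and the sample words are made up, and wrapping the lone e in brackets is just a convenience that means the same thing as a bare e.

```python
import re

# Letters that could go in each of the six blanks, left to right.
candidates = ["bcfghlmoprstv", "e", "lnst", "aio", "oy", "hr"]

# One list character class per blank, anchored at both ends.
regex = "^" + "".join(f"[{letters}]" for letters in candidates) + "$"
pattern = re.compile(regex, re.IGNORECASE)

# Hypothetical sample; the real dictionary is far longer.
words = ["SENIOR", "SENSOR", "MENTOR", "JUNIOR"]
hits = [w for w in words if pattern.search(w)]

print(regex)
print(hits)
```

Each near miss fails at a single position, which shows how much work one small character class does.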
Hunting in the Dictionary
Here we find all words with at least three Zs in the scrabble dictionary. The * after the . means "zero or more of." So we look for a Z followed by some characters (possibly none), another Z followed by more characters (possibly none), then a final Z.
unix> grep -iE 'z.*z.*z' scrabble.txt
BEZAZZ
BEZAZZES
PAZAZZ
PAZAZZES
PIZAZZ
PIZAZZES
PIZAZZY
PIZZAZ
PIZZAZES
PIZZAZZ
PIZZAZZES
PIZZAZZY
RAZZAMATAZZ
RAZZAMATAZZES
RAZZMATAZZ
RAZZMATAZZES
ZIZZLE
ZIZZLED
ZIZZLES
ZIZZLING
ZYZZYVA
ZYZZYVAS
ZZZ
We also did this in the bigger file hughJass.txt.
unix> grep 'z.*z.*z' hughJass.txt
benzeneazobenzene
bezazz
bezazzes
drizzle-drozzle
fuzzy-guzzy
fuzzy-wuzzy
mezzo-mezzo
pazazz
pazazzes
pizazz
pizazzes
pizazzy
pizzazz
pizzazzes
razzle-dazzle
razzmatazz
zizz
zizzle
zizzled
zizzles
zizzling
zyzzyva
zyzzyvas
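One way to convince yourself that z.*z.*z means "at least three Zs" is to compare it with plain counting. This Python sketch does so on a small made-up word list, assuming Python's `re` module handles this pattern the same way grep -E does.

```python
import re

pattern = re.compile(r"z.*z.*z", re.IGNORECASE)

# Hypothetical sample; the real dictionary is far longer.
words = ["PIZZAZ", "ZYZZYVA", "PIZZA", "RAZZLE", "ZZZ", "RAZZMATAZZ"]

# The regex way: a z, anything, a z, anything, a z.
by_regex = [w for w in words if pattern.search(w)]
# The counting way: three or more z's anywhere in the word.
by_count = [w for w in words if w.lower().count("z") >= 3]

print(by_regex)
print(by_regex == by_count)
```

Because .* may match nothing at all, ZZZ counts, and words with four Zs count too.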
Let's Cheat at Wordle!
We began with ADIEU. Wordle told us
we had an A in the right place and the
E was in the wrong place. If you have
a character class such as [ABC], then
[^ABC] is anything except A, B, or C.
So our hint gives us the regex ^A..[^E].$
unix> grep -iE '^A..[^E].$' scrabble.txt
AALII
AARGH
ABACA
ABACI
ABACK
ABAFT
ABAKA
ABAMP
ABASE
ABASH
ABATE
ABAYA
ABBAS
ABBOT
ABEAM
ABELE
ABETS
ABHOR
ABIDE
ABMHO
ABODE
ABOHM
ABOIL
ABOMA
ABOON
ABORT
.
.
.
AWOKE
AWOLS
AXELS
AXIAL
AXILE
AXILS
AXING
AXIOM
AXION
AXITE
AXMAN
AXONE
AXONS
AYAHS
AYINS
AZANS
AZIDE
AZIDO
AZINE
AZLON
AZOIC
AZOLE
AZONS
AZOTE
AZOTH
AZUKI
AZURE
There is a lot of crap we don't want. For instance, AZURE has U. Wordle ruled that out. And AXMAN has no E, so it's a dud.
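We can filter for items with an E using
grep E. We can exclude the duds
using grep -v '[DIU]'; the -v option keeps only the lines that do not match. The quotes keep the shell from treating the brackets as a filename pattern. We can connect these with
a pipe (|). Videte et Spectate!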
unix> grep -iE '^A..[^E].$' scrabble.txt | grep -v '[DIU]' | grep E
ABATE
ABEAM
ABELE
ABOVE
ACETA
AGAPE
AGATE
AGAVE
AGAZE
AGENE
AGENT
AGONE
AKELA
AKENE
ALANE
ALATE
ALEPH
ALGAE
ALONE
AMAZE
AMBLE
AMEBA
AMENT
AMOLE
AMPLE
ANELE
ANENT
ANGLE
ANKLE
ANOLE
ANTAE
APACE
APEAK
APPLE
ATONE
AWAKE
AWOKE
AXONE
AZOLE
AZOTE
Still a lot. Vanna White reminds us of the magic of RSTLNE. We eliminated R and S and O. So we revise our query as follows.
unix> grep -iE '^A..[^E][^E]$' scrabble.txt | grep -v '[ORSDIU]' | grep E
ABEAM
ACETA
AGENT
AKELA
ALEPH
AMEBA
AMENT
ANENT
APEAK
The result was AGENT.
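The three-stage pipe can be mimicked in Python, where each stage takes the previous stage's output as its input. This is a sketch that assumes these patterns mean the same thing to Python's `re` module as they do to grep -E. The word list is a small made-up sample, and the names `stage1`, `stage2`, and `stage3` are just labels for the three greps.

```python
import re

# Hypothetical sample; the real dictionary is far longer.
words = ["AGENT", "AZURE", "AXMAN", "ADEPT", "ALEPH", "ABATE"]

# Stage 1: starts with A, five letters, no E in the last two positions.
stage1 = [w for w in words if re.search(r"^A..[^E][^E]$", w, re.IGNORECASE)]

# Stage 2: like grep -v, throw out anything containing a ruled-out letter.
stage2 = [w for w in stage1 if not re.search(r"[ORSDIU]", w)]

# Stage 3: keep only the words that still have an E somewhere.
stage3 = [w for w in stage2 if re.search(r"E", w)]

print(stage1)
print(stage2)
print(stage3)
```

Each stage only ever shrinks the list, so three simple filters in a row do the work of one complicated one.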
Where do I get this tool?
Go to Getting BASH to find out. This tool is on all Macs, and you can install the UNIX subsystem on Windoze machines.