Basic Unix - Regular Expressions
This article describes the use of regular expressions in Unix.
Using Regular Expressions
A regular expression is a sequence of characters that represents a pattern. Such a pattern can be a fixed word, like "silly" or "elephantine," or can describe something more general, like "any word starting with the letter `r' and ending in a vowel," or "any five-digit number which doesn't contain a 3." This flexibility makes regular expressions invaluable tools: they can be used to find particular lines in a file, for example, or to instruct a program to take certain actions when presented with a certain text string. A knowledge of them is helpful in using Unix-based editors (like Emacs and vi), almost all Unix search tools (such as grep, sed and awk), and other utilities and applications (like Usenet newsreaders and the procmail mail filter).
In the rules for writing regular expressions below, the word "character" means any character except a newline (the "Return" or "Enter" key).
Metacharacters in regular expressions
Regular expressions are constructed by combining ordinary characters (m, a, 3) with a few special characters, or "metacharacters" (*, ^, $). When one of these metacharacters appears in a regular expression, it has a meaning other than its ordinary use as punctuation. Here are the special characters you need to know about (with examples drawn from the search tool `grep,' and killfiles in the newsreader `trn'):
The caret. When a caret appears at the beginning of a regular expression (except when in square brackets), then the string being matched must appear at the beginning of a line (or string). Nothing may precede the match -- even spaces and tabs can't appear earlier on the line.
grep '^Jefferson' presidents-list
would turn up "Jefferson, Thomas," but not "Clinton, William Jefferson." (It is good practice to put quotes around search strings when using `grep,' to ensure that they are not interpreted by the shell.)
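The anchoring behavior can be tried without a real presidents-list file. The following is only a sketch: the two sample lines and the variable name `matches` are assumptions made for illustration.

```shell
# Hypothetical two-line sample standing in for presidents-list.
sample='Jefferson, Thomas
Clinton, William Jefferson'

# The caret anchors the match to the start of the line, so only the
# line that begins with "Jefferson" is selected.
matches=$(printf '%s\n' "$sample" | grep '^Jefferson')
printf '%s\n' "$matches"
```

Dropping the caret from the pattern would select both lines, since "Jefferson" appears somewhere in each.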
In a trn killfile, the line:
would junk (`j') all articles with From: headers listing a specific address.
The dollar sign. The opposite of the caret: when a dollar sign appears at the end of a regular expression, the string being matched must appear at the end of a line (or string).
grep 'gry$' /usr/dict/words
would display every word in the online dictionary (/usr/dict/words) that ends in "gry," whereas the command:
grep '^$' somefile | wc
would count the blank lines in a file: grep selects the empty lines, and the first number printed by wc is the count of lines it received.
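Both uses of the dollar sign can be sketched with small made-up inputs, since /usr/dict/words is not present on every system. The word list, the sample text, and the variable names `words` and `blank` are assumptions; the sketch also uses grep's standard -c option, which prints a count of matching lines, in place of the pipe to wc.

```shell
# Hypothetical word list standing in for /usr/dict/words.
# "gryphon" contains "gry" but does not end with it, so it is skipped.
words=$(printf 'angry\ngryphon\nhungry\n' | grep 'gry$')
printf '%s\n' "$words"

# '^$' matches a line with nothing between its start and its end.
# grep -c prints how many lines matched instead of the lines themselves.
blank=$(printf 'one\n\ntwo\n\n\nthree\n' | grep -c '^$')
printf '%s\n' "$blank"
```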
The period. This represents any character. The regular expression "p.n" matches the words "pin," "pan," "pen," or any string where a `p' and an `n' appear, separated by one character. For instance,
grep 'Peters.n' class.roster
would show all Petersons and all Petersens in the hypothetical file "class.roster."
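As a rough illustration, the roster below is invented (class.roster is hypothetical in the original as well), and `found` is just a name chosen for this sketch. The entry "Peters9n" is included only to show that the period is not limited to letters.

```shell
# Hypothetical roster entries standing in for class.roster.
found=$(printf 'Peterson\nPetersen\nPeters\nPeters9n\n' | grep 'Peters.n')
printf '%s\n' "$found"
```

"Peters" on its own is not selected, because the period must match exactly one character and an `n' must follow it.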
The asterisk. Any one-character expression followed by an asterisk will match that character zero or more times. The regular expression "his*" matches the strings "his," "hiss," or "hissssssssssss" -- but also "hi" (an `h,' followed by an `i,' followed by zero `s' characters).
grep 'tech.*support' mbox
would locate references to "techsupport" as well as "technical advisors and support staff," so long as "tech" preceded "support" on the same line. (The period followed by an asterisk matches any run of zero or more characters.)
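The sketch below contrasts an asterisk applied to a single letter with an asterisk applied to the period. The sample lines and variable names are assumptions made for illustration. The first command uses grep's standard -x option, which requires the pattern to match the whole line, so that the effect of "his*" is easier to see.

```shell
# 'his*' is "hi" followed by zero or more 's' characters.
# With -x the whole line must match, so "hit" and "h" are rejected.
hiss=$(printf 'hi\nhis\nhiss\nhit\nh\n' | grep -x 'his*')
printf '%s\n' "$hiss"

# Hypothetical mailbox lines.
lines='techsupport
tecsupport
technical advisors and support staff'

# Here the asterisk applies only to the 'h': "tec", any number of h's, "support".
narrow=$(printf '%s\n' "$lines" | grep 'tech*support')
printf '%s\n' "$narrow"

# '.*' allows any run of characters between "tech" and "support".
wide=$(printf '%s\n' "$lines" | grep 'tech.*support')
printf '%s\n' "$wide"
```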
In a trn killfile, the line:
would junk articles crossposted to four or more newsgroups (the names of which are separated by colons in an Xref header).
The backslash. This means "turn off the special meaning of the next character." For example, the expression "\$11\.00" would match only the string "$11.00", and would not try to interpret the dollar sign to mean "the end of a line," or the period to mean "any character here." If you need to match a backslash, precede it with another backslash (\\) to turn off its special meaning.
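A small sketch of escaping with grep follows. The sample lines and the variable names `prices` and `slashes` are assumptions for illustration. The patterns are wrapped in single quotes so that the shell passes the backslashes and dollar signs to grep unchanged.

```shell
# Only the line containing the literal text "$11.00" is selected:
# "$11x00" fails because '\.' must be a real period, and
# "11.00" fails because '\$' must be a real dollar sign.
prices=$(printf '%s\n' 'Total: $11.00' 'Total: $11x00' 'Total: 11.00' | grep '\$11\.00')
printf '%s\n' "$prices"

# A doubled backslash in the pattern matches one literal backslash.
slashes=$(printf '%s\n' 'a\b' 'ab' | grep '\\')
printf '%s\n' "$slashes"
```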
In a trn killfile, the line:
would junk articles with the string "$$" in their subject lines (which are often messages offering large amounts of money to unsuspecting readers: $$$$$ MAKE MONEY FA$T $$$$$).
Square brackets. These represent any single character contained between the square brackets. The expression p[iae]n means: the letter `p,' followed by any of the letters `i,' `a' or `e,' followed by the letter `n.' This expression will match the words "pin" or "pan" or "pen," but not "pun." It will also not match "pain," even though both `a' and `i' appear between the square brackets.
Square brackets may also be used to match a range of characters, when two characters between the brackets are separated by a hyphen. So the regular expression "[A-Z]" means "any uppercase English letter"; "[a-fh-z]" matches any lowercase letter except for `g'; "[0-9]" matches any digit (remember that 0 precedes 1 in the ASCII character set).
If the first character between the brackets is a caret, it's used to mean "not these characters"; any character except one falling between the brackets can match. The expression "[^12345]" matches any character, except for the digits one through five. (Naturally, when ^ appears here it doesn't mean "match the beginning of a line.") This rule can combine with the previous rule to produce even more complicated character ranges: "[^A-Za-z0-9]" means "any character that is neither a letter nor a digit."
grep '606[0-9][0-9]' address-book
would list every line containing 606 followed by two more digits, and so everyone in a Chicago zip code.
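The three bracket rules above can be sketched together. All sample data and variable names here are assumptions for illustration. The -x option (match the whole line) is used in the first command so that "pain" is tested as a complete word, and LC_ALL=C is set for the range example so that ranges follow ASCII order regardless of locale.

```shell
# A bracket set matches exactly one character from the set.
sets=$(printf 'pin\npan\npen\npun\npain\n' | grep -x 'p[iae]n')
printf '%s\n' "$sets"

# A leading caret negates the set: select lines that contain at least
# one character that is neither a letter nor a digit.
odd=$(printf '%s\n' 'abc123' 'abc 123' 'hello!' | LC_ALL=C grep '[^A-Za-z0-9]')
printf '%s\n' "$odd"

# Hypothetical address-book lines: "606" must be followed by two digits.
zips=$(printf '%s\n' 'Chicago, IL 60637' 'Evanston, IL 60201' 'Phone 555-6061' | grep '606[0-9][0-9]')
printf '%s\n' "$zips"
```

Because grep looks for the pattern anywhere on the line, the last pattern would also select a line containing a longer number such as 160612; the pattern by itself does not know what a zip code is.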
In a trn killfile,
would junk articles whose subject lines contained no lowercase letters (`c' is a case-sensitive search).
For more information
Rules are derived from the ed(1) man pages; some examples are based on The Unix Programming Environment, by Brian W. Kernighan and Rob Pike. You can consult both to learn more.
More on regular expressions appears in:
- A Tutorial Introduction to the ed Text Editor, Brian W. Kernighan.
- Advanced Editing on Unix, Brian W. Kernighan.
- The ex(1), sed(1), and grep(1) man pages. (Other tools, such as `awk,' `egrep,' and `perl,' use regular expressions in extended or nonstandard ways; it's best to start with the basics.)
- The printed Berkeley Unix manuals.