Regular Expressions Before we go into what Regular Expressions (which we will refer to as "RE's" from now on) are, let us clear up a minor problem that can cause a lot of confusion. Certain characters in RE's are called metacharacters or special characters. Examples of such characters are '*' and '?'. Their occurence means something other than their literal value. For example, the pattern 'a*' means all strings beginning with the letter 'a' and consisting of zero or more occurrences of 'a'. It is important to realize that RE's are _not_ the same as wildcards. These are pattern specifiers used by other basic utilities present on Unix systems. For example, the 'ls' program uses wildcards, but the interpretation of '*' in a wildcard expression is different. The command ls aa* means list all files whose filename begins with the string "aa" and is followed by zero or more occurrences of _any_ character. Thus, the string 'aadvark' would match the wildcard expression, but 'apple' would not. But neither 'aadvark' or 'apple' would be matched by the RE 'aa*' because both strings do not consist entirely of the character 'a'. An even more confusing character is the '$' character. The expression '$a' would be interpreted by the shell as the shell variable 'a'. The expression '$a' does not make any sense when used as an RE, since '$' is used to indicate the end of a line. The Unix utility 'sed' uses the '$' character in different ways. In an RE, it has the standard RE meaning - end of line, but when used as a specifier for line numbers, it indicates the last line in a file! The moral of the story is that you need to examine the context in which a special character is being used in order to determine the behaviour it will produce. Typically, a program using regular expressions will mention the fact in its man-page / documentation. Alright. Lets get on with how to specify an RE. The special characters used in RE's are '.', '*', '?', '+', '(', ')', '{', '}', '[', ']', '^', '$', and '|'. All other characters mean exactly what they are. The '.' character will match any single character. The pattern 'a.c' will match the string 'abc'. It will also match 'adc', or 'aqc', or any string beginning with 'a', ending with 'c' and having exactly one character inbetween. The RE '.abc...' would match the word 'labcoat'. Repetition characters : ----------------------- The '?', '*', and the '+' characters are repetition operators. The '?' will match zero or one instance of the preceding item. '*' will match zero or more instances of the preceding item. And '+' will match one or more instances of the preceding item. Thus, the strings 'ac', and 'aac' will be matched by the pattern 'aa?c'. There are no other strings which will be matched by this RE. 'aa*c' however, will match all strings beginning with 'a' and ending with 'c', provided that all characters (if any) between these two are "a"'s. It should be obvious that the RE 'a+' is equivalent to the RE 'aa*'. Example 1 : The RE '.n+..?.?v*.' matches which of the following strings? a. knives e. supernova b. involve f. innovate c. snail g. inactive d. anvil h. nerve The answer is that the first four patterns will match, and the last four wont. Lets discuss the above answer a bit. For example, why would 'snail' match the RE even though it doesnt have a 'v' in it? Because the repetition character for 'v' is '*' and this means that there may be zero or more 'v' characters occurring in the string. Thus after the 'n', if there are no 'v' characters in the string, then any characters (1 to 4 in number) following it would match the RE. The match for 'anvil' is a bit more complicated. The RE seems to indicate that there must be atleast one character between the 'n' and the 'v' in the string being matched - and thus 'anvil' should not match. However, 'v' can also be matched by the '.' in the RE. So the match would take place as shown below : . n+ . .? .? v* . | | | | | | | 'a' | | | | | 'l' | | | | | one instance | | | zero instances | | | one instance of 'v'| 'i' - one instance | zero instances Note that whether the first ',?' repeats zero times or whether it is the second one that does this, the result is the same. If an RE _can_ lead to a match, even though another way of interpreting the pattern would cause a match failure, then the RE would still be said to match the string. 'involve' would match for the same reasons. The 'vol' in 'involve' would match the '..?.?' part of the RE. The same for 'knives'. The 'ives' would match the '..?.?v*.' part of the RE. The reason for the rejects are straightforward. (e) and (h) do not match because of the wrong number of characters that occur before the first 'n' in the string. (f) does not match because there are too many characters after the 'v' - there is no way for '..?.?v*.' to match 'ovate'. (g) does not match because there are too many characters between the 'n' and the 'v'. Example 2 : We are trying to match the string "annotation". What is wrong with the following RE's? a. *nnotation b. an*n c. an?otation d. Annotation (a) is wrong because '*' is a repetition character, and there is no character occurring before it which it repeats. This is a syntactically wrong usage. The correct usage would have been '.*nnotation' or '.nnotation'. (b) and (c) are classic mistakes often made by people who fail to realise that the '*' and '?' characters as used in RE's differ in functionality from the wildcard '*' and '?' characters used in programs like 'ls'. In (b), the '*' means zero or more occurrences of 'n'. Thus, that RE would match 'an', 'ann', or 'annnnnn' - but it will not match 'annotation' since there are characters other than 'n' between 'nn' and the final 'n'. In (c), the '?' means zero or one occurrence of 'n', and not a placeholder for a character (as is the case in 'dir' command of DOS). (d) is wrong simply because RE's are case-sensitive. Lists and ranges : ---------------------------- A list of characters enclosed by '[' and ']' matches any single character in that list. For example, the RE 'a[bcd]e' will match the strings 'abe' or 'ace' or 'ade'. It will not match the string 'abce' because exactly one of the characters enclosed between '[' and ']' can be used for matching. Lists can be specified in a simpler manner by specifying ranges. For example, a shorter way of saying '[0123456789]' is '[0-9]'. '[abcde]' can be specified as '[a-e]'. Note that the '-' character is not a special character : it only assumes a special meaning when it is between two characters in a list. The RE '[-v-x]', for instance, would mean a match on any one of the characters '-', 'v', 'w', or 'x'. Example 3 : Social Security numbers in the US are written in the form nnn-nn-nnnn - where 'n' stands for a single digit. An RE that can match a Social Security number would be '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'. An expression like '[0-9]*-[0-9]*-[0-9]*' would also match a Social Security number, but this RE would also match a string like '3984723-1-2324' or even the string '--' which do not constitute valid Social Security numbers. Example 4 : Lets say that you have a program that uses Regular Expressions to specify search patterns ('egrep' is an example of such a program). One would like to search for all proper nouns occurring in a document. If one makes the assumption that all proper nouns begin with an upper-case alphabetic character, then, an RE to find proper nouns would be : [A-Z][a-z]+ This assumes that we are not searching for acronyms, and that proper nouns are more than one character in length. If one needs to search for acronyms (all upper-cased characters) also, then the RE would be : [A-Z][A-Za-z]+ If the list of characters begins with '^', then it means that the bracketed expression will match every character except the ones listed within the brackets. Thus the RE '[^0-9]+.*' will match any string that doesnt begin with numbers. Note : The technical jargon for the lists described above is 'bracket expressions'. Character classes : ------------------- Lists can sometimes be represented in a standard format called 'character class'. These are better from a documentation standpoint. They are also better when considering code that needs to work in different collating sequences. For example, in EBCDIC, the collating sequence of characters is different from that of ASCII, and the range '[A-Z]' will include a different choice of characters than the ASCII set does. This problem can be averted by using the character class 'upper'. The way to use this is by saying : [[:upper:]] The standard character classes are 'alnum', 'alpha', 'blank', 'cntrl', 'digit' , 'graph', 'lower', 'print', 'punct', 'space', 'upper', and 'xdigit'. These are the same as what you would expect from the is... functions defined in the C standard library (see ctypes.h). Thus, the example 4 solution could be rewritten as : [[:upper:]][[:upper:][:lower:]]+ A character class may not be used as the end-point of a range. Bounds : -------- Bounds can be regarded as a type of repetition character. The a bound can be written as : {a,b} where a, and b are numbers. The ',' and 'b' are optional, but if b is present, then, the ',' must be present. The numbers can take values from 0 to 255, and a <= b. '{a}' means that the preceding list or character will be matched exactly 'a' number of times. '{a,}' means 'a' or more occurrences of the preceding list or character will be matched. And '{a,b}' will match 'a' through 'b' occurrences of the preceding list or character. From the above, it must be obvious that : '{0,1}' is equivalent to '?' . '{1,}' is equivalent to '+' . '{0,}' is equivalent to '*' . Example 5 : The RE in example 3 can be rewritten as : [0-9]{3}-[0-9]{2}-[0-9]{4} or in a more portable way as : [[:digit:]]{3}-[[:digit:]]{2}-[[:digit:]]{4} Escape characters : ------------------- Lets say that you want to use one of the special characters in a literal sense. For example, suppose you want to match strings which contain a '?' character in them. Certainly, the RE '.*?.*' would not work - it is illegal from a syntax point of view, since one cannot apply a repetition character to a repetition character. The way to solve this problem is by "escaping" from the normal sense of the character. This is done by prefixing the character in question with the backslash ('\') character. Thus, the RE, above should be written as : .*\?.* In the context of RE's, the '\' character is referred to as the "escape" character. It is valid for exactly one character. If you are trying to match all strings containing the string '?*', the RE .*\?*.* is illegal and will not work because the '\' applies only to the '?' and not to any characters following it. The correct RE in this case is : .*\?\*.* If the escape character precedes a character that is not a special character, then it will be ignored. Thus '\a' will be interpreted as 'a'. It is also illegal to end an RE with a '\'. To represent a literal '\', write '\\'. All special characters including '\' lose their special sense when enclosed withing '[' and ']' in a list. Thus the RE '[ab*\]' would match any of the characters 'a', 'b', '*' and '\'. There are some caveats to special characters appearing in lists - please see the section "Advanced Topics" for more information on this. Example 6 : ----------- Let us examine some of the RE's that can be written to match the string : Hillary ***scares*** Bill a. 'Hillary.*Bill' : this is the probably the loosest RE that can be written for this string, since it will match anything betweeen the words "Hillary" and "Bill", thus also matching "Hillary likes Bill". b. 'Hillary \**scares\** Bill' : this is a tighter match, but it will match strings like "Hillary scares Bill" and "Hillary *scares* Bill". A slightly better RE would be 'Hillary \*+scares\*+ Bill'. c. 'Hillary \*\*\*scares\*\*\* Bill' is an exact match. d. The RE above can be rewritten as 'Hillary \*{3}scares\*{3} Bill'. Grouping and alternation : -------------------------- Let suppose that you want to match all occurrences of the string : ha-ha-ha-ha-ha One way of doing it is, of course, to write it out just like that. A more compact way of writing it would be to group the '-ha' substring together and specify a suitable repetition factor. This is how it could be done : ha(-ha){4} The '(' and the ')', when used in this manner will cause the repetition factor to apply to the enclosed string. The grouping is essential to specify the OR condition in patterns. If you wanted to match strings which had "Bill" or "Hillary" in them, you would write it as (Bill)|(Hillary) where the '|' character implements the logical OR. Example 7 : Lets say that you have a document which contains URL's in them. You want to match all lines containing the URL's. Here is an RE for it : http://([[:alphanum:]_-]+\.){1,4}[[:alphanum:]_-]+(/[[:alphanum:].]+/*)* where we make the assumption that the URL main site can have at most 5 components. Example 8 : The RE for the full pathname of a file in Unix is : /([[:alphanum:]._+-&*]+/)*([[:alphanum:]._+-&*]+)* where, for convenience sake we assume that some of the special characters (like '[' for instance) cannot be used in a filename. Example 9 : Here is an RE for the name of a person in titular form : (Mr\.)|(Mrs\.)|(Ms) [A-Z][a-z]* (([A-Z][a-z]*)[- ]*)+ Anchors : --------- When placed at the beginning of an RE and not in a list, the '^' character matches the start of line. The '$' character when used at the end of an RE, will match the end of a line. These characters are called "anchors" because they "anchor" the RE to either the beginning or end of a line. Thus the RE '^a.*' will match every line that begins with 'a', and the RE '.*a$' will match every line ending in 'a'. Basic vs Extended Regular Expressions : --------------------------------------- In the infancy of Unix, when regular expressions were introduced, things were simple : there were not as many special characters as there are now. The '|', '+', and the '?' characters did not have any special meanings. Also, the '{', '}', '(', and ')' characters behaved the reverse of how they do now : by themselves, they were ordinary characters. But when escaped, the '{' and '}' acted as delimiters for bounds, and '(' and ')' acted as delimiters for subexpressions. There were also other differences - but this older style of regular expressions are referred to as "Basic regular expressions". By contrast, the RE's that are commonly in use today are called "Extended regular expressions". Newer utilities that are being created conform more to extended RE's than the basic ones. However, some utilities which date from the early days of Unix are still around (for example - grep), and these continue to use basic RE's. The 'egrep' tool was written to provide a version of 'grep' that used extended regular expressions. One important feature present in Basic Regular Expressions but not present in Extended Regular Expressions, is the backreference. This is the '\' character followed by a non-zero digit 'd'. '\d' would match the same sequence of characters matched by the d'th paranthesized subexpression. The subexpressions are numbered by the positions of their opening parantheses, left to right. Backreferences are ambiguous in nature. For example, in the Basic Regular Expression 'a\(\(b\)*\2\)*d' the backreference is part of the first sub-expression which contains the second (which is referred to). Thus it is not clear whether this RE will match 'abbbd'. Example 10 : Here is how the RE for a dollar amount in the million dollar range would be written using Extended Regular Expressions : \$[0-9](,[0-9]{3}){2} when written using Basic Regular Expressions, it would be : $[0-9]\(,[0-9]\{3\}\)\{2\} Note that the '$' does not have to be escaped in the Basic RE because '$' is not a special character unless it appears at the end of an RE or at the end of a sub-expression. Advanced Topics : ----------------- There exists character sets for which multiple byte codes are used to represent a single character. DBCS is an example of this. For such character sets, the difference between a collating element and a character is very important. Since multiple characters represent one character, there could be a problem when one tries to match a collating element using an RE, and one uses lists. For example, if the character sequence 'qz' corresponds to a collating element, the RE '[qz]' will not have the intended effect. The correct way to handle this is to write '[[.qz.]]'. The '[.' and the '.]' bracketing sequence is intended to enclose a character sequence that represents a single collating element. This method can be used to specify the RE when '-' is the start point of a range. '[[.-.]-]' is how this is written. Also, in order to specify ']' as an element in a list, it has to be the first character in the list. Thus the RE '[]a]' is valid and will match either ']' or 'a'. This also means that the RE '[][]' actually matches '[' or ']'. From this, it is evident that there is no way to distinguish the above case from when two empty lists are concatenated together, and hence an empty list is illegal. On the other hand the empty sub-expression '()' is legal and matches the null string. When specifying the exclusion list using '^' and ']' is one of the characters in the exlusion list, the '^' should be written first, followed by ']'. The RE's '[[:<:]]' and '[[:>:]]' are not character classes but are RE's that can be used to match the beginning and end of words, respectively. A word is a string comprising of alphanumeric characters and / or underscores. This is an extension that is compatible with the standard, but is not defined in it. (C) 1999 Kenneth Stephen