Regular Expressions 


Before we go into what Regular Expressions (which we will refer to as "RE's"
from now on) are, let us clear up a minor problem that can cause a lot of 
confusion. 

Certain characters in RE's are called metacharacters or special characters. 
Examples of such characters are '*' and '?'. Their occurence means something 
other than their literal value. For example, the pattern 'a*' means all 
strings beginning with the letter 'a' and consisting of zero or more 
occurrences of 'a'.

It is important to realize that RE's are _not_ the same as wildcards. These 
are pattern specifiers used by other basic utilities present on Unix systems.
For example, the 'ls' program uses wildcards, but the interpretation of '*' in
a wildcard expression is different. The command 

	ls aa* 

	means list all files whose filename begins with the string "aa" and is
followed by zero or more occurrences of _any_ character. Thus, the string 
'aadvark' would match the wildcard expression, but 'apple' would not. But  
neither 'aadvark' or 'apple' would be matched by the RE 'aa*' because both 
strings do not consist entirely of the character 'a'.

An even more confusing character is the '$' character. The expression '$a' 
would be interpreted by the shell as the shell variable 'a'. The expression 
'$a' does not make any sense when used as an RE, since '$' is used to indicate 
the end of a line. The Unix utility 'sed' uses the '$' character in different 
ways. In an RE, it has the standard RE meaning - end of line, but when used as
a specifier for line numbers, it indicates the last line in a file!

The moral of the story is that you need to examine the context in which a 
special character is being used in order to determine the behaviour it will 
produce. Typically, a program using regular expressions will mention the fact
in its man-page / documentation.

Alright. Lets get on with how to specify an RE.

The special characters used in RE's are '.', '*', '?', '+', '(', ')', '{', 
'}', '[', ']', '^', '$',  and '|'. All other characters mean exactly what 
they are. 

The '.' character will match any single character. The pattern 'a.c' will 
match the string 'abc'. It will also match 'adc', or 'aqc', or any string 
beginning with 'a', ending with 'c' and having exactly one character 
inbetween. The RE '.abc...' would match the word 'labcoat'.


Repetition characters :
-----------------------

The '?', '*', and the '+' characters are repetition operators. The '?' will
match zero or one instance of the preceding item. '*' will match zero or more
instances of the preceding item. And '+' will match one or more instances of 
the preceding item.

Thus, the strings 'ac', and 'aac' will be matched by the pattern 'aa?c'. There 
are no other strings which will be matched by this RE. 'aa*c' however, will 
match all strings beginning with 'a' and ending with 'c', provided that all 
characters (if any) between these two are "a"'s.

It should be obvious that the RE 'a+' is equivalent to the RE 'aa*'.

Example 1 :

	The RE '.n+..?.?v*.' matches which of the following strings?

a.	knives		e.	supernova
b.	involve		f.	innovate
c.	snail		g.	inactive
d.	anvil		h.	nerve

	The answer is that the first four patterns will match, and the last
four wont.

Lets discuss the above answer a bit. For example, why would 'snail' match the
RE even though it doesnt have a 'v' in it? Because the repetition character 
for 'v' is '*' and this means that there may be zero or more 'v' characters 
occurring in the string. Thus after the 'n', if there are no 'v' characters in 
the string, then any characters (1 to 4 in number) following it would match the 
RE.

The match for 'anvil' is a bit more complicated. The RE seems to indicate that 
there must be atleast one character between the 'n' and the 'v' in the string 
being matched - and thus 'anvil' should not match. However, 'v' can also be 
matched by the '.' in the RE. So the match would take place as shown below :

	.   n+  .  .?  .?  v*  .
        |   |   |  |   |   |   |
       'a'  |   |  |   |   |  'l'
            |   |  |   |   |   
   one instance |  |   | zero instances
                |  |   | 
one instance of 'v'| 'i' - one instance
                   |
            zero instances

Note that whether the first ',?' repeats zero times or whether it is the 
second one that does this, the result is the same. If an RE _can_ lead to
a match, even though another way of interpreting the pattern would cause a 
match failure, then the RE would still be said to match the string.

'involve' would match for the same reasons. The 'vol' in 'involve' would match 
the '..?.?' part of the RE. The same for 'knives'. The 'ives' would match the 
'..?.?v*.' part of the RE.
 
The reason for the rejects are straightforward. (e) and (h) do not match 
because of the wrong number of characters that occur before the first 'n' in 
the string. (f) does not match because there are too many characters after the 
'v' - there is no way for '..?.?v*.' to match 'ovate'. (g) does not match 
because there are too many characters between the 'n' and the 'v'.

Example 2 :

We are trying to match the string "annotation". What is wrong with the 
following RE's?

a.	*nnotation
b.	an*n
c.	an?otation
d.	Annotation

(a) is wrong because '*' is a repetition character, and there is no character 
occurring before it which it repeats. This is a syntactically wrong usage. The 
correct usage would have been '.*nnotation' or '.nnotation'.

(b) and (c) are classic mistakes often made by people who fail to realise that 
the '*' and '?' characters as used in RE's differ in functionality from the 
wildcard '*' and '?' characters used in programs like 'ls'. In (b), the '*' 
means zero or more occurrences of 'n'. Thus, that RE would match 'an', 'ann', 
or 'annnnnn' - but it will not match 'annotation' since there are characters 
other than 'n' between 'nn' and the final 'n'. In (c), the '?' means zero or 
one occurrence of 'n', and not a placeholder for a character (as is the case 
in 'dir' command of DOS).

(d) is wrong simply because RE's are case-sensitive.


Lists and ranges :
----------------------------

A list of characters enclosed by '[' and ']' matches any single character in 
that list. For example, the RE 'a[bcd]e' will match the strings 'abe' or 'ace' 
or 'ade'. It will not match the string 'abce' because exactly one of the 
characters enclosed between '[' and ']' can be used for matching.

Lists can be specified in a simpler manner by specifying ranges. For example, 
a shorter way of saying '[0123456789]' is '[0-9]'. '[abcde]' can be specified 
as '[a-e]'.

Note that the '-' character is not a special character : it only assumes a 
special meaning when it is between two characters in a list. The RE '[-v-x]', 
for instance, would mean a match on any one of the characters '-', 'v', 'w', 
or 'x'.

Example 3 :

Social Security numbers in the US are written in the form nnn-nn-nnnn - where 
'n' stands for a single digit. An RE that can match a Social Security number 
would be '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'.

An expression like '[0-9]*-[0-9]*-[0-9]*' would also match a Social Security 
number, but this RE would also match a string like '3984723-1-2324'
or even the string '--' which do not constitute valid Social Security 
numbers.

Example 4 :

Lets say that you have a program that uses Regular Expressions to specify
search patterns ('egrep' is an example of such a program). One would like to
search for all proper nouns occurring in a document. If one makes the 
assumption that all proper nouns begin with an upper-case alphabetic
character, then, an RE to find proper nouns would be :

	[A-Z][a-z]+                   

This assumes that we are not searching for acronyms, and that proper nouns are 
more than one character in length. If one needs to search for acronyms (all 
upper-cased characters) also, then the RE would be :

	[A-Z][A-Za-z]+

If the list of characters begins with '^', then it means that the bracketed
expression will match every character except the ones listed within the
brackets. Thus the RE '[^0-9]+.*' will match any string that doesnt begin with
numbers.

Note : The technical jargon for the lists described above is 'bracket 
expressions'.


Character classes :
-------------------

Lists can sometimes be represented in a standard format called 'character 
class'. These are better from a documentation standpoint. They are also better
when considering code that needs to work in different collating sequences.
For example, in EBCDIC, the collating sequence of characters is different from
that of ASCII, and the range '[A-Z]' will include a different choice of 
characters than the ASCII set does. This problem can be averted by using the
character class 'upper'. The way to use this is by saying :

	[[:upper:]]

The standard character classes are 'alnum', 'alpha', 'blank', 'cntrl', 'digit'
, 'graph', 'lower', 'print', 'punct', 'space', 'upper', and 'xdigit'. These 
are the same as what you would expect from the is... functions defined in the 
C standard library (see ctypes.h).

Thus, the example 4 solution could be rewritten as :

	[[:upper:]][[:upper:][:lower:]]+

A character class may not be used as the end-point of a range.


Bounds :
--------

Bounds can be regarded as a type of repetition character. The a bound can be
written as :

	{a,b}
	where a, and b are numbers. The ',' and 'b' are optional, but if b is 
present, then, the ',' must be present. The numbers can take values from 0 to 
255, and a <= b.

	'{a}' means that the preceding list or character will be matched 
exactly 'a' number of times. '{a,}' means 'a' or more occurrences of the 
preceding list or character will be matched. And '{a,b}' will match 'a' 
through 'b' occurrences of the preceding list or character.

	From the above, it must be obvious that :

	'{0,1}' is equivalent to '?' .
	'{1,}' is equivalent to '+' .
	'{0,}' is equivalent to '*' .


Example 5 :

The RE in example 3 can be rewritten as :

	[0-9]{3}-[0-9]{2}-[0-9]{4}

or in a more portable way as :

	[[:digit:]]{3}-[[:digit:]]{2}-[[:digit:]]{4}


Escape characters :
-------------------

Lets say that you want to use one of the special characters in a literal 
sense. For example, suppose you want to match strings which contain a '?'
character in them. Certainly, the RE '.*?.*' would not work - it is illegal
from a syntax point of view, since one cannot apply a repetition character to
a repetition character.

The way to solve this problem is by "escaping" from the normal sense of the
character. This is done by prefixing the character in question with the 
backslash ('\') character. Thus, the RE, above should be written as :

	.*\?.*

In the context of RE's, the '\' character is referred to as the "escape"
character. It is valid for exactly one character. If you are trying to match
all strings containing the string '?*', the RE

	.*\?*.*

	is illegal and will not work because the '\' applies only to the '?'
and not to any characters following it. The correct RE in this case is :

	.*\?\*.*

If the escape character precedes a character that is not a special character,
then it will be ignored. Thus '\a' will be interpreted as 'a'. It is also
illegal to end an RE with a '\'. 

To represent a literal '\', write '\\'.

All special characters including '\' lose their special sense when enclosed
withing '[' and ']' in a list. Thus the RE '[ab*\]' would match any of the
characters 'a', 'b', '*' and '\'. There are some caveats to special characters
appearing in lists - please see the section "Advanced Topics" for more 
information on this.

Example 6 :
-----------

Let us examine some of the RE's that can be written to match the string :

	Hillary ***scares*** Bill

	a.  'Hillary.*Bill' : this is the probably the loosest RE that can be
	written for this string, since it will match anything betweeen the
	words "Hillary" and "Bill", thus also matching "Hillary likes Bill".
	b.  'Hillary \**scares\** Bill' : this is a tighter match, but it will
	match strings like "Hillary scares Bill" and "Hillary *scares* Bill".
	A slightly better RE would be 'Hillary \*+scares\*+ Bill'.
	c.  'Hillary \*\*\*scares\*\*\* Bill' is an exact match.
	d.  The RE above can be rewritten as 'Hillary \*{3}scares\*{3} Bill'.


Grouping and alternation :
--------------------------

Let suppose that you want to match all occurrences of the string :

	ha-ha-ha-ha-ha 

One way of doing it is, of course, to write it out just like that. A more
compact way of writing it would be to group the '-ha' substring together and
specify a suitable repetition factor. This is how it could be done :

	ha(-ha){4}

The '(' and the ')', when used in this manner will cause the repetition factor
to apply to the enclosed string.

	The grouping is essential to specify the OR condition in patterns. If
you wanted to match strings which had "Bill" or "Hillary" in them, you would
write it as 

	(Bill)|(Hillary)

	where the '|' character implements the logical OR.

Example 7 :

Lets say that you have a document which contains URL's in them. You want to
match all lines containing the URL's. Here is an RE for it :

    http://([[:alphanum:]_-]+\.){1,4}[[:alphanum:]_-]+(/[[:alphanum:].]+/*)*

	where we make the assumption that the URL main site can have at most
5 components.

Example 8 :

The RE for the full pathname of a file in Unix is :

	/([[:alphanum:]._+-&*]+/)*([[:alphanum:]._+-&*]+)*

	where, for convenience sake we assume that some of the special 
characters (like '[' for instance) cannot be used in a filename.

Example 9 :

Here is an RE for the name of a person in titular form :

	(Mr\.)|(Mrs\.)|(Ms) [A-Z][a-z]* (([A-Z][a-z]*)[- ]*)+


Anchors :
---------

When placed at the beginning of an RE and not in a list, the '^' character
matches the start of line. The '$' character when used at the end of an RE,
will match the end of a line. These characters are called "anchors" because
they "anchor" the RE to either the beginning or end of a line.

Thus the RE '^a.*' will match every line that begins with 'a', and the RE 
'.*a$' will match every line ending in 'a'.


Basic vs Extended Regular Expressions :
---------------------------------------

In the infancy of Unix, when regular expressions were introduced, things were
simple : there were not as many special characters as there are now. The '|',
'+', and the '?' characters did not have any special meanings. Also, the '{',
'}', '(', and ')' characters behaved the reverse of how they do now : by 
themselves, they were ordinary characters. But when escaped, the '{' and '}' 
acted as delimiters for bounds, and '(' and ')' acted as delimiters for
subexpressions. There were also other differences - but this older style of 
regular expressions are referred to as "Basic regular expressions".

By contrast, the RE's that are commonly in use today are called "Extended 
regular expressions". Newer utilities that are being created conform more to 
extended RE's than the basic ones. However, some utilities which date from 
the early days of Unix are still around (for example - grep), and these 
continue to use basic RE's. The 'egrep' tool was written to provide a version
of 'grep' that used extended regular expressions.

One important feature present in Basic Regular Expressions but not present in
Extended Regular Expressions, is the backreference. This is the '\' character
followed by a non-zero digit 'd'. '\d' would match the same sequence of 
characters matched by the d'th paranthesized subexpression. The subexpressions
are numbered by the positions of their opening parantheses, left to right.

Backreferences are ambiguous in nature. For example, in the Basic Regular
Expression 'a\(\(b\)*\2\)*d' the backreference is part of the first 
sub-expression which contains the second (which is referred to). Thus it is
not clear whether this RE will match 'abbbd'.

Example 10 :

Here is how the RE for a dollar amount in the million dollar range would be
written using Extended Regular Expressions :

	\$[0-9](,[0-9]{3}){2}

	when written using Basic Regular Expressions, it would be :

	$[0-9]\(,[0-9]\{3\}\)\{2\}

Note that the '$' does not have to be escaped in the Basic RE because '$' is
not a special character unless it appears at the end of an RE or at the end of
a sub-expression.


Advanced Topics :
-----------------

There exists character sets for which multiple byte codes are used to 
represent a single character. DBCS is an example of this. For such character
sets, the difference between a collating element and a character is very
important. Since multiple characters represent one character, there could be
a problem when one tries to match a collating element using an RE, and one 
uses lists.

For example, if the character sequence 'qz' corresponds to a collating
element, the RE '[qz]' will not have the intended effect. The correct way to
handle this is to write '[[.qz.]]'. The '[.' and the '.]' bracketing sequence
is intended to enclose a character sequence that represents a single collating
element.

This method can be used to specify the RE when '-' is the start point of a 
range. '[[.-.]-<range endpoint char>]' is how this is written.

Also, in order to specify ']' as an element in a list, it has to be the first
character in the list. Thus the RE '[]a]' is valid and will match either ']'
or 'a'. This also means that the RE '[][]' actually matches '[' or ']'. From
this, it is evident that there is no way to distinguish the above case from 
when two empty lists are concatenated together, and hence an empty list is
illegal.

On the other hand the empty sub-expression '()' is legal and matches the null
string. 

When specifying the exclusion list using '^' and ']' is one of the characters
in the exlusion list, the '^' should be written first, followed by ']'.

The RE's '[[:<:]]' and '[[:>:]]' are not character classes but are RE's that
can be used to match the beginning and end of words, respectively. A word is
a string comprising of alphanumeric characters and / or underscores. This is
an extension that is compatible with the standard, but is not defined in it.


(C) 1999 Kenneth Stephen