# Assignment #2 Regular expressions SOLUTION ATTACHED

Introduction

Regular expressions (abbreviated to regex, the pronunciation of which gives rise to endless ame wars. . . ) are used in various programming languages and utilities to match entire classes of strings. This assignment will give you experience modelling a regular expression as a tree, and detecting which strings match a given regular expression.
We won't favour any particular regex language or utility, such as Python, Java, or Linux's grep, but use a stripped-down simpli ed regular expression form that contains all the essential principles. In particular, you are not permitted to import Python's regular expression module are into any file you submit.
We'll only be dealing with a ternary alphabet, i.e., f0; 1; 2g. Generalizing to an arbitrary alphabet is straightforward, but doesn't buy us much for the purpose of this assignment. The elementary regex symbols we'll use can be easily typed in Python source code. A regex (over ternary alphabet f0; 1; 2g) is a nonempty
string made up of the following symbols.
'0' called zero
'1' one
'2' two
'e' pronounced \ee'' or \epsilon''
'|' bar
'.' dot
'*' star
'(' left parenthesis, or left
')' right parenthesis, or right
1
There are several rules determining that a string made up of these symbols is a valid regular expression:
1. There are 4 regexes of length one. They are:
 '0'
 '1'
 '2'
 'e'
2. If r is a regular expression, then so is r + '*', where the plus symbol '+' means string concatenation, as in Python.
3. If r1 and r2 are regexes, then so are:
 '(' + r1 + '|' + r2 + ')'
 '(' + r1 + '.' + r2 + ')'
Here are some examples of regexes:
 '0'
 '1'
 '2'
 'e'
 '0*'
 '1*'
 '2*'
 'e*'
 '(0|1)'
 '(1.2)'
 '(e|0)'
 '(2.e)'
 '(0*|2*)'
 '((0.1).2)'
 '((1.(0|2)*).0)'

Tree Representation of Regexes

Every regex can be (uniquely) represented as a tree. Each leaf node contains exactly one of '0', '1', '2', or 'e'. Each internal node contains exactly one of '.', '|', or '*'.
A regex of length one is represented by a tree of one node containing the symbol in the regex. For example, the regex '0' is represented by the tree whose root is the leaf node containing '0'.
A regex of the form r + '*' is represented by a tree whose root node contains '*', and that node has one child which is the tree that represents the regex r. E.g., the regex '1*' is represented by the following tree:
'1'
'*'
A regex of the form '(' + r1 + '|' + r2 + ')' is represented by a tree whose root node contains '|', and that node has left and right children which are the trees that represent the regexes r1 and r2 respectively.
For example, the regex '(0|1)' is represented by the following tree:
'0' '1'
  TT
'|'
A regex of the form '(' + r1 + '.' + r2 + ')' is represented just as a regex of the form '(' + r1 + '|' + r2 + ')', except the root now contains '.' rather than '|'. For example, the regex '((0.1).0)' is represented by the following tree:
'0' '1'
  TT
'.' '0'
 
l
l
'.'
Here's an example that combines all concepts from above. The regex '((1.(0|1)*).0)' is represented by the following tree:
'1'
'0' '1'
  TT
'|'
'*'
,
, TT
'.' '0'
 
bbb
'.'

Matching Strings with Regexes

A ternary string is a string (possibly empty) that contains only the symbols '0', '1' and '2'. For a regex r and a ternary string s, we de ne below what it means for r to match s. (Equivalently we may also say thats matches r, or that r and s match).

1. A regex of length one matches exactly one string. Speci cally:
 the regex '0' matches the string '0'
 the regex '1' matches the string '1'
 the regex '2' matches the string '2'
 the regex 'e' matches the string '' (i.e., 'e' matches the empty string)

2. A regex of the form r + '*' matches string s if and only if either
(a) s equals '' (empty string), or
(b) s has the form s1 + s2 +    + sk where k 0 and r matches every si
For example, the regex '0*' matches any string (possibly empty) that contains no other symbols than the symbol '0'

3. A regex of the form '(' + r1 + '|' + r2 + ')' matches string s if and only if:
(a) r1 matches s, or
(b) r2 matches s, or
(c) both of the above
For example, the regex '(2|0*)' matches the string '2' as well as any string that contains only the symbol '0'

4. A regex of the form '(' + r1 + '.' + r2 + ')' matches a string s if and only if there are two strings s1 and s2 (each possibly empty) such that
(a) s is the concatenation of s1 and s2 (i.e., s equals s1 + s2), and
(b) r1 matches s1, and
(c) r2 matches s2.
For example, the regex '(1*.2)' matches any string that contains zero or more '1's, followed by exactly one '2'.
Here's an example that combines all concepts from above. The regex '((1.(0|1)*).2)' matches any string that starts with '1' and ends with '2', and has any number of '0's and '1's in between (including nothing in between).