lexical analyzer lab 1 Solution

An objective CS 236 is to help you write complex programs by using mathematical concepts as the basis for solving real world problems through computation. Mathematical thinking leads to programs that are clear, organized, and verifiably correct. This first projects shows you how discrete math structures form the basis, not only for writing the code, but for the code itself. It thus also provides a clear guide to maintaining and extending the code by any other programmer who also understands the discrete math structures.

 

You are expected to write code from scratch that clearly emulates the discrete math structures on which the solution is built. You are expected to justify your design and your code as one that is maintainable and extendable by other programmers who understand and are conversant with discrete mathematical structures. You should also learn that for complex programs, if you hack the code or formulate your code off of the discrete math structures, it will take you longer to write, it is less likely to be correct.

 

Please be aware that the project standards apply to this project and all projects in this course, and in particular for this projects, you may not use a regular expression library.

 

Description



 

A lexical analyzer groups characters in an input stream into tokens. A token is a sequence of one or more characters that form a single element of a language (e.g., a symbol, a numerical value, a string literal, or a keyword).

 

The project is to write a lexical analyzer for a subset of the Datalog language [http://en.wikipedia.org/wiki/Datalog]. The Lexer should output the UNDEFINED token if an undefined symbol is found. The input to your Lexer is a text file and the output is a print‑out of formatted tokens explained below. Please refer to the project standards for details on how to indicate the program input and where the output should be directed.

 

A token object is comprised of:

 

a string ‑ extracted from the file text



 

a number ‑ the line the token started on



 

a token type ‑ listed below



 

The lexical analyzer must use finite state machines to recognize tokens in the input streams. It is recommended that you consider using enumerated types to enumerate the potential states of the machine.

 

Token Types

 

You are to generate tokens from a text input stream for a subset of Datalog. The list of all token types that you will be expected to identify is below. You need to define the token types in your code.



 

http://wiki.cs.byu.edu/cs-236/lexical-analyzer                                                                                                                                                                                                                                                         1/5

3/23/2015
 
cs-236:lexical-analyzer[CS W iki]
 
 
 
 
Token Type
 
Description
 
Examples
 
 
 
 
 
 
COMMA
 
The ',' character
 
 
,
 
PERIOD
 
The '.' character
 
 
.
 
Q_MARK
 
The '?' character
 
 
?
 
LEFT_PAREN
 
The '(' character
 
 
(
 
RIGHT_PAREN
 
The ')' character
 
 
)
 
COLON
 
The ':' character
 
 
:
 
COLON_DASH
 
The string “:‑”
 
 
:‑
 
MULTIPLY
 
The ' * ' character
 
 
*
 
ADD
 
The ' + ' character
 
 
+
 
SCHEMES
 
The string “Schemes”
 
Schemes
 
FACTS
 
The string “Facts”
 
Facts
 
RULES
 
The string “Rules”
 
Rules
 
QUERIES
 
The string “Queries”
 
Queries
 
 
 
 
 
valid
invalid
 
ID
 
A letter followed by 0 or more letters or digits and is not a keyword (Schemes, Facts, Rules, Queries)
 
Identifier1
1stPerson
 
 
 
Person
Person_Name
 
 
 
 
 
 
 
 
 
 
 
Schemes
 
 
 
Any sequence of characters enclosed in single quotes. Two single quotes denote an apostrophe within the string. For line‑number
 
'This is a string'
 
STRING
 
counts, count all \n's within a string. A string token’s line number is the line where the string starts. If EOF is found before the end
 
' ' (The empty string)
 
 
 
of the string, it is undefined (see UNDEFINED token below).
 
 
 
 
 
 
 
'this isn''t two strings'
 
 
 
 
 
 
 
 
A line comment starts with # and ends at the next newline or EOF.
 
# This is a comment
 
 
 
 
 
#| This is a tricky
 
COMMENT
 
A block comment starts with a #| and ends with a |#. The comment's line number is the line where the comment started. Like
 
multiline comment |#
 
 
 
#| This is an illegal block
 
 
 
STRING, if EOF is found before the end of the block comment, it is UNDEFINED (see UNDEFINED token below).
 
comment
 
 
 
 
 
 
because it ends with EOF
 
 
 
 
 
 
 
 
WHITESPACE
 
Any character recognized by the isspace() function defined in ctype.h. Though we give it a token type here, these tokens will be
 
 
 
 
 
ignored and not saved. Be sure to count the line numbers when skipping over white space.
 
 
 
 
 
 
 
 
 
 
 
 
Any character not tokenized as a string, keyword, identifier, symbol, or white space is undefined. Additionally, any non‑
 
$&^   (Three   undefined
 
 
 
 
tokens)
 
 
UNDEFINED
 
terminating string or non‑terminating block comment is undefined. In both of the latter cases we reach EOF before finding the end
 
 
 
 
 
'any string that does not
 
 
 
of the string or the end of the block comment.
 
 
 
 
 
 
end
 
 
EOF
 
End of input file
 
 
 
 

Output Format



 

 

 

http://wiki.cs.byu.edu/cs-236/lexical-analyzer                                                                                                                                                                                                                                                         2/5

3/23/2015                                                                                                                                    cs-236:lexical-analyzer[CS W iki]

 

For this and all subsequent labs, you MUST match the formatting requirements exactly. Automated testing systems are used to check your code (similar to those used by professional software design companies) and it is desired that you experience writing code that matches a given protocol.

 

The expected output is the representation of a token, printed one per line, followed by a single line that states how many tokens there were. You should NOT print any information for a WHITESPACE token. Format tokens as follows:

 

( T o k e n _ T y p e , " V a l u e     "     ,     L     i     n     e     _     N     u m b e r )

 

Notice there are no spaces on either side of the commas separating the three token's elements. The token type should appear in this list, spelled exactly as specified (all capital letters). The text is the value of the token surrounded by double quotes. You should have only one token per line, and the last line should look like:

 

T o t a l   T  o k e     n     s             =             #     #             #

 

where ### is replaced by the number of tokens found (excluding whitespace). ALL OUTPUT IS CASE SENSITIVE. See the example below:

 

Example 1



 

For input (the line numbers are not part of the input–they are to help clarify new lines)



 

0
1
.
 
 
0
2
,
 
s   t   r   i   n   g   '
0
3
'   a
0
4
#   A
c   o   m   m   e   n   t
0
5
S
c
h   e   m   e   s
0
6
F   a   c   t   s   R   u   l   e   s
0
7
:
:
-
0
8
 
 
 


 

The Lexer should output



 

(   P   E   R   I   O   D   ,   "   .   "   ,   1   )
 
(   C   O   M   M   A   ,   "   ,   "   ,   2   )
 
(   S   T   R   I   N   G   ,   "   '   a
s   t   r   i   n   g   '   "   ,   3   )
(   C   O   M   M   E   N   T   ,   "   #   A
c   o   m   m   e   n   t   "   ,   4   )
(   S   C   H   E   M   E   S   ,   "   S   c   h   e   m   e   s   "   ,   5   )
(   I   D   ,   "   F   a   c   t   s   R   u   l   e   s   "   ,   6   )
(   C   O   L   O   N   ,   "   :   "   ,   7   )
 
(   C   O   L   O   N   _   D   A   S   H   ,   "   :   -   "   ,   7   )
(   E   O   F   ,
"   "   ,   8   )
=
9
T   o   t   a   l
T   o   k   e   n   s


 

Example 2



 

Input



 

http://wiki.cs.byu.edu/cs-236/lexical-analyzer                                                                                                                                                                                                                                                         3/5



3/23/2015
cs-236:lexical-analyzer[CS W iki]
0
1
Q   u   e   r   i   e   s   :
H   (   '   S   n   o   o   p   y   '   ,   R   ,   '   M   '   ,   H   )
0
2
I   s   I   n   R   o   o   m   A   t   D
0
3
#   S   c   h   e   m   e   s   F   a   c
t   s   R   u   l   e   s   <
0
4.
=
0
5
#   |   c   o   m   m   e   n   t
0
6
w   o   w   |   #
 
0
7
 
 


 

Output



 

( Q U E R I E S , " Q     u      e      r      i      e      s      "      ,      1              )



 

( C O L O N ,     "      :      "      ,      1              )



( I D , " I s I n R o     o      m      A      t      D      H      "      ,      2              )



( L E F T _ P A R  E      N      ,      "      (      "      ,      2              )



( S T R I N G , " ' S     n      o      o      p      y      '      "      ,      2              )



( C O M M A ,     "      ,      "      ,      2              )



(
I
D
,
"
R
"
,
2
)
(   C   O   M   M   A   ,   "   ,   "   ,   2   )
(   S   T   R   I   N   G   ,   "   '   M   '   "   ,   2   )
(   C   O   M   M   A   ,   "   ,   "   ,   2   )
(
I
D
,
"
H
"
,
2
)


(
R
I
G
H
T
_
P
A
R
E
N
,
"
)
"
,
2
)
a   c   t   s   R   u   l   e   s   <   "   ,   3   )
(
C
O
M
M
E
N
T
,
"
#
S
c
h
e
m
e
s
F


(   P   E   R   I   O   D   ,   "   .   "   ,   4   )
 
e
n   t
   =
(   C   O   M   M   E   N   T   ,   "   #   |   c   o   m   m
w
o
w
|
#
"
,
5
)
)
 
 
 
 
 
 
(
E
O
F
,
"
"
,
7
n   s
=
1
6
 
 
T   o   t   a   l
 
T   o   k   e
 
 


 

Example 3



 

Input



 

0
1
Q
u
e
r
i
e   s
:
 
 
 
 
0
2
I   s   I   n   R   o   o   m   A   t   D   H   (   '   S   n   o   o   p   y   '   ,   R   ,   '   M   '   H   )   ?
0
3
R   u   l   e   s
F   a   c   t   s
S   c   h   e   m   e   s
 
0
4
&   @   Q   u   e   r   i   e   s
 
 
 
 
0
5
:
h
e
l
l
o
 
 
 
 
 
0
6
'
A
'   t   h   i   s
h   a   s
a
0
7
I
e
a   m   '
$
0
8
R
t
u
r
n
 
 
n   e   a   r
 
 
0
9
T   h   e
 
e   n   d   '   '   s
 
 
 
1
0
 
 
 
 
 
 
 
 
 
 
 


 

Output



 

(   Q   U   E   R   I   E   S   ,   "   Q   u   e   r   i   e   s   "   ,   1   )
(   C   O   L   O   N   ,   "   :   "   ,   1
)
 
 
 
 
 
 
)
(   I   D   ,   "   I   s   I   n   R   o   o   m   A   t   D   H   "   ,   2
(
L
E
F
T
_
P
A
R
E
N
,   "   (
"
,
2
)
,   2
)
(
S
T
R
I
N
G
,
"
'
S
n
o
o
p
y
'
"


(   C   O   M   M   A   ,   "   ,   "   ,
2
)
(
I
D
,
"
R
"
,
2
)
,
2
)
(
C
O
M
M
A
,
"
,
"


 

http://wiki.cs.byu.edu/cs-236/lexical-analyzer                                                                                                                                                                                                                                                         4/5

3/23/2015
 
 
cs-236:lexical-analyzer[CS W iki]
(   S   T   R   I   N
G   ,   "   '   M   '   "   ,   2   )
 
(   I   D   ,   "
H
"   ,   2   )
 
(   R   I   G   H   T
_   P   A   R   E   N   ,   "   )   "   ,   2   )
(   Q   _   M   A   R
K   ,   "   ?   "   ,   2   )
)
(   R   U   L   E   S
,   "   R   u   l   e   s   "   ,   3


(   F   A   C   T   S   ,   "   F   a   c   t   s   "   ,   3   )
 
 
(   S   C   H   E   M   E   S   ,   "   S   c   h   e   m   e   s   "   ,   3   )
 
(   U   N   D   E   F   I   N   E   D   ,   "   &   "   ,   4   )
 
 
(   U   N   D   E   F   I   N   E   D   ,   "   @   "   ,   4   )
 
 
(   Q   U   E   R   I   E   S   ,   "   Q   u   e   r   i   e   s   "   ,   4   )
 
(   C   O   L   O   N   ,   "   :   "   ,   5   )
 
 
 
 
(   S   T   R   I   N   G   ,   "   '   h   e   l   l   o
 
 
 
I
 
a
m
'
"
,
6
)
 
 
 
 
 
(   U   N   D   E   F   I   N   E   D   ,   "   $   "   ,   7   )
 
 
(
I
D
,
"
A
"
,
7   )
 
 
 
h   a   s
a
(   U   N   D   E   F   I   N   E   D   ,   "   '   t   h   i   s
R
e
t
u
r
n
 
 
 
n   e   a   r
 
 
 
T   h   e
)
e   n   d   '   '   s
 
 
 
"
,
7
 
 
 
 
 
 
 
 
 
 
(   E   O   F   ,   "   "   ,   1   0   )
 
=
2   4
 
 
T   o   t   a   l
 
T   o   k   e   n   s
 
 


 

Submission



 

Review the project standards for creating a zip archive. You submission cannot be scored if you fail to create the archive correct.

 

Navigate to Learning Suite [http://learningsuite.byu.edu], select 'CS 236', and click on 'Assignments' on the course home page. Click on the link to the relevant project and at the bottom of the description click on 'View/Submit'. Use the resulting upload dialog to upload your zip archive.

 

Pass‑off



 

Pass‑off your project directly to a TA during normal TA hours after your archive is uploaded to learningsuite [http://learningsuite.byu.edu].

 

TAs help students on a first‑come/first‑serve basis and are under no obligation to stay later than designated hours so plan accordingly.

 

Please review the syllabus for the pass‑off requirements.

 

cs‑236/lexical‑analyzer.txt · Last modified: 2015/02/23 17:14 by egm

 
Powered by