# lexical analyzer lab 1 Solution

An objective CS 236 is to help you write complex programs by using mathematical concepts as the basis for solving real world problems through computation. Mathematical thinking leads to programs that are clear, organized, and verifiably correct. This first projects shows you how discrete math structures form the basis, not only for writing the code, but for the code itself. It thus also provides a clear guide to maintaining and extending the code by any other programmer who also understands the discrete math structures.

You are expected to write code from scratch that clearly emulates the discrete math structures on which the solution is built. You are expected to justify your design and your code as one that is maintainable and extendable by other programmers who understand and are conversant with discrete mathematical structures. You should also learn that for complex programs, if you hack the code or formulate your code off of the discrete math structures, it will take you longer to write, it is less likely to be correct.

Please be aware that the project standards apply to this project and all projects in this course, and in particular for this projects, you may not use a regular expression library.

Description

A lexical analyzer groups characters in an input stream into tokens. A token is a sequence of one or more characters that form a single element of a language (e.g., a symbol, a numerical value, a string literal, or a keyword).

The project is to write a lexical analyzer for a subset of the Datalog language [http://en.wikipedia.org/wiki/Datalog]. The Lexer should output the UNDEFINED token if an undefined symbol is found. The input to your Lexer is a text file and the output is a print‑out of formatted tokens explained below. Please refer to the project standards for details on how to indicate the program input and where the output should be directed.

A token object is comprised of:

a string ‑ extracted from the file text

a number ‑ the line the token started on

a token type ‑ listed below

The lexical analyzer must use finite state machines to recognize tokens in the input streams. It is recommended that you consider using enumerated types to enumerate the potential states of the machine.

Token Types

You are to generate tokens from a text input stream for a subset of Datalog. The list of all token types that you will be expected to identify is below. You need to define the token types in your code.

http://wiki.cs.byu.edu/cs-236/lexical-analyzer                                                                                                                                                                                                                                                         1/5

3/23/2015

cs-236:lexical-analyzer[CS W iki]

Token Type

Description

Examples

COMMA

The ',' character

,

PERIOD

The '.' character

.

Q_MARK

The '?' character

?

LEFT_PAREN

The '(' character

(

RIGHT_PAREN

The ')' character

)

COLON

The ':' character

:

COLON_DASH

The string “:‑”

:‑

MULTIPLY

The ' * ' character

*

The ' + ' character

+

SCHEMES

The string “Schemes”

Schemes

FACTS

The string “Facts”

Facts

RULES

The string “Rules”

Rules

QUERIES

The string “Queries”

Queries

valid
invalid

ID

A letter followed by 0 or more letters or digits and is not a keyword (Schemes, Facts, Rules, Queries)

Identifier1
1stPerson

Person
Person_Name

Schemes

Any sequence of characters enclosed in single quotes. Two single quotes denote an apostrophe within the string. For line‑number

'This is a string'

STRING

counts, count all \n's within a string. A string token’s line number is the line where the string starts. If EOF is found before the end

' ' (The empty string)

of the string, it is undefined (see UNDEFINED token below).

'this isn''t two strings'

A line comment starts with # and ends at the next newline or EOF.

# This is a comment

#| This is a tricky

COMMENT

A block comment starts with a #| and ends with a |#. The comment's line number is the line where the comment started. Like

multiline comment |#

#| This is an illegal block

STRING, if EOF is found before the end of the block comment, it is UNDEFINED (see UNDEFINED token below).

comment

because it ends with EOF

WHITESPACE

Any character recognized by the isspace() function defined in ctype.h. Though we give it a token type here, these tokens will be

ignored and not saved. Be sure to count the line numbers when skipping over white space.

Any character not tokenized as a string, keyword, identifier, symbol, or white space is undefined. Additionally, any non‑

\$&^   (Three   undefined

tokens)

UNDEFINED

terminating string or non‑terminating block comment is undefined. In both of the latter cases we reach EOF before finding the end

'any string that does not

of the string or the end of the block comment.

end

EOF

End of input file

Output Format

http://wiki.cs.byu.edu/cs-236/lexical-analyzer                                                                                                                                                                                                                                                         2/5

3/23/2015                                                                                                                                    cs-236:lexical-analyzer[CS W iki]

For this and all subsequent labs, you MUST match the formatting requirements exactly. Automated testing systems are used to check your code (similar to those used by professional software design companies) and it is desired that you experience writing code that matches a given protocol.

The expected output is the representation of a token, printed one per line, followed by a single line that states how many tokens there were. You should NOT print any information for a WHITESPACE token. Format tokens as follows:

( T o k e n _ T y p e , " V a l u e     "     ,     L     i     n     e     _     N     u m b e r )

Notice there are no spaces on either side of the commas separating the three token's elements. The token type should appear in this list, spelled exactly as specified (all capital letters). The text is the value of the token surrounded by double quotes. You should have only one token per line, and the last line should look like:

T o t a l   T  o k e     n     s             =             #     #             #

where ### is replaced by the number of tokens found (excluding whitespace). ALL OUTPUT IS CASE SENSITIVE. See the example below:

Example 1

For input (the line numbers are not part of the input–they are to help clarify new lines)

0
1
.

0
2
,

s   t   r   i   n   g   '
0
3
'   a
0
4
#   A
c   o   m   m   e   n   t
0
5
S
c
h   e   m   e   s
0
6
F   a   c   t   s   R   u   l   e   s
0
7
:
:
-
0
8

The Lexer should output

(   P   E   R   I   O   D   ,   "   .   "   ,   1   )

(   C   O   M   M   A   ,   "   ,   "   ,   2   )

(   S   T   R   I   N   G   ,   "   '   a
s   t   r   i   n   g   '   "   ,   3   )
(   C   O   M   M   E   N   T   ,   "   #   A
c   o   m   m   e   n   t   "   ,   4   )
(   S   C   H   E   M   E   S   ,   "   S   c   h   e   m   e   s   "   ,   5   )
(   I   D   ,   "   F   a   c   t   s   R   u   l   e   s   "   ,   6   )
(   C   O   L   O   N   ,   "   :   "   ,   7   )

(   C   O   L   O   N   _   D   A   S   H   ,   "   :   -   "   ,   7   )
(   E   O   F   ,
"   "   ,   8   )
=
9
T   o   t   a   l
T   o   k   e   n   s

Example 2

Input

http://wiki.cs.byu.edu/cs-236/lexical-analyzer                                                                                                                                                                                                                                                         3/5

3/23/2015
cs-236:lexical-analyzer[CS W iki]
0
1
Q   u   e   r   i   e   s   :
H   (   '   S   n   o   o   p   y   '   ,   R   ,   '   M   '   ,   H   )
0
2
I   s   I   n   R   o   o   m   A   t   D
0
3
#   S   c   h   e   m   e   s   F   a   c
t   s   R   u   l   e   s   <
0
4.
=
0
5
#   |   c   o   m   m   e   n   t
0
6
w   o   w   |   #

0
7

Output

( Q U E R I E S , " Q     u      e      r      i      e      s      "      ,      1              )

( C O L O N ,     "      :      "      ,      1              )

( I D , " I s I n R o     o      m      A      t      D      H      "      ,      2              )

( L E F T _ P A R  E      N      ,      "      (      "      ,      2              )

( S T R I N G , " ' S     n      o      o      p      y      '      "      ,      2              )

( C O M M A ,     "      ,      "      ,      2              )

(
I
D
,
"
R
"
,
2
)
(   C   O   M   M   A   ,   "   ,   "   ,   2   )
(   S   T   R   I   N   G   ,   "   '   M   '   "   ,   2   )
(   C   O   M   M   A   ,   "   ,   "   ,   2   )
(
I
D
,
"
H
"
,
2
)

(
R
I
G
H
T
_
P
A
R
E
N
,
"
)
"
,
2
)
a   c   t   s   R   u   l   e   s   <   "   ,   3   )
(
C
O
M
M
E
N
T
,
"
#
S
c
h
e
m
e
s
F

(   P   E   R   I   O   D   ,   "   .   "   ,   4   )

e
n   t
=
(   C   O   M   M   E   N   T   ,   "   #   |   c   o   m   m
w
o
w
|
#
"
,
5
)
)

(
E
O
F
,
"
"
,
7
n   s
=
1
6

T   o   t   a   l

T   o   k   e

Example 3

Input

0
1
Q
u
e
r
i
e   s
:

0
2
I   s   I   n   R   o   o   m   A   t   D   H   (   '   S   n   o   o   p   y   '   ,   R   ,   '   M   '   H   )   ?
0
3
R   u   l   e   s
F   a   c   t   s
S   c   h   e   m   e   s

0
4
&   @   Q   u   e   r   i   e   s

0
5
:
h
e
l
l
o

0
6
'
A
'   t   h   i   s
h   a   s
a
0
7
I
e
a   m   '
\$
0
8
R
t
u
r
n

n   e   a   r

0
9
T   h   e

e   n   d   '   '   s

1
0

Output

(   Q   U   E   R   I   E   S   ,   "   Q   u   e   r   i   e   s   "   ,   1   )
(   C   O   L   O   N   ,   "   :   "   ,   1
)

)
(   I   D   ,   "   I   s   I   n   R   o   o   m   A   t   D   H   "   ,   2
(
L
E
F
T
_
P
A
R
E
N
,   "   (
"
,
2
)
,   2
)
(
S
T
R
I
N
G
,
"
'
S
n
o
o
p
y
'
"

(   C   O   M   M   A   ,   "   ,   "   ,
2
)
(
I
D
,
"
R
"
,
2
)
,
2
)
(
C
O
M
M
A
,
"
,
"

http://wiki.cs.byu.edu/cs-236/lexical-analyzer                                                                                                                                                                                                                                                         4/5

3/23/2015

cs-236:lexical-analyzer[CS W iki]
(   S   T   R   I   N
G   ,   "   '   M   '   "   ,   2   )

(   I   D   ,   "
H
"   ,   2   )

(   R   I   G   H   T
_   P   A   R   E   N   ,   "   )   "   ,   2   )
(   Q   _   M   A   R
K   ,   "   ?   "   ,   2   )
)
(   R   U   L   E   S
,   "   R   u   l   e   s   "   ,   3

(   F   A   C   T   S   ,   "   F   a   c   t   s   "   ,   3   )

(   S   C   H   E   M   E   S   ,   "   S   c   h   e   m   e   s   "   ,   3   )

(   U   N   D   E   F   I   N   E   D   ,   "   &   "   ,   4   )

(   U   N   D   E   F   I   N   E   D   ,   "   @   "   ,   4   )

(   Q   U   E   R   I   E   S   ,   "   Q   u   e   r   i   e   s   "   ,   4   )

(   C   O   L   O   N   ,   "   :   "   ,   5   )

(   S   T   R   I   N   G   ,   "   '   h   e   l   l   o

I

a
m
'
"
,
6
)

(   U   N   D   E   F   I   N   E   D   ,   "   \$   "   ,   7   )

(
I
D
,
"
A
"
,
7   )

h   a   s
a
(   U   N   D   E   F   I   N   E   D   ,   "   '   t   h   i   s
R
e
t
u
r
n

n   e   a   r

T   h   e
)
e   n   d   '   '   s

"
,
7

(   E   O   F   ,   "   "   ,   1   0   )

=
2   4

T   o   t   a   l

T   o   k   e   n   s

Submission

Review the project standards for creating a zip archive. You submission cannot be scored if you fail to create the archive correct.

Navigate to Learning Suite [http://learningsuite.byu.edu], select 'CS 236', and click on 'Assignments' on the course home page. Click on the link to the relevant project and at the bottom of the description click on 'View/Submit'. Use the resulting upload dialog to upload your zip archive.

Pass‑off

Pass‑off your project directly to a TA during normal TA hours after your archive is uploaded to learningsuite [http://learningsuite.byu.edu].

TAs help students on a first‑come/first‑serve basis and are under no obligation to stay later than designated hours so plan accordingly.

Please review the syllabus for the pass‑off requirements.