|
There are no translations available.
A simple word grammarThe following simple word grammar has three keywords, a word rule to match identifiers, and a word rule to match numbers: KEYWORDS
after and break
USER-DEFINED WORDS
Identifier: [ab][ab01]* Number: [01]+
For the sake of simplicity, we omit any rules for comments or strings. Further, we simplify the identifier's letter range from [A-Za-z] to [ab] and the number's digit range from [0-9] to [01]. It does not affect the algorithms involved; but it does reduce the size of the resulting state transition table. Word rules
We can derive the atomic word rules for the keywords as: T{R1:[0] a [1] f [2] t [3] e [4] r [5 MATCH] } T{R2:[0] a [1] n [2] d [3 MATCH] } T{R3:[0] b [1] r [2] e [3] a [4] k [5 MATCH] }
For the word rules defining the permissible user-defined words, we must apply reduction rules, in order to get rid of the Kleene's closures in the word rules: T{R4:[0] [ab] [1] [ab01]* [2 MATCH] } =T{R4:[0] [ab] [1 MATCH] } + T{R4:[0] [ab] [1] [ab01] [2 MATCH] } + T{R4:[0] [ab] [1] [ab01] [2] [ab01] [1 MATCH] }
T{R5:[0] [01]+ [1 MATCH] } =T{R5:[0] [01] [1 MATCH] } + T{R5:[0] [01] [1] [01] [1 MATCH] }
Atomic word rules
After applying the reduction rules to the Kleene's closures, we obtain the collection of atomic word rules: T{R1:[0] a [1] f [2] t [3] e [4] r [5 MATCH] } T{R2:[0] a [1] n [2] d [3 MATCH] } T{R3:[0] b [1] r [2] e [3] a [4] k [5 MATCH] } T{R41:[0] [ab] [1 MATCH] } T{R42:[0] [ab] [1] [ab01] [2 MATCH] } T{R43:[0] [ab] [1] [ab01] [2] [ab01] [2 MATCH] } T{R51:[0] [01] [1 MATCH] } T{R52:[0] [01] [1] [01] [1 MATCH] } Deriving the state transitions from the atomic word rulesAlgorithm: - Start from the combined state zero in all atomic word rules;
- List each follow letter in each rule state in the combined state (the follow set for a combined state);
- For each follow letter, determine what the next combined state will be, when following that letter.
- If there is no transition available for a rule for a follow letter, eject the rule out of the next combined state;
- Go back to 2.
(R1-0,R2-0,R3-0,R41-0,R42-0,R43-0,R51-0,R52-0) a --> (R1-1,R2-1,R41-1 MATCH, R42-1, R43-1) b --> (R3-1,R41-1 MATCH, R42-1, R43-1) 01 --> (R51-1,R52-1)
(R1-1,R2-1,R41-1 MATCH, R42-1, R43-1) f --> (R1-2) n --> (R2-2) ab01 --> (R42-2 MATCH, R43-2)
(R1-2) t --> (R1-3)
(R1-3) e --> (R1-4)
(R1-4) r --> (R1-5 MATCH)
(R2-2) d --> (R2-3 MATCH)
(R42-2 MATCH, R43-2) ab01 --> (R43-1 MATCH)
(R43-1) ab01 --> (R43-2 MATCH)
(R43-2 MATCH) ab01 --> (R43-2 MATCH)
(R3-1,R41-1 MATCH, R42-1, R43-1) r --> (R3-2) ab01 --> (R42-2 MATCH, R43-2)
(R51-1,R52-1) 01 --> (R52-1 MATCH)
(R52-1 MATCH) 01 --> (R52-1 MATCH)
Renumbered statesEach combined state derived above, obtains its own state number: 0=(R1-0,R2-0,R3-0,R41-0,R42-0,R43-0,R51-0,R52-0) 1 MATCH R4=(R1-1,R2-1,R41-1 MATCH, R42-1, R43-1) 2 MATCH R4=(R3-1,R41-1 MATCH, R42-1, R43-1) 3=(R51-1,R52-1) 4=(R1-2) 5=(R2-2) 6 MATCH R4=(R42-2 MATCH, R43-2) 7=(R1-3) 8=(R1-4) 9 MATCH R1=(R1-5 MATCH) 10 MATCH R2=(R2-3 MATCH) 11=(R43-1) 12 MATCH R4=(R43-2 MATCH) 13=(R3-2) 14 MATCH R5=(R52-1 MATCH) Renumbered transitionsWe now replace the combined states by their corresponding numbers in the state transitions, in order to obtain the following state transition table: (0;a;1 MATCH R4) (0;b;2 MATCH R4) (0;0;3) (0;1;3) (1;f;4) (1;n;5) (1;a;6 MATCH R4) (1;b;6 MATCH R4) (1;0;6 MATCH R4) (1;1;6 MATCH R4) (5;t;7) (7;e;8) (8;r;9 MATCH R1) (5;d;10 MATCH R2) (6;a;11) (6;b;11) (6;0;11) (6;1;11) (11;a;12 MATCH R4) (11;b;12 MATCH R4) (11;0;12 MATCH R4) (11;1;12 MATCH R4) (12;a;12 MATCH R4) (12;b;12 MATCH R4) (12;0;12 MATCH R4) (12;1;12 MATCH R4) (2;r;13) (2;a;6 MATCH R4) (2;b;6 MATCH R4) (2;0;6 MATCH R4) (2;1;6 MATCH R4) (3;0;14 MATCH R5) (3;1;14 MATCH R5) (14;0;14 MATCH R5) (14;1;14 MATCH R5)
Execution rule for the scanner
- The initial state is state zero.
- In the current state, if the next letter in the character stream causes a valid transition, go to the next state, regardless of whether the current state is a match or not (=greed);
- In the current state, if the next letter does not cause a valid transition and there is a match for a rule: reduce the rule matched;
- In the current state, if the next letter does not cause a valid transition and there is no match for any rule: fail (lexer error);
- Repeat this in the next state.
Representing atomic rules programmatically An atomic rule contains no regular expression' control structures and is of the following format:
Ri: {<state>,<charclass>}<state>
For example, in the notation used above:
R7:[0] a [1] [xyz] [2] t [3] [0-9] [4] r [5 MATCH]
A character class is a collection of characters and character ranges. A character is any letter limited to ascii lower range, such as 'a', 'k', '0', or '\n'(=newline) or other lower range ascii characters. A character range is a collection of characters in which the lower bound is separated from the higher bound by a hyphen. For example, the character range 'a-d' means the collection of the letters 'a', 'b', 'c', and 'd'.
We will represent a character class in the atomic rules by strings representing all member characters in the character class (expansion).
For example, the character class 'acfg-knpA-CZ' becomes after expansion: 'acfghijknpABCZ'.
We will represent a state, by a number, optionally followed by the letter 'M' to indicate that the state leads to a match for the rule.
The atomic rule for a keyword, is simply an initial zero state, followed by each letter in the keyword followed by a state. For example, for the keyword 'select':
atomicRule('select') == {'0','s','1','e','2' ,'l','2','e','3','c','4','t','5M' }
Another example, for a randomly constructed atomic word rule. Even though the rule is a bit nonsensical, it is still a valid atomic word rule:
atomicRule('somerule') == {'0','sabxyz','1','eff','2' ,'lt','2','e','3','cdddm','4M'}
The atomic word rule is the most essential data structure in the lexer-generator algorithm described above. Representing states programmaticallyA rule state is a cursor position inside one single rule. For example, R2-5, represents the position in state 5 of rule R2.
A rule state is a tuple (name,state):
<ruleState>: (<name>,<stateNumber>)
A combined state represents a cursor position inside multiple rules simultaneously. For example, (R2-3,R6-2), represents a simultaneous cursor position in rule R2 and in rule R6.
Therefore, a combined state is a collection of rule states.
<combinedState>: { <ruleState> }
The regular expression reduction process, however, will reduce each non-atomic word rule into a collection of atomic word rules. Therefore, we need to keep track of both the non-atomic name and the atomic name in the combined state:
<ruleState>: (<nonAtomicName>,<atomicName>,<stateNumber>)
For example, the rule state R41-1 MATCH becomes:
ruleState=('R4','R41', '1M')
Therefore, the combined state (R1-1,R2-1,R41-1 MATCH, R42-1, R43-1) will be represented programmatically as:
combinedState=((('R1','R1','1'),('R2','R2','1'),('R4','R41','1M'),('R4','R42','1'),('R4','R43','1'))
RenumberingA state transition is a tuple representing a transition between two combined states, such as for example: (1;a;6 MATCH R4).
<transition>: (<stateFrom>,<character>,<stateTo> [,{<ruleName>}])
A state transition could match more than one word rule; and this would indeed be ambiguous.
Therefore, we can represent the following state transition (1;a;6 MATCH R4) programmatically as:
transition=('1','a','6', ('R4'))
Object model for (1) the state transition table generator and (2) the scanner
The algorithm described above, is data-structure heavy. In order to implement correctly the algorithms for both the state transition table generator and the scanner -- which will use the table generated -- we need to model their classes properly. In a next blog post, I will try to propose an object model for both programs.
|