Rainforest

Sankuru

Implementeren, customiseren, uitbreiden, en troubleshooten van Joomla/Virtuemart

Views: 693

Wij helpen met ...

Virtuemart
Joomfish
Andere extensies
SocialTwist Tell-a-Friend

Automatische vertaling

English Arabic Chinese (Simplified) German Japanese Russian Spanish



Hergebruik open source

Datgene wat U nodig hebt, bestaat vaak al, en dekt 80% van Uw behoeften. Wij zorgen voor de ontbrekende 20%.

Gratis offerte

Vraag vandaag nog gratis een offerte aan.

Building a tokenizer for the EAV problem, part 2 PDF Afdrukken E-mail
Waardering: / 0
SlechtZeer goed 
Geschreven door erik   
donderdag 28 mei 2009 07:04
There are no translations available.

A simple word grammar

The following simple word grammar has three keywords, a word rule to match identifiers, and a word rule to match numbers:

KEYWORDS

after
and
break

USER-DEFINED WORDS

Identifier: [ab][ab01]*
Number: [01]+

For the sake of simplicity, we omit any rules for comments or strings. Further, we simplify the identifier's letter range from [A-Za-z] to [ab] and the number's digit range from [0-9] to [01]. It does not affect the algorithms involved; but it does reduce the size of the resulting state transition table.

 

Word rules

We can derive the atomic word rules for the keywords as:

T{R1:[0] a [1] f [2] t [3] e [4] r [5 MATCH] }
T{R2:[0] a [1] n [2] d [3 MATCH] }
T{R3:[0] b [1] r [2] e [3] a [4] k [5 MATCH] }

For the word rules defining the permissible user-defined words, we must apply reduction rules, in order to get rid of the Kleene's closures in the word rules:

T{R4:[0] [ab] [1] [ab01]* [2 MATCH] }
=T{R4:[0] [ab] [1 MATCH] }
+ T{R4:[0] [ab] [1] [ab01] [2 MATCH] }
+ T{R4:[0] [ab] [1] [ab01] [2] [ab01] [1 MATCH] }

T{R5:[0] [01]+ [1 MATCH] }
=T{R5:[0] [01] [1 MATCH] }
+ T{R5:[0] [01] [1] [01] [1 MATCH] }

 

Atomic word rules

After applying the reduction rules to the Kleene's closures, we obtain the collection of atomic word rules:

T{R1:[0] a [1] f [2] t [3] e [4] r [5 MATCH] }
T{R2:[0] a [1] n [2] d [3 MATCH] }
T{R3:[0] b [1] r [2] e [3] a [4] k [5 MATCH] }
T{R41:[0] [ab] [1 MATCH] }
T{R42:[0] [ab] [1] [ab01] [2 MATCH] }
T{R43:[0] [ab] [1] [ab01] [2] [ab01] [2 MATCH] }
T{R51:[0] [01] [1 MATCH] }
T{R52:[0] [01] [1] [01] [1 MATCH] }
 

Deriving the state transitions from the atomic word rules

Algorithm:

  1. Start from the combined state zero in all atomic word rules;
  2. List each follow letter in each rule state in the combined state (the follow set for a combined state);
  3. For each follow letter, determine what the next combined state will be, when following that letter.
  4. If there is no transition available for a rule for a follow letter, eject the rule out of the next combined state;
  5. Go back to 2.

 

(R1-0,R2-0,R3-0,R41-0,R42-0,R43-0,R51-0,R52-0)
a --> (R1-1,R2-1,R41-1 MATCH, R42-1, R43-1)
b --> (R3-1,R41-1 MATCH, R42-1, R43-1)
01 --> (R51-1,R52-1)

(R1-1,R2-1,R41-1 MATCH, R42-1, R43-1)
f --> (R1-2)
n --> (R2-2)
ab01 --> (R42-2 MATCH, R43-2)

(R1-2)
t --> (R1-3)

(R1-3)
e --> (R1-4)

(R1-4)
r --> (R1-5 MATCH)

(R2-2)
d --> (R2-3 MATCH)

(R42-2 MATCH, R43-2)
ab01 --> (R43-1 MATCH)

(R43-1)
ab01 --> (R43-2 MATCH)

(R43-2 MATCH)
ab01 --> (R43-2 MATCH)

(R3-1,R41-1 MATCH, R42-1, R43-1)
r --> (R3-2)
ab01 --> (R42-2 MATCH, R43-2)

(R51-1,R52-1)
01 --> (R52-1 MATCH)

(R52-1 MATCH)
01 --> (R52-1 MATCH)

 

Renumbered states

Each combined state derived above, obtains its own state number:

0=(R1-0,R2-0,R3-0,R41-0,R42-0,R43-0,R51-0,R52-0)
1 MATCH R4=(R1-1,R2-1,R41-1 MATCH, R42-1, R43-1)
2 MATCH R4=(R3-1,R41-1 MATCH, R42-1, R43-1)
3=(R51-1,R52-1)
4=(R1-2)
5=(R2-2)
6 MATCH R4=(R42-2 MATCH, R43-2)
7=(R1-3)
8=(R1-4)
9 MATCH R1=(R1-5 MATCH)
10 MATCH R2=(R2-3 MATCH)
11=(R43-1)
12 MATCH R4=(R43-2 MATCH)
13=(R3-2)
14 MATCH R5=(R52-1 MATCH)
 

Renumbered transitions

We now replace the combined states by their corresponding numbers in the state transitions, in order to obtain the following state transition table:

(0;a;1 MATCH R4)
(0;b;2 MATCH R4)
(0;0;3)
(0;1;3)
(1;f;4)
(1;n;5)
(1;a;6 MATCH R4)
(1;b;6 MATCH R4)
(1;0;6 MATCH R4)
(1;1;6 MATCH R4)
(5;t;7)
(7;e;8)
(8;r;9 MATCH R1)
(5;d;10 MATCH R2)
(6;a;11)
(6;b;11)
(6;0;11)
(6;1;11)
(11;a;12 MATCH R4)
(11;b;12 MATCH R4)
(11;0;12 MATCH R4)
(11;1;12 MATCH R4)
(12;a;12 MATCH R4)
(12;b;12 MATCH R4)
(12;0;12 MATCH R4)
(12;1;12 MATCH R4)
(2;r;13)
(2;a;6 MATCH R4)
(2;b;6 MATCH R4)
(2;0;6 MATCH R4)
(2;1;6 MATCH R4)
(3;0;14 MATCH R5)
(3;1;14 MATCH R5)
(14;0;14 MATCH R5)
(14;1;14 MATCH R5)

 

Execution rule for the scanner

  • The initial state is state zero.
  • In the current state, if the next letter in the character stream causes a valid transition, go to the next state, regardless of whether the current state is a match or not (=greed);
  • In the current state, if the next letter does not cause a valid transition and there is a match for a rule: reduce the rule matched;
  • In the current state, if the next letter does not cause a valid transition and there is no match for any rule: fail (lexer error);
  • Repeat this in the next state.

 

Representing atomic rules programmatically

 

An atomic rule contains no regular expression' control structures and is of the following format:

Ri: {<state>,<charclass>}<state>

For example, in the notation used above:

R7:[0] a [1] [xyz] [2] t [3] [0-9] [4] r [5 MATCH]

A character class is a collection of characters and character ranges. A character is any letter limited to ascii lower range, such as 'a', 'k', '0', or '\n'(=newline) or other lower range ascii characters. A character range is a collection of characters in which the lower bound is separated from the higher bound by a hyphen. For example, the character range 'a-d' means the collection of the letters 'a', 'b', 'c', and 'd'.

We will represent a character class in the atomic rules by strings representing all member characters in the character class (expansion).

For example, the character class 'acfg-knpA-CZ' becomes after expansion: 'acfghijknpABCZ'.

We will represent a state, by a number, optionally followed by the letter 'M' to indicate that the state leads to a match for the rule.

The atomic rule for a keyword, is simply an initial zero state, followed by each letter in the keyword followed by a state. For example, for the keyword 'select':

atomicRule('select') == {'0','s','1','e','2' ,'l','2','e','3','c','4','t','5M' }

Another example, for a randomly constructed atomic word rule. Even though the rule is a bit nonsensical, it is still a valid atomic word rule:

atomicRule('somerule') == {'0','sabxyz','1','eff','2' ,'lt','2','e','3','cdddm','4M'}

The atomic word rule is the most essential data structure in the lexer-generator algorithm described above.
 

Representing states programmatically

A rule state is a cursor position inside one single rule. For example, R2-5, represents the position in state 5 of rule R2.

A rule state is a tuple (name,state):

<ruleState>: (<name>,<stateNumber>)

A combined state represents a cursor position inside multiple rules simultaneously. For example, (R2-3,R6-2), represents a simultaneous cursor position in rule R2 and in rule R6.

Therefore, a combined state is a collection of rule states.

<combinedState>: { <ruleState> }

The regular expression reduction process, however, will reduce each non-atomic word rule into a collection of atomic word rules. Therefore, we need to keep track of both the non-atomic name and the atomic name in the combined state:

<ruleState>: (<nonAtomicName>,<atomicName>,<stateNumber>)

For example, the rule state R41-1 MATCH becomes:

ruleState=('R4','R41', '1M')

Therefore, the combined state (R1-1,R2-1,R41-1 MATCH, R42-1, R43-1) will be represented programmatically as:

combinedState=((('R1','R1','1'),('R2','R2','1'),('R4','R41','1M'),('R4','R42','1'),('R4','R43','1'))

 

Renumbering

A state transition is a tuple representing a transition between two combined states, such as for example: (1;a;6 MATCH R4).

<transition>: (<stateFrom>,<character>,<stateTo> [,{<ruleName>}])

A state transition could match more than one word rule; and this would indeed be ambiguous.

Therefore, we can represent the following state transition (1;a;6 MATCH R4) programmatically as:

transition=('1','a','6', ('R4'))

 

Object model for (1) the state transition table generator and (2) the scanner

The algorithm described above, is data-structure heavy. In order to implement correctly the algorithms for both the state transition table generator and the scanner -- which will use the table generated -- we need to model their classes properly. In a next blog post, I will try to propose an object model for both programs.


blog comments powered by Disqus
 
 
Joomla 1.5 Templates by Joomlashack