Rainforest

Sankuru

Implementing, customizing, extending, and troubleshooting Joomla/Virtuemart

Views: 771
SocialTwist Tell-a-Friend

Machine translation

English Arabic Chinese (Simplified) German Japanese Russian Spanish



Re-use open source

What you need, often exists already, and covers your requirements for 80%. We will add the remaining 20% for you.

Free quote

Request a free quote today.

Building a tokenizer for the EAV problem PDF Print E-mail
User Rating: / 0
PoorBest 
Written by erik   
Wednesday, 27 May 2009 04:55

A basic tokenizer matching one word

Imagine we want to match the word "hello" at the beginning of a character stream. Every time we read a character, we must remember that we have read it, and that we expect the next one. Our program has a state. After reading a character, it moves to another state. The state of the program depends on what letters we have already read.

R: [0] h [1] e [2] l [3] l [4] o [5 MATCH]

The initial state, is state zero. We have not read anything. After reading the letter 'h', the program moves to the next state [1]. After reading the letter 'e', the program moves to the next state, that is, [2], and so on. The letters in the character stream cause transitions from one state to another in the program. These transitions are:

(0;h;1)
(1;e;2)
(2;l;3)
(3;l;4)
(4;o;5 MATCH)

If we are in state [3], and we read the letter 'p', we can see that there is no such transition available. The program will fail to move to the next state.

The requirement to match a word, is a rule, that can be expressed through a collection of state transitions.

 

A basic tokenizer matching multiple words simultaneously

We may want to match the words "while", "when", "winter", and "yellow" at the beginning of a character stream. We now have four rules to match, instead of one:

R1: [0] w [1] h [2] i [3] l [4] e [5 MATCH]
R2: [0] w [1] h [2] e [3] n [4 MATCH]
R3: [0] w [1] i [2] n [3] t [4] e [5] r [6 MATCH]
R4: [0] y [1] e [2] l [3] l [4] o [5] w [6 MATCH]

The initial state of the program is the state zero in every rule: (R1-0,R2-0,R3-0,R4-0). If we read the letter 'w', the combined state (the cursor) moves to (R1-1,R2-1,R3-1). If we read the letter 'y', the cursor moves to (R4-1).

The transitions are:

(R1-0,R2-0,R3-0,R4-0; w; R1-1,R2-1,R3-1)
(R1-0,R2-0,R3-0,R4-0; y; R4-1)
(R1-1,R2-1,R3-1; h; R1-2,R2-2)
(R1-2,R2-2; i; R1-3)
(R1-3; l; R1-4)
(R1-4; e; R1-5 MATCH R1)
(R1-2,R2-2; e; R2-3)
(R2-3; n; R2-4 MATCH R2)
(R1-1,R2-1,R3-1; i; R3-2)
(R3-2; n; R3-3)
(R3-3; t; R3-4)
(R3-4; e; R3-5)
(R3-5; r; R3-6 MATCH R3)
(R4-1; e; R4-2)
(R4-2; l; R4-3)
(R4-3; l; R4-4)
(R4-4; o; R4-5)
(R4-5; l; R4-6 MATCH R4)
 

We can renumber the states. The actual numbers don't matter:

0 = R1-0,R2-0,R3-0,R4-0
1 = R1-1,R2-1,R3-1
2 = R1-2,R2-2
3 = R1-3
4 = R1-4
5 = R1-5 MATCH R1
6 = R1-2,R2-2
7 = R2-3
8 = R2-4 MATCH R2
9 = R1-1,R2-1,R3-1
10 = R3-2
11 = R3-3
12 = R3-4
13 = R3-5
14 = R3-6 MATCH R3
15 = R4-1
16 = R4-2
17 = R4-3
18 = R4-4
19 = R4-5
20 = R4-6 MATCH R4

The collection of transitions becomes:

(0; w; 1)
(0; y; 4)
(1; h; 2)
(2; i; 3)
(3; l; 4)
(4; e; 5 MATCH R1)
(2; e; 7)
(7; n; 8 MATCH R2)
(9; i; 10)
(10; n; 11)
(11; t; 12)
(12; e; 13)
(13; r; 14 MATCH R3)
(15; e; 16)
(16; l; 17)
(17; l; 18)
(18; o; 19)
(19; l; 20 MATCH R4)
 

As you can see, the lexer will match one word or one million words equally fast. The number of words to match does not have any impact on the number of steps the lexer most go through.

Notes:

  • The collection of state transitions would be invalid, if there is more than one transition available for state s and character c. The choice of what state to go next to, would be undetermined, as the program could go to more than one next state. The algorithm above does not introduce such ambiguities.
  • If the list of words contain duplicates, the algorithm will simply end up matching more than one word. This is allowed.

 

Matching regular expressions

We may want to match a word, in which a part may be repeated or omitted. For example, a(bc)*d matches:

ad
abcd
abcbcd
abcbcbcd
abcbcbcbcd
...

The repetition operator is called Kleene's closure.

The most popular control instructions in regular expressions, are (..)*, (..)+, and (..)?.

The strict repetition operator (..)+ as in a(bc)+d matches:

abcd
abcbcd
abcbcbcd
abcbcbcbcd
...

The optionality operator (..)? as in a(bc)?d, matches:

ad
abcd

How do we derive the state transitions for Kleene's closure?

Traditionally, we use Thompson's algorithm to compute the state transitions for Kleene's closure.

There exists, however, a simplifying reduction rule. Look at the transitions in the example:

[0] a [1] ( b [2] c [3] )*d [4]

It generates the following series of transitions:

[0] a [1] d [4]
--> (0;a;1), (1;d;4)

[0] a [1]  b [2] c [3] d [4]
--> (0;a;1), (1;b;2), (2;c;3), (3;d;4)

[0] a [1]  b [2] c [3]  b [2] c [3] d [4]
--> (0;a;1), (1;b;2), (2;c;3), (3;b;2), (3;d;4)

[0] a [1]  b [2] c [3]  b [2] c [3] b [2] c [3] d [4]
--> (0;a;1), (1;b;2), (2;c;3), (3;b;2), (3;d;4)
 
[0] a [1]  b [2] c [3]  b [2] c [3] b [2] c [3] b [2] c [3] d [4]
--> (0;a;1), (1;b;2), (2;c;3), (3;b;2), (3;d;4)
...

As you can see, no new transitions are being generated after the third iteration.

The transitions are:

(0;a;1)
(1;d;4)
(1;b;2)
(2;c;3)
(3;d;4)
(3;b;2)

Therefore, we can find the transitions with the following reduction rule:

T{a(bc)*d} = T{ad} + T{abcd} + T{abcbcd}

meaning of course:

T{[0] a [1] (b [2] c [3] )*d [4] } = T{[0] a [3] d [4]} + T{[0] a [1] b [2] c [3] d [4]} + T{[0] a [1] b [2] c [3] b [2] c [3] d [4]}

Similarly, we can find the transitions for the (..)+ and (..)? operators by using a reduction rule:

T{a(bc)+d} = T{abcd} + T{abcbcd}
T{a(bc)?d} = T{ad} + T{abcd}
 

Other control instructions in regular expressions

The popular tools sed, grep, and perl implement quite a few additional control instructions for use in regular expressions.

For example, the OR operator:

R: a ( b | c) d

We can reduce this operator, by splitting the rule in two:

T{a(b|c)d} = T{abd} + T{acd}

Another example, are character classes. They are also useful control instructions in regular expressions. For example, [0-9] matches all digits. Therefore, we can use it in expressions such as: [a-z][0-9]+, which matches:

a898
a777777
b8989

... or any other word that starts with a letter and is followed by one or more digits.
 

Lexers and regular expressions

Even though a lexer tool like Lex will implement something close to full support for regular expressions, full regular expression support is not needed in order to tokenize most programming languages. A tokenizer will match the following word classes:

whitespace: space, newline, tab, and other ignored characters
keywords
: words predefined in the grammar
identifiers: user-defined words
numbers
: numerical constants
strings: alphanumerical constants
comments: descriptive text that can be ignored

Kleene's closure is typically only used to match identifiers and numbers:

  • Identifiers are typically valid in common programming languages, if they match a regular expression such as: [_a-zA-Z][_a-zA-Z0-9]*.
  • Numbers are typically valid, when they match a regular expression such as: [0-9]+ | [0-9]* \.[0-9]+ ([Ee](\+|-)?[0-9]+)?


We also do not need Kleene's closure to match whitespace or keywords.

We also do not need Kleene's closure to match strings or comments. As we do not want to match any other token within a string or a comment, these word classes are typically handled by sublexers. These sublexers do not match any other expressions for which the main lexer is searching.

Therefore, it may be worthwhile to figure out how to support the matching process of identifiers and numbers in the EAV compiler we are building, without painstakingly building full support for algorithms that will generally match any number of Kleene's closures simultaneously. We only need to support two particular Kleene expressions.

 

Regular expressions in the tokenizing process are greedy

If, for example, the word 'select' is a keyword in the grammar for which we are building a tokenizer, then it will match the first part of the word 'selection". However, 'selection' is an identifier, which happens to starts with a keyword. Therefore, we may not match a word rule, if we can still continue matching another word rule.

Therefore, when successfully matching a word rule, we must look ahead in the character stream, and ascertain that the next letter leads to a failure. If we match a word rule in position i of the character stream str in the state s, we must check that there is no valid transition available for (s, str[i+1]); otherwise, we must keep matching.

Identifiers and numbers are greedy, but not mutually ambiguous. After the first letter -- if it is not a digit -- we can rule out the presence of a number. Therefore, ambiguity during the matching process is typically the result of the competition between one or more remaining keywords and the identifier rule.

 

Punctuation

While the following statement is valid:

select 1 from mytable

The next statement is considered invalid:

select 1from mytable

Therefore, It is not sufficient that the next letter after a match, leads to failure. This next letter must also be punctuation: whitespace or punctuation (particular keywords), such as comma, semicolon, colon, dot, and so on.

In the example above, the lexer would match the number '1' with the word rule for numbers, while there would be no other rule available to continue matching with the letter 'f'. Therefore, the match would be successful. However, the letter 'f' is not punctuation. Therefore, the letter 'f' right after the number 1, still constitutes a lexer error, because 'f' is not a punctuation character.

select 1,from mytable

From the point of view of the lexer, the statement above is valid. It is the parser which will balk at the presence of a comma after the number '1', but not the lexer. The comma is a valid punctuation character, and therefore, can follow right after a successful match.

 

The lexer program

In order to group letters in a character stream into a stream of words, we need:

  • a state transition table, which is different according to the grammar for which we tokenize.
  • a program -- a scanner -- that uses the state transition table to match the words in the character stream; the scanner is more or less the same for each grammar.


We can;

  • either generate the table separately through a transition table generator and use a separate scanner program;
  • or else create a program that generates the transition table on the fly and also implements the scanner.


The best solution would be a tokenizer that is as simple as possible, but still powerful enough to tokenize SQL correctly. Let's try to determine in a next blog post what such tokenizer could look like.

 


blog comments powered by Disqus
 
 
Joomla 1.5 Templates by Joomlashack