Rainforest

Sankuru

Softwarediensten voor Uw Joomla/Virtuemart projecten

Views: 350
SocialTwist Tell-a-Friend

Wij helpen met ...

Virtuemart
Joomfish
Andere extensies

Automatische vertaling

English Arabic Chinese (Simplified) German Japanese Russian Spanish



Hergebruik open source

Datgene wat U nodig hebt, bestaat vaak al, en dekt 80% van Uw behoeften. Wij zorgen voor de ontbrekende 20%.

Garantie

3 maanden gratis

Snel contact

Building a tokenizer for the EAV problem, part 3 PDF Afdrukken E-mail
Waardering: / 0
SlechtZeer goed 
Geschreven door erik   
vrijdag 29 mei 2009 04:01
There are no translations available.

Is parsing regular expressions hard?

Word rules must be reduced to atomic word rules for the lexer generator described in a previous post to be able to generate the state transitions. Of course, as demonstrated before, we can do that manually. The question now is whether it would take a lot of programming effort to do it automatically. Let's have a look.

A regular expression parser does not need a tokenizer, because all tokens are just one character long. The grammar for regular expressions is also relatively simple.

The first step consists in expanding the character classes. Letters preceded by the escape character '\', should not be treated as control characters. They are literal characters. In addition to that, if the parser sees a Kleene operator, it should replace it by its symbol. For example:

wordRule('a(z(bc)*x[0-3])+d?e'):

wR0=('a',PARENS_LEFT,'z',PARENS_LEFT,'b','c',PARENS_RIGHT,KLEENE_STAR,
           'x','0123',PARENS_RIGHT,KLEENE_PLUS,'d',KLEENE_OPT,'e')

 

Adding the rule states

The next step is to add the rule states <x>:

  • at the start of the word rule
  • after each character or character class


wR0=(<0>,'a',<1>,PARENS_LEFT,'z',<2>,PARENS_LEFT,'b',<3>,'c',<4>,PARENS_RIGHT,KLEENE_STAR,
    'x',<5>,'0123',<6>,PARENS_RIGHT,KLEENE_PLUS,'d',<7>,KLEENE_OPT,'e',<8>)
 

Parsing the embedded word rules

The next step is to parse embedded word rules. As soon as the parser sees a left parenthesis, it should start parsing an embedded word rule. As soon as it sees a right parenthesis, it should stop parsing the embedded word rule and continue parsing the parent word rule:

wR0=(<0>,'a',<1>,[wR1],KLEENE_PLUS,'d',<7>,KLEENE_OPT,'e',<8>)
wR1=('z',<2>,[wR2],KLEENE_STAR,'x',<5>,'0123',<6>)
wR2=('b',<3>,'c',<4>)

This is a regular expression syntax tree.

 

Reducing the Kleene operators

The next step is to reduce the Kleene operators:

Reduce KLEENE_PLUS in wR0 into wR01 and wR02:
wR01=(<0>,'a',<1>,[wR1],'d',<7>,KLEENE_OPT,'e',<8>)
wR02=(<0>,'a',<1>,[wR1],[wR1],'d',<7>,KLEENE_OPT,'e',<8>)

Reduce KLEENE_OPT in wR01 into wR011 and wR012:
wR011=(<0>,'a',<1>,[wR1],'e',<8>)
wR012=(<0>,'a',<1>,[wR1],'d',<7>,'e',<8>)
 

Reduce KLEENE_OPT in wR02 into wR021 and wR022:
wR021=(<0>,'a',<1>,[wR1],[wR1],'e',<8>)
wR022=(<0>,'a',<1>,[wR1],[wR1],'d',<7>,'e',<8>)

Reduce KLEENE_STAR in wR1 into wR11, wR12, and wR13:
wR11=('z',<2>,'x',<5>,'0123',<6>)
wR12=('z',<2>,[wR2],'x',<5>,'0123',<6>)
wR13=('z',<2>,[wR2],[wR2],'x',<5>,'0123',<6>)

The Kleene operators are now gone.

 

The syntax tree for the regular expression

The syntax tree for the regular expression has now become:

LEVEL 0:

wR011=(<0>,'a',<1>,[wR1],'e',<8>)
wR012=(<0>,'a',<1>,[wR1],'d',<7>,'e',<8>)
wR021=(<0>,'a',<1>,[wR1],[wR1],'e',<8>)
wR022=(<0>,'a',<1>,[wR1],[wR1],'d',<7>,'e',<8>)

LEVEL 1

wR11=('z',<2>,'x',<5>,'0123',<6>)
wR12=('z',<2>,[wR2],'x',<5>,'0123',<6>)
wR13=('z',<2>,[wR2],[wR2],'x',<5>,'0123',<6>)

LEVEL 2

wR2=('b',<3>,'c',<4>) 

 

Flattening the syntax tree

Now we will flatten the syntax tree. We remove the rules of the lowest level, level 2, by copying them verbatim into the rules of level 1:

REMOVING LEVEL 2

LEVEL 0:

wR011=(<0>,'a',<1>,[wR1],'e',<8>)
wR012=(<0>,'a',<1>,[wR1],'d',<7>,'e',<8>)
wR021=(<0>,'a',<1>,[wR1],[wR1],'e',<8>)
wR022=(<0>,'a',<1>,[wR1],[wR1],'d',<7>,'e',<8>)

LEVEL 1:

wR11=('z',<2>,'x',<5>,'0123',<6>)
wR12=('z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>)
wR13=('z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>)

wR1 in level 1 has 3 alternatives. This means that every rule in level 0 that contains wR1 will have 3 alternatives:

REMOVING LEVEL 1

LEVEL 0:

wR011-1=(<0>,'a',<1>,'z',<2>,'x',<5>,'0123',<6>,'e',<8>)
wR011-2=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'e',<8>)
wR011-3=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'e',<8>)

wR012-1=(<0>,'a',<1>,'z',<2>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)
wR012-2=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)
wR012-3=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)

wR021-1=(<0>,'a',<1>,'z',<2>,'x',<5>,'0123',<6>,'z',<2>,'x',<5>,'0123',<6>,'e',<8>)
wR021-2=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'e',<8>)
wR021-3=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,
'0123',<6>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'e',<8>)

wR022-1=(<0>,'a',<1>,'z',<2>,'x',<5>,'0123',<6>,'z',<2>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)
wR022-2=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)
wR022-3=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,
'0123',<6>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)

We are now at level 0. We cannot remove any more levels.

The word rules above do no longer contain any Kleene operators or embedded word rules. Therefore, they are now atomic.

As we can see above, the Kleene (regular expression-) word rule, wR0: a(z(bc)*x[0-3])+d?e, can be replaced by 12 equivalent atomic word rules. From there, the atomic word rules can ordinarily participate in the lexer, along with word rules for keywords. The 12 atomic word rules all match wR0.
 

Unicode support: the ANY and OTHER character classes

Another issue, is the support for unicode in character classes. Nothing actually prevents the lexer from reading utf-8 character by utf-8 character. However, the ANY character class, symbolized by a '.', would require expanding it across the entire utf-8 range of characters. That may yield something in the area of 65535 characters. It would be onerous to list them all in the word rules. Therefore, we must avoid expanding the ANY character class.

A solution could be to create default state transitions. Imagine that we have 4 transitions departing from a particular state, for the characters 'a','q', '4', and ANY. In this context, ANY actually means OTHER than 'a', 'q', and '4'.

Therefore, if ANY is expected in a particular state, it will create the need for a state transition for the character class OTHER. I am going to try out this solution, in order to avoid problems with utf-8 character classes.

 

The RE compiler is a rewrite rule

Each valid regular expressions P, is a member of the Kleene program space, KR, of regular expressions.

A regular expression is a program complying to the grammar rules of the regular expression programming language RE.

Therefore, the function cRE, described above, is a compiler that projects the Kleene program space unto itself, in such a way that the collection of transitions T{P} for the regular expression P are equal to the collection of transitions for the rewritten expressions Pcre=T{cRE(P)}.

Pcre=cRE(P) as such that T{P}==T{cRE(P)}

Since the cRE(P) compiler compiles a program in RE language to an equivalent program in the same RE language, it is a rewrite rule. 

 

Programming effort

In my impression, the algorithm described above is not that onerous. It takes only reasonable effort to program it. Therefore, I think I will add support for the most popular regular expressions in the lexer:

  • The typical Kleene operators: (..)*  (..)+  (..)?
  • The OR operator: (..|..)


I think that should be sufficient enough to support most of the typical regular expression one could use to define a word grammar. 640 KB ought to be enough for anybody!

 


blog comments powered by Disqus
 

Virtuemart

Gebruik Virtuemart om Uw producten online te promoten en te verkopen.

Joomfish

Gebruik Joomfish om ook in het Engels en andere vreemde talen te publiceren en een internationale klantenbasis op te bouwen.
Joomla 1.5 Templates by Joomlashack