Rainforest

Sankuru

Implémenter, personaliser, étendre et réparer Joomla/Virtuemart

Views: 795

Nous vous aidons avec ...

Virtuemart
Joomfish
Autres extensions
SocialTwist Tell-a-Friend

Traduction automatique

English Arabic Chinese (Simplified) German Japanese Russian Spanish



Re-utilisons des sources libres

Les logiciels dont vous avez besoin, éxistent souvent déjà en source libre, et couvrent vos besoins à 80%. Nous ajouterons pour vous les 20% qui manquent.
Building a tokenizer for the EAV problem, part 3 PDF Imprimer E-mail
Note des utilisateurs: / 0
MauvaisTrès bien 
Écrit par erik   
Vendredi, 29 Mai 2009 04:01
There are no translations available.

Is parsing regular expressions hard?

Word rules must be reduced to atomic word rules for the lexer generator described in a previous post to be able to generate the state transitions. Of course, as demonstrated before, we can do that manually. The question now is whether it would take a lot of programming effort to do it automatically. Let's have a look.

A regular expression parser does not need a tokenizer, because all tokens are just one character long. The grammar for regular expressions is also relatively simple.

The first step consists in expanding the character classes. Letters preceded by the escape character '\', should not be treated as control characters. They are literal characters. In addition to that, if the parser sees a Kleene operator, it should replace it by its symbol. For example:

wordRule('a(z(bc)*x[0-3])+d?e'):

wR0=('a',PARENS_LEFT,'z',PARENS_LEFT,'b','c',PARENS_RIGHT,KLEENE_STAR,
           'x','0123',PARENS_RIGHT,KLEENE_PLUS,'d',KLEENE_OPT,'e')

 

Adding the rule states

The next step is to add the rule states <x>:

  • at the start of the word rule
  • after each character or character class


wR0=(<0>,'a',<1>,PARENS_LEFT,'z',<2>,PARENS_LEFT,'b',<3>,'c',<4>,PARENS_RIGHT,KLEENE_STAR,
    'x',<5>,'0123',<6>,PARENS_RIGHT,KLEENE_PLUS,'d',<7>,KLEENE_OPT,'e',<8>)
 

Parsing the embedded word rules

The next step is to parse embedded word rules. As soon as the parser sees a left parenthesis, it should start parsing an embedded word rule. As soon as it sees a right parenthesis, it should stop parsing the embedded word rule and continue parsing the parent word rule:

wR0=(<0>,'a',<1>,[wR1],KLEENE_PLUS,'d',<7>,KLEENE_OPT,'e',<8>)
wR1=('z',<2>,[wR2],KLEENE_STAR,'x',<5>,'0123',<6>)
wR2=('b',<3>,'c',<4>)

This is a regular expression syntax tree.

 

Reducing the Kleene operators

The next step is to reduce the Kleene operators:

Reduce KLEENE_PLUS in wR0 into wR01 and wR02:
wR01=(<0>,'a',<1>,[wR1],'d',<7>,KLEENE_OPT,'e',<8>)
wR02=(<0>,'a',<1>,[wR1],[wR1],'d',<7>,KLEENE_OPT,'e',<8>)

Reduce KLEENE_OPT in wR01 into wR011 and wR012:
wR011=(<0>,'a',<1>,[wR1],'e',<8>)
wR012=(<0>,'a',<1>,[wR1],'d',<7>,'e',<8>)
 

Reduce KLEENE_OPT in wR02 into wR021 and wR022:
wR021=(<0>,'a',<1>,[wR1],[wR1],'e',<8>)
wR022=(<0>,'a',<1>,[wR1],[wR1],'d',<7>,'e',<8>)

Reduce KLEENE_STAR in wR1 into wR11, wR12, and wR13:
wR11=('z',<2>,'x',<5>,'0123',<6>)
wR12=('z',<2>,[wR2],'x',<5>,'0123',<6>)
wR13=('z',<2>,[wR2],[wR2],'x',<5>,'0123',<6>)

The Kleene operators are now gone.

 

The syntax tree for the regular expression

The syntax tree for the regular expression has now become:

LEVEL 0:

wR011=(<0>,'a',<1>,[wR1],'e',<8>)
wR012=(<0>,'a',<1>,[wR1],'d',<7>,'e',<8>)
wR021=(<0>,'a',<1>,[wR1],[wR1],'e',<8>)
wR022=(<0>,'a',<1>,[wR1],[wR1],'d',<7>,'e',<8>)

LEVEL 1

wR11=('z',<2>,'x',<5>,'0123',<6>)
wR12=('z',<2>,[wR2],'x',<5>,'0123',<6>)
wR13=('z',<2>,[wR2],[wR2],'x',<5>,'0123',<6>)

LEVEL 2

wR2=('b',<3>,'c',<4>) 

 

Flattening the syntax tree

Now we will flatten the syntax tree. We remove the rules of the lowest level, level 2, by copying them verbatim into the rules of level 1:

REMOVING LEVEL 2

LEVEL 0:

wR011=(<0>,'a',<1>,[wR1],'e',<8>)
wR012=(<0>,'a',<1>,[wR1],'d',<7>,'e',<8>)
wR021=(<0>,'a',<1>,[wR1],[wR1],'e',<8>)
wR022=(<0>,'a',<1>,[wR1],[wR1],'d',<7>,'e',<8>)

LEVEL 1:

wR11=('z',<2>,'x',<5>,'0123',<6>)
wR12=('z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>)
wR13=('z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>)

wR1 in level 1 has 3 alternatives. This means that every rule in level 0 that contains wR1 will have 3 alternatives:

REMOVING LEVEL 1

LEVEL 0:

wR011-1=(<0>,'a',<1>,'z',<2>,'x',<5>,'0123',<6>,'e',<8>)
wR011-2=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'e',<8>)
wR011-3=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'e',<8>)

wR012-1=(<0>,'a',<1>,'z',<2>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)
wR012-2=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)
wR012-3=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)

wR021-1=(<0>,'a',<1>,'z',<2>,'x',<5>,'0123',<6>,'z',<2>,'x',<5>,'0123',<6>,'e',<8>)
wR021-2=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'e',<8>)
wR021-3=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,
'0123',<6>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'e',<8>)

wR022-1=(<0>,'a',<1>,'z',<2>,'x',<5>,'0123',<6>,'z',<2>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)
wR022-2=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'z',<2>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)
wR022-3=(<0>,'a',<1>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,
'0123',<6>,'z',<2>,'b',<3>,'c',<4>,'b',<3>,'c',<4>,'x',<5>,'0123',<6>,'d',<7>,'e',<8>)

We are now at level 0. We cannot remove any more levels.

The word rules above do no longer contain any Kleene operators or embedded word rules. Therefore, they are now atomic.

As we can see above, the Kleene (regular expression-) word rule, wR0: a(z(bc)*x[0-3])+d?e, can be replaced by 12 equivalent atomic word rules. From there, the atomic word rules can ordinarily participate in the lexer, along with word rules for keywords. The 12 atomic word rules all match wR0.
 

Unicode support: the ANY and OTHER character classes

Another issue, is the support for unicode in character classes. Nothing actually prevents the lexer from reading utf-8 character by utf-8 character. However, the ANY character class, symbolized by a '.', would require expanding it across the entire utf-8 range of characters. That may yield something in the area of 65535 characters. It would be onerous to list them all in the word rules. Therefore, we must avoid expanding the ANY character class.

A solution could be to create default state transitions. Imagine that we have 4 transitions departing from a particular state, for the characters 'a','q', '4', and ANY. In this context, ANY actually means OTHER than 'a', 'q', and '4'.

Therefore, if ANY is expected in a particular state, it will create the need for a state transition for the character class OTHER. I am going to try out this solution, in order to avoid problems with utf-8 character classes.

 

The RE compiler is a rewrite rule

Each valid regular expressions P, is a member of the Kleene program space, KR, of regular expressions.

A regular expression is a program complying to the grammar rules of the regular expression programming language RE.

Therefore, the function cRE, described above, is a compiler that projects the Kleene program space unto itself, in such a way that the collection of transitions T{P} for the regular expression P are equal to the collection of transitions for the rewritten expressions Pcre=T{cRE(P)}.

Pcre=cRE(P) as such that T{P}==T{cRE(P)}

Since the cRE(P) compiler compiles a program in RE language to an equivalent program in the same RE language, it is a rewrite rule. 

 

Programming effort

In my impression, the algorithm described above is not that onerous. It takes only reasonable effort to program it. Therefore, I think I will add support for the most popular regular expressions in the lexer:

  • The typical Kleene operators: (..)*  (..)+  (..)?
  • The OR operator: (..|..)


I think that should be sufficient enough to support most of the typical regular expression one could use to define a word grammar. 640 KB ought to be enough for anybody!

 


blog comments powered by Disqus
 
 
Joomla 1.5 Templates by Joomlashack