Rainforest

Sankuru

Implémenter, personaliser, étendre et réparer Joomla/Virtuemart

Views: 4291

Nous vous aidons avec ...

Virtuemart
Joomfish
Autres extensions
SocialTwist Tell-a-Friend

Traduction automatique

English Arabic Chinese (Simplified) German Japanese Russian Spanish



Re-utilisons des sources libres

Les logiciels dont vous avez besoin, éxistent souvent déjà en source libre, et couvrent vos besoins à 80%. Nous ajouterons pour vous les 20% qui manquent.

Devis gratuit

Demandez gratuitement un devis aujourd'hui.

Building a compiler to address the EAV problem PDF Imprimer E-mail
Note des utilisateurs: / 0
MauvaisTrès bien 
Écrit par erik   
Mardi, 26 Mai 2009 22:36
There are no translations available.

What does a compiler do?

A compiler translates a text written in one grammar to a functionally equivalent text in another grammar. For example, a C compiler will translate a text written in the C grammar to a functionally equivalent text in assembly grammar:

Text in C grammar:

int add(int i,int j)
{
    int p = i + j;
    return p;
}

Text in Assembly grammar:

.globl add
add:
    pushl %ebp
    movl %esp, %ebp
    subl $4, %esp
    movl 8(%ebp),%edx
    addl 12(%ebp), %edx
    movl %edx, -4(%ebp)
    movl -4(%ebp), %eax
    leave

Functionally equivalent, means that the output of the one program is the same as the output of the other program.

If P1 is a program in the program space PP(C) of programs written in C; P1 can be executed by a C machine.
If P2 is a program in the program space PP(ASM) of programs written in Assembler; P2 can be executed by an ASM machine.
If R(P1) is the output of the program P1;
If R(P2) is the output of the program P2;

The function cP(P) -- called a "C Compiler" -- projects the program space PP(C) unto the program space PP(ASM) in such a way that:  R(P)==R(cP(P)).

A C machine, would be called a "C interpreter". A C machine can certainly be constructed. However, we may assume that such machine would be slower than an assembler machine, since assembler matches the CPU instructions of a typical CPU almost one by one. In fact, the CPU of a computer can be considered to be a native assembler machine.

Traditionally, compilers will project programs in a higher-level program space, such as the C program space, to programs in a lower-level program space, usually, assembler. However, compilers can project any program space unto any other program space.

 

Tokenizing: Transforming the stream of characters into a stream of tokens

The following stream of characters:

int add(int i,int j)
{
    int p = i + j;
    return p;
}

Becomes the following stream of tokens:

KEYWORD        int
IDENTIFIER      add
KEYWORD        (
KEYWORD        int
IDENTIFIER      i
KEYWORD        ,
KEYWORD        int
IDENTIFIER      j
KEYWORD        )
KEYWORD        {
KEYWORD        int
IDENTIFIER     p
KEYWORD        =
IDENTIFIER     i
KEYWORD        +
IDENTIFIER     j
KEYWORD        ;
KEYWORD        return
IDENTIFIER     p
KEYWORD        ;
KEYWORD        }

KEYWORD: predefined word
IDENTIFIER: user-defined word
STRING
: literal alphanumerical constant, such as "hello"
NUMBER
: literal numerical constant, such as 12.45
COMMENT
: user-defined descriptive text; has no incidence on program execution; such as "John, 12/1/2007, This line solves bug 23A3."

A program than groups letters into words, and qualifies them as KEYWORD, IDENTIFIER, STRING, NUMBER, or COMMENT, is called a lexer. The process is called tokenizing.

The first step in compiling a program, consists in grouping the stream of letters in the source code into a stream of words. We will elaborate in a future blog post, how exactly a typical lexer works.

 

Parsing: Transforming the stream of tokens into a tree of sentences.

The token stream above now becomes:

FUNCTION
    SIGNATURE
        RETURN TYPE
            KEYWORD        int
        FUNCTION NAME
            IDENTIFIER    add
        PARAMETERS
            KEYWORD        (
            PARAMETER
                TYPE:    KEYWORD        int
                NAME:    IDENTIFIER    i
            KEYWORD        ,
            PARAMETER
                TYPE:    KEYWORD        int
                NAME:    IDENTIFIER    j
            KEYWORD        )
    BODY
        KEYWORD        {
            EXPRESSION
                PUSH_STACK:IDENTIFIER    i
                PUSH_STACK:IDENTIFIER    j
                SUM_STACK:KEYWORD    +
                TYPE_CONSTRAINT:KEYWORD    int
                POP_STACK:IDENTIFIER    p
            EXPRESSION
                PUSH_STACK: IDENTIFIER    p
                RETURN:    KEYWORD    return
        KEYWORD        }

There are two types of sentences: the declarative ones, aka the "administrative chrome", and the expressions. The declarative sentences mostly enforce the validity of expressions. They are not executable. The expressions themselves are sequences of stack-based instructions. The expressions are executable.

The parser program must do the following things:

  • Group the lexer tokens into a sentence tree. The top-level sentence (node) is usually called "Program" or "Unit" or something similar.
  • Reverse the expressions into reverse post-polish notation in order to make them stack executable. The reversal process is almost unavoidable, if only, because the validity of the expression depends on what types are present on the stack during every point of execution of the expression.
  • Enforce the administrative chrome of declarations by refusing to deal with expressions that are obviously invalid. For example, if f is declared a function int f(int a, int b) as taking 2 parameters, the compiler should refuse to deal with expressions such as f("abc",x,0) because it involves 3 parameters instead of 2; of which the first one is a string instead of a integer.

 

Stack-based execution

The stack is a last-in, first-out series of values. The push instruction will add a value to the stack. If the stack contains [a b c], push d, will make the stack contain [a b c d]. The pop instruction will remove a value from the stack. If the stack contains [a b c], pop, will make the stack contain [a b].

Every other instruction is to be understood as a function that removes a certain number of parameters from the stack and pushes its result unto the stack. For example, the "+" function will remove two parameters from the stack and push the sum of these two parameters unto the stack.

Expressions ordered in reverse polish notation, can be executed on a stack.

For example: the expression a+b+c becomes in reverse polish notation: a b + c +.

The sum function takes the two topmost elements from the stack and replaces them by their sum.

push a        stack: a
push b        stack: a b
call +        stack: a+b 
push c        stack: a+b c
call +        stack: a+b+c

Another example, the expression: r=log(f(a,b)/2)*3 - g(x,y,z)

push a        stack: a
push b        stack: a b
call f      stack: f(a,b)
push 2      stack: f(a,b) 2
call /      stack: f(a,b)/2
call log     stack: log(f(a,b)/2)
push 3        stack: log(f(a,b)/2) 3
call *        stack: log(f(a,b)/2)*3
push x        stack: log(f(a,b)/2)*3 x
push y        stack: log(f(a,b)/2)*3 x y
push z        stack: log(f(a,b)/2)*3 x y z
call g        stack: log(f(a,b)/2)*3 g(x,y,z)
call -        stack: log(f(a,b)/2)*3 - g(x,y,z)
pop r        stack: empty

 

Object-oriented extensions

The object expression a->f(x) is equivalent to the functional expression #(a,f){a,x}. The function f itself must be first retrieved using a function call #. The result of this call is a function again, which will be invoked with the object and its parameters as function arguments.

push a         stack: a
push x        stack: a x
push a        stack: a x a
push f         stack: a x a f
#          stack: a x 2 f() -- # recovers the function f applicable when applied to a
invoke        stack: f(a,x)    -- "invoke" calls the method on the top of the stack, with a given number [2] of parameters

 

Compiling from high-level to low-level languages

Translating a C expression from its stack-based representation into a stack-based assembler representation is relatively easy. Translating the declarative sentences in assembler is mostly unneeded. Assembler has very little power to validate declarations anyway. Assembler will happily execute semantically very invalid code. That is why the C compiler must check the validations with the C tokens before translating to assembler. In assembler, it would simply be too late to check them.

 

Compiling back to the same grammar

Usually, a compiler will project the one program space associated to language 1, PP(L1), into another program space associated to language 2, PP(L2), in such a way that the resulting program output remains the same, that is: R(P)==R(cL1L2(P)).

In the case of the EAV compiler that we want to construct, we need to project the SQL program space into itself, in such a way that the resulting program output remains the same: R(P)==R(qEAV(P)).

A compiler that projects a program space into itself, is actually a rewrite rule. It rewrites a program, that is, a series of sentences, written in a particular grammar, into an equivalent series of sentences, written in the same grammar.

 

Constructing the EAV compiler

In order to address the EAV problem, we need to create a minimal toolkit for compiler construction, that is, something that can do a minimum of tokenizing, and something that can do some minimal parsing. There are many such toolkits available, especially in C, but unfortunately not for direct use in Php.

Given the typical security measures enforced in hosting a LAMP stack, we cannot generally count on being able to invoke existing, native programs or native libraries that already implement compiler construction technology. For example, the traditional Lex, Yacc, and similar programs, are inaccessible from the typically secured LAMP stack.

There seems to have been some efforts going on in creating a Php lexer-parser generator, that generates Php code, but it apparently requires doing it all in C or Java. It also looks very much unsupported, and is therefore usable only as inspiration.

Therefore, in spite of my dislike for re-inventing the wheel, I am now being pushed into creating a minimal compiler construction toolkit. I will try to cut all features out of the toolkit, that are not strictly needed for building the EAV compiler, and just try to stick to the barebone essentials.

 


blog comments powered by Disqus
 
 
Joomla 1.5 Templates by Joomlashack