A simple bytecode compiler with virtual machine, written in Php, for the EL language
- Details
- Published on Wednesday, 28 December 2011 08:26
- Hits: 4910
1. Download the ExpressionLang (EL) bytecode compiler and virtual machine
2. Lexing and parsing in Php
In my previous blog posts, I demonstrated how we can use the builtin PCRE library, to create a lexer in Php. I also showed how to use Bison-generated LALR1 parser tables in Php. In this blog post, I will re-use these lexing and parsing facilities to compile EL programs from within Php.
3. An EL example program
EL is a simplistic language that can only contain expressions. For example:
Show/Hidden php code/* This is an expression-lang program */ var1=33; var2=45; $gdp=-9*1.1-1; var3=(var1+var2)/var2; println('var1 is:'.var1); println('var2 is:'.var2); println('var3 is:'.var3); start=1+var1^3; println('start is:'.start); println(4); println('some string'); //just print something x=var1+var2; println('x='.x); println(5+3-8+2/2-1+19); println("a"."b"); y=var1||var2; println('y='.y);
EL can handle:
- expressions: e.g. a*b/c-12
- assignments: e.g. g=x+5/y
- function calls: e.g. println(f(x))
In EL, you can call any pre-existing function, but you cannot define new functions. You can only call functions previously registered inside the VM.
4. The EL Bison grammar
The formal Bison grammar for EL is:
Show/Hidden php code%token /* punctuation */ SEMICOLON COMMA %token /* brackets */ BRACKET_LEFT BRACKET_RIGHT %token /* operators */ ASSIGN PLUS MINUS MULTIPLY DIVIDE EXPONENTIATE MODULO CONCAT %token /* condition operators */ EQ NE LT GT LE GE NOT %token /* condition operators */ OR AND %token /* constants */ IDENTIFIER NUMBER %token /* constants */ STRING %token /* constants */ TRUE FALSE NULL /* arithmetic priorities */ %left ASSIGN %left OR AND %left EQ NE LT GT LE GE NOT %left PLUS MINUS %left MULTIPLY DIVIDE MODULO %left CONCAT %left UNARY %right EXPONENTIATE %% script: statements; script : ; statements: statements statement; statements: statement; /* expression statement */ statement: expression_statement; expression_statement: expression SEMICOLON; /* expression */ expression: BRACKET_LEFT expression BRACKET_RIGHT; expression: expression PLUS expression; expression: expression MINUS expression; expression: expression DIVIDE expression; expression: expression MULTIPLY expression; expression: expression EXPONENTIATE expression; expression: expression MODULO expression; expression: expression CONCAT expression; expression: MINUS expression %prec UNARY; expression: expression LT expression; expression: expression GT expression; expression: expression LE expression; expression: expression GE expression; expression: expression EQ expression; expression: expression NE expression; expression: expression AND expression; expression: expression OR expression; expression: expression ASSIGN expression; expression: IDENTIFIER; expression: STRING; expression: NUMBER; expression: TRUE; expression: FALSE; expression: NULL; expression: function_expression; function_expression: IDENTIFIER BRACKET_LEFT arguments BRACKET_RIGHT; arguments: arguments COMMA argument; arguments: argument; arguments: ; argument: expression; %%
5. Stack-based execution of instructions
With the PCRE lexer and the Bison parser, we could already build the compiler's front end. Now we will build the compiler's back end. First, let us look at how a simple virtual stack machine works. For the EL language, we only need three virtual machine instructions:
- PUSH_OPERAND. Push an operand on the stack.
- EXECUTE_OPERATOR. Pop the number of arguments required from the stack; execute the operator's function; push the result back on the stack.
- RESET_STACK. After completing the evaluation of an expression, remove everything from the stack, so that the next expression to evaluate can start with a clean stack.
I will illustrate how the EL stack machine works with simple examples. Our machine has a stack and a hashtable with variables. Initially, the stack is empty, and the memory contains variable b with value 8.
Show/Hidden php code
Now we want to evaluate the following expression:
Show/Hidden php code
The first thing to do, is to transform the expression in into reverse Polish notation (RPN), also called postfix notation. Our expression becomes:
Show/Hidden php code
The RPN instructions to execute this expression on the stack, become:
Show/Hidden php code
Let us look at how the stack and the variables evolve with each instruction:
Show/Hidden php codeinstruction stack variables [ ] {b:8} push a [ a ] {b:8} push b [ a 8 ] {b:8} push 3 [ a 8 3 ] {b:8} exec + [ a 11 ] {b:8} take the last two elements from the stack and replace them by their sum exec = [ ] {a:11 b:8} take the last two elements from the stack and assign the second value to the first name
The Bison parser will automatically reduce the rules for the operands and operators in the correct RPN order, by using the priority rules in the EL grammar file. Therefore, we can write out our push and execute instructions in the same order as the parser will call back our program. A function call is also just an operator. For example:
Show/Hidden php code
Transformed to RPN:
Show/Hidden php code
Let us look at how the stack and the variables evolve with each instruction:
Show/Hidden php codeinstruction stack variables [ ] {a:11 b:8 c:4 f:<func> y:3} push x [ x ] {a:11 b:8 c:4 f:<func> y:3} push a [ x 11 ] {a:11 b:8 c:4 f:<func> y:3} push y [ x 11 3 ] {a:11 b:8 c:4 f:<func> y:3} exec + [ x 14 ] {a:11 b:8 c:4 f:<func> y:3} push b [ x 14 8 ] {a:11 b:8 c:4 f:<func> y:3} push c [ x 14 8 4 ] {a:11 b:8 c:4 f:<func> y:3} push f [ x 14 8 4 <func> ] {a:11 b:8 c:4 f:<func> y:3} exec invoke(4) [ 12 ] {a:11 b:8 c:4 f:<func> y:3} take the last element from the stack as the function to call, and then call it with the last 3 values on the stack as arguments. After completing the function call, put the result on the stack
5. Associating callback functions to rules reduced by the parser
In standard Bison practice, we can associate statements in C/C++ next to the rule. The C/C++ parser will invoke the statements in curly braces after reducing a rule. The numbered variables $1, $2, and so on, represent tokens reduced in the right-hand side of the rule:
Show/Hidden php codeexp: NUM { $$ = $1; } //rule 1 | exp PLUS exp { $$ = $1 + $3; } //rule 2 | exp MINUS exp { $$ = $1 - $3; } //rule 3 | exp MULTIPLY exp { $$ = $1 * $3; } //rule 4 | exp DIVIDE exp { $$ = $1 / $3; } //rule 5 | MINUS exp %prec NEG { $$ = -$2; } //rule 6 | exp EXPONENTIATE exp { $$ = pow ($1, $3); } //rule 7 | BRACKET_LEFT exp BRACKET_RIGHT { $$ = $2; } //rule 8
We will, however, associate Php statements, to be called after reducing each rule, instead of C/C++ statements. This cannot be done in the Bison grammar itself, because Php statements are not valid C/C++ code. Since each rule has its own sequential number, we could use the rule numbers to associate a Php callback function to a particular rule. However, it would quickly lead to unmaintainable situations. As soon as we change the grammar, and add or remove a rule somewhere, the entire numbering would start moving; and all our associations would become wrong.
So, I came up with another solution. We will associate rules to callback functions by matching a rule for a feature that is unique to the rule.
-
- associateRuleNumberForRHSSymbol($symbol). This method is appropriate, if the rule is the only one for a particular righthand-side symbol. For example, rule 2 is the only one having a PLUS symbol on the righthand site. This rule's name automatically becomes 'PLUS'.
- associateRuleNumberForLHSSymbol($ruleName,$lhsSymbol). This method is appropriate, if a particular lefthand-side symbol only appears one time on the left. In the example Bison grammar above, there is no such rule. They all have the 'exp' symbol as lefthand-side symbol.
- associateRuleNumberForRHSSymbolAndNumberOfTerms($ruleName,$symbol,$numberOfTerms). This method is appropriate, if the rule has a particular righthand-side symbol as well as a given total number of righthand-side symbols. For example, rule 6 has a MINUS symbol and a total of 2 terms, while rule 2 also has a MINUS symbol, but a total of 3 terms.
- associateRuleNumberForLHSAndRHSSymbols($ruleName,$lhsSymbol,$rhsSymbol). This method is appropriate, when there is a unique combination of the lefthand-side symbol and one righthand-side symbol, as opposed to the righthand-side symbol taken alone, being unique. In the example Bison grammar above, there is no such rule.
After naming the rules uniquely with a rule name, We can now associate callback functions to the unique rule names. For example:
Show/Hidden php code
The parser will call back the callback function with a reference to itself ($me) along with the righthand-side tokens in the rule reduced ($stackStates). The righthand side for the unary minus rule is: MINUS exp. Therefore, the operator can be found in the first position of the righthand side, that is, $stackStates[0] (counting from zero). The compiler will then emit a new EXEC_OPERATOR instruction from this token, indicating that it only takes one item from the expression stack.
For binary operators, that is, operators that pop two items from the stack, the compiler will emit an EXEC_OPERATOR instruction for the token. Examples of binary operators are the PLUS, MINUS, MULTIPLY operators:
Show/Hidden php code
For function calls, the parser needs to indicate the number of arguments in the function call, for the VM to know how many arguments it should pop from the stack, when executing the function call:
Show/Hidden php code$this->registerNamedReduceCallback( 'FUNCTION_CALL', function($me,$stackStates) { $token=$stackStates[0]->token; $argCount=$me->calcArgCount($stackStates); $me->emitOperand($stackStates); $instruction=Instruction::fromToken( $token, 'EXEC_OPERATOR', $argCount+1); $instruction->symbol='FUNCTION_CALL'; $instruction->value=''; $me->emit($instruction); } ); }
Since the $stackStates variable contains the entire parse tree for the righthand-side of the rule, the parser can descend it, and count the number of times that the rule with lefthand-side 'argument' appears in it. The number of arguments that the FUNCTION_CALL operator needs to take from the stack, is the number of function arguments plus one, for the function name itself.
6. The bytecode produced
In this example, we emit the bytecode, that is, the intermediate language, in plain text. Compressing the intermediate language into actual bytecode is mainly done for size and speed. In our case, we prefer bytecode that is easy to read. Therefore, we leave it in text format.
There are only three bytecode instructions in the EL language: PUSH_OPERAND, EXEC_OPERATOR, and RESET_STACK. You can compile an EL program to bytecode with the elc compiler:
Show/Hidden bash code
The bytecode for the main example:
Show/Hidden php code---- |----|------ |--- |---|---|--- |----- code |args|symbol |pos |len|lin|col |value ---- |----|------ |--- |---|---|--- |----- PUSH_OPERAND | |IDENTIFIER |37 |4 |2 |1 |var1 PUSH_OPERAND | |NUMBER |42 |2 |2 |6 |33 EXEC_OPERATOR |2 |ASSIGN |41 |1 |2 |5 |= RESET_STACK | |SEMICOLON |44 |1 |2 |8 |; PUSH_OPERAND | |IDENTIFIER |46 |4 |3 |1 |var2 PUSH_OPERAND | |NUMBER |51 |2 |3 |6 |45 EXEC_OPERATOR |2 |ASSIGN |50 |1 |3 |5 |= RESET_STACK | |SEMICOLON |53 |1 |3 |8 |; PUSH_OPERAND | |IDENTIFIER |55 |4 |4 |1 |$gdp PUSH_OPERAND | |NUMBER |61 |1 |4 |7 |9 EXEC_OPERATOR |1 |UNARY_MINUS |60 |1 |4 |6 |- PUSH_OPERAND | |NUMBER |63 |3 |4 |9 |1.1 EXEC_OPERATOR |2 |MULTIPLY |62 |1 |4 |8 |* PUSH_OPERAND | |NUMBER |67 |1 |4 |13 |1 EXEC_OPERATOR |2 |MINUS |66 |1 |4 |12 |- EXEC_OPERATOR |2 |ASSIGN |59 |1 |4 |5 |= RESET_STACK | |SEMICOLON |68 |1 |4 |14 |; PUSH_OPERAND | |IDENTIFIER |70 |4 |5 |1 |var3 PUSH_OPERAND | |IDENTIFIER |76 |4 |5 |7 |var1 PUSH_OPERAND | |IDENTIFIER |81 |4 |5 |12 |var2 EXEC_OPERATOR |2 |PLUS |80 |1 |5 |11 |+ PUSH_OPERAND | |IDENTIFIER |87 |4 |5 |18 |var2 EXEC_OPERATOR |2 |DIVIDE |86 |1 |5 |17 |/ EXEC_OPERATOR |2 |ASSIGN |74 |1 |5 |5 |= RESET_STACK | |SEMICOLON |91 |1 |5 |22 |; PUSH_OPERAND | |STRING |101 |10 |6 |9 |var1 is: PUSH_OPERAND | |IDENTIFIER |112 |4 |6 |20 |var1 EXEC_OPERATOR |2 |CONCAT |111 |1 |6 |19 |. PUSH_OPERAND | |IDENTIFIER |93 |7 |6 |1 |println EXEC_OPERATOR |2 |FUNCTION_CALL |93 |0 |6 |1 | RESET_STACK | |SEMICOLON |117 |1 |6 |25 |; PUSH_OPERAND | |STRING |127 |10 |7 |9 |var2 is: PUSH_OPERAND | |IDENTIFIER |138 |4 |7 |20 |var2 EXEC_OPERATOR |2 |CONCAT |137 |1 |7 |19 |. PUSH_OPERAND | |IDENTIFIER |119 |7 |7 |1 |println EXEC_OPERATOR |2 |FUNCTION_CALL |119 |0 |7 |1 | RESET_STACK | |SEMICOLON |143 |1 |7 |25 |; PUSH_OPERAND | |STRING |153 |10 |8 |9 |var3 is: PUSH_OPERAND | |IDENTIFIER |164 |4 |8 |20 |var3 EXEC_OPERATOR |2 |CONCAT |163 |1 |8 |19 |. PUSH_OPERAND | |IDENTIFIER |145 |7 |8 |1 |println EXEC_OPERATOR |2 |FUNCTION_CALL |145 |0 |8 |1 | RESET_STACK | |SEMICOLON |169 |1 |8 |25 |; PUSH_OPERAND | |IDENTIFIER |171 |5 |9 |1 |start PUSH_OPERAND | |NUMBER |177 |1 |9 |7 |1 PUSH_OPERAND | |IDENTIFIER |179 |4 |9 |9 |var1 PUSH_OPERAND | |NUMBER |184 |1 |9 |14 |3 EXEC_OPERATOR |2 |EXPONENTIATE |183 |1 |9 |13 |^ EXEC_OPERATOR |2 |PLUS |178 |1 |9 |8 |+ EXEC_OPERATOR |2 |ASSIGN |176 |1 |9 |6 |= RESET_STACK | |SEMICOLON |185 |1 |9 |15 |; PUSH_OPERAND | |STRING |195 |11 |10 |9 |start is: PUSH_OPERAND | |IDENTIFIER |207 |5 |10 |21 |start EXEC_OPERATOR |2 |CONCAT |206 |1 |10 |20 |. PUSH_OPERAND | |IDENTIFIER |187 |7 |10 |1 |println EXEC_OPERATOR |2 |FUNCTION_CALL |187 |0 |10 |1 | RESET_STACK | |SEMICOLON |213 |1 |10 |27 |; PUSH_OPERAND | |NUMBER |223 |1 |11 |9 |4 PUSH_OPERAND | |IDENTIFIER |215 |7 |11 |1 |println EXEC_OPERATOR |2 |FUNCTION_CALL |215 |0 |11 |1 | RESET_STACK | |SEMICOLON |225 |1 |11 |11 |; PUSH_OPERAND | |STRING |235 |13 |12 |9 |some string PUSH_OPERAND | |IDENTIFIER |227 |7 |12 |1 |println EXEC_OPERATOR |2 |FUNCTION_CALL |227 |0 |12 |1 | RESET_STACK | |SEMICOLON |249 |1 |12 |23 |; PUSH_OPERAND | |IDENTIFIER |283 |1 |13 |1 |x PUSH_OPERAND | |IDENTIFIER |285 |4 |13 |3 |var1 PUSH_OPERAND | |IDENTIFIER |290 |4 |13 |8 |var2 EXEC_OPERATOR |2 |PLUS |289 |1 |13 |7 |+ EXEC_OPERATOR |2 |ASSIGN |284 |1 |13 |2 |= RESET_STACK | |SEMICOLON |294 |1 |13 |12 |; PUSH_OPERAND | |STRING |304 |4 |14 |9 |x= PUSH_OPERAND | |IDENTIFIER |309 |1 |14 |14 |x EXEC_OPERATOR |2 |CONCAT |308 |1 |14 |13 |. PUSH_OPERAND | |IDENTIFIER |296 |7 |14 |1 |println EXEC_OPERATOR |2 |FUNCTION_CALL |296 |0 |14 |1 | RESET_STACK | |SEMICOLON |311 |1 |14 |16 |; PUSH_OPERAND | |NUMBER |322 |1 |16 |9 |5 PUSH_OPERAND | |NUMBER |324 |1 |16 |11 |3 EXEC_OPERATOR |2 |PLUS |323 |1 |16 |10 |+ PUSH_OPERAND | |NUMBER |326 |1 |16 |13 |8 EXEC_OPERATOR |2 |MINUS |325 |1 |16 |12 |- PUSH_OPERAND | |NUMBER |328 |1 |16 |15 |2 PUSH_OPERAND | |NUMBER |330 |1 |16 |17 |2 EXEC_OPERATOR |2 |DIVIDE |329 |1 |16 |16 |/ EXEC_OPERATOR |2 |PLUS |327 |1 |16 |14 |+ PUSH_OPERAND | |NUMBER |332 |1 |16 |19 |1 EXEC_OPERATOR |2 |MINUS |331 |1 |16 |18 |- PUSH_OPERAND | |NUMBER |334 |2 |16 |21 |19 EXEC_OPERATOR |2 |PLUS |333 |1 |16 |20 |+ PUSH_OPERAND | |IDENTIFIER |314 |7 |16 |1 |println EXEC_OPERATOR |2 |FUNCTION_CALL |314 |0 |16 |1 | RESET_STACK | |SEMICOLON |337 |1 |16 |24 |; PUSH_OPERAND | |STRING |348 |3 |18 |9 |a PUSH_OPERAND | |STRING |352 |3 |18 |13 |b EXEC_OPERATOR |2 |CONCAT |351 |1 |18 |12 |. PUSH_OPERAND | |IDENTIFIER |340 |7 |18 |1 |println EXEC_OPERATOR |2 |FUNCTION_CALL |340 |0 |18 |1 | RESET_STACK | |SEMICOLON |356 |1 |18 |17 |; PUSH_OPERAND | |IDENTIFIER |359 |1 |20 |1 |y PUSH_OPERAND | |IDENTIFIER |361 |4 |20 |3 |var1 PUSH_OPERAND | |IDENTIFIER |367 |4 |20 |9 |var2 EXEC_OPERATOR |2 |OR |365 |2 |20 |7 ||| EXEC_OPERATOR |2 |ASSIGN |360 |1 |20 |2 |= RESET_STACK | |SEMICOLON |371 |1 |20 |13 |; PUSH_OPERAND | |STRING |381 |4 |21 |9 |y= PUSH_OPERAND | |IDENTIFIER |386 |1 |21 |14 |y EXEC_OPERATOR |2 |CONCAT |385 |1 |21 |13 |. PUSH_OPERAND | |IDENTIFIER |373 |7 |21 |1 |println EXEC_OPERATOR |2 |FUNCTION_CALL |373 |0 |21 |1 | RESET_STACK | |SEMICOLON |388 |1 |21 |16 |;
Now we just need a virtual machine to execute the bytecode.
7. The Virtual Machine
The Virtual Machine (VM) needs to read the bytecode, line by line, and then execute its instructions.
- PUSH_OPERAND: In our VM, the address of an operand, is represented by its name in the 'variables' hashtable. When pushing an operand, the VM does not know if it will need its value or its address, when later on, it pops it again from the stack. Therefore, the VM always attempt to push both address and value. However, there may be no address, or there may be no value to push on the stack. This is not a problem in itself, but could become one. If the operator needs a value, and the stack item only contains an address, then there is a problem. In that case, we are trying to use the value of an uninitialized variable. If the operator needs an address, and there is only a value. There is also a problem. We are trying to assign to an r-value. For example: the assignment, 3=a, is illegal, because 3 has no address.
- EXEC_OPERATOR: In our VM, executing an operator, will usually require to pop a given number of arguments from the stack. We will always check first, that the stack has enough items left.
Most operators are binary operators. For example:
Show/Hidden php code
The VM will pop two arguments from the stack, and call the callback function for the 'PLUS' operator with the two values. The function will then execute the operator's function, and push the result on the stack (pushResult()).
There are two instructions that are slightly more complex. The ASSIGN instruction will pop two items from the stack and attempt to assign $item2->value2 to $item1->name. If $item2->value does not exist, it means that the variable has a name, but no value yet. In that case, we trying to assign an uninitialized variable. In the case that $item1->name does not exist, it is not an lvalue, but an rvalue, that we are dealing with. We validate the items popped from the stack for these potential problems:
Show/Hidden php code$this->registerOperatorCallback('ASSIGN', function ($vm,$vmInstruction) { //get two items from the stack $stackItem2=$vm->popItem($vmInstruction); $stackItem1=$vm->popItem($vmInstruction); //item 1 $hasName1=$stackItem1->hasName; $name1=$stackItem1->name; $hasValue1=$stackItem1->hasValue; $value1=$stackItem1->value; //item 2 $hasName2=$stackItem2->hasName; $name2=$stackItem2->name; $hasValue2=$stackItem2->hasValue; $value2=$stackItem2->value; //fix the error message (give a name value2 if possible) if($hasValue2) $msg2=" or '$value2' "; else $msg2=''; //item 1 must have a name if(!$hasName1) $vm->error($vmInstruction, "Cannot assign value of '$name2'$msg2". " to variable '$value1'"); //item 2 must have a value if(!$hasValue2) $vm->error($vmInstruction, "Undefined value for variable '$name2'"); //now we can assign $vm->variables[$name1]=$value2; } );
The FUNCTION_CALL operator can call any function that has been registered inside the VM:
Show/Hidden php code$this->registerOperatorCallback('FUNCTION_CALL', function ($vm,$vmInstruction) { //get function name off the stack $functionName=$vm->popItem($vmInstruction)->name; //now pop the correct number of function arguments from the stack $args=array(); for($i=1; $args; $i++) { $arg=$vm->popItem($vmInstruction)->value; array_unshift($args,$arg); } //if the function exists, call it with the arguments retrieved //otherwise, it's an undefined function if(array_key_exists($functionName,$vm->variables)) { $function=$vm->variables[$functionName]; $vm->pushResult($function($vm,$vmInstruction,$args)); } else { $vm->error($vmInstruction, "Undefined function '$functionName'"); } } );
I only registered 'println' as a built-in function:
Show/Hidden php code
For the 'println' function, the VM needs to check that the number of arguments is exactly one. Further, a function must always return a value; even if it will be thrown away. Imagine a function does not push any return value on the stack. For example, f(x) does not return any value, but is still accidentally used in the following expression: g(f(x)). When calling the function g(), the VM will take an argument from the stack, but it will not be f's function result. It may be anything. It could also lead to attempting to pop from an empty stack.
As you can see, the 'println' function does not do much. It only echoes its argument to stdout.
A function is just a variable, just like any other variable identified by an identifier. Indeed, in order to know what function to call, the VM must get its name off the stack first. This principle allows the functions registered with the VM to be first-class functions. Which function to call, can therefore be the result of a calculation. This is exactly what is needed, to implement, for example, object-orientation.
The main loop in the VM keeps executing instruction after instruction, until it is done:
Show/Hidden php codefunction execVMInstructions() { $this->resetStack(); foreach($this->vmInstructions as $vmInstruction) { $code=$vmInstruction->code; if(array_key_exists($code,$this->instructionCallbacks)) { $function=$this->instructionCallbacks[$code]; $function($this,$vmInstruction); } else { $this->error($vmInstruction, "Unknown IL instruction code: '$code'"); } } }
In the implementation that you can download at the top of this blog post, you can use the vm and el scripts to execute EL bytecode:
Show/Hidden php code
8. Conclusion
The EL virtual machine is a relatively slow virtual machine built on top of Php's virtual machine. It can only execute simple expressions. Time permitting, in my next blog post, I will add support for 'if' and 'while' statements as well as for array variables to the EL compiler and VM.



