The FOR looping statement. A semantic compiler plugin
- Details
- Published on Monday, 09 Jan 2012
- Hits: 1802
1. Download the simple compiler and virtual machine in Php, with support for the FOR statement
2. Recapitulation of previous blog posts
Let's continue to build our little bytecode compiler in Php, along with a small virtual stack machine to execute the resulting bytecode. In previous blog posts we dealt with the following issues:
- A pluggable compiler and virtual machine in Php. Dogfood yourself with your own plugins!
- Adding support for if, while, and do while, to a simple compiler and virtual machine written in Php
- A simple bytecode compiler with virtual machine, written in Php, for the EL language
- Using Bison as a parser generator for Php
- Using the PCRE library as a lexer in Php
3. Example of a FOR statement
The traditional FOR statement looks like this. Example:
The statement loops for the number of times specified in the for-clause expressions over an instruction or a block of instructions. The FOR statement was originally introduced by the C language. All its descendents seem to have it: C++, Java, C#, Php, and so on. In this blog post, we will be implementing a semantic compiler plugin for the FOR statement.
I do not know of any other compiler that is pluggable. Therefore, I cannot compare on how to add support for a builtin language element in other compilers. Pluggability is only a requirement for this compiler. It may not be one for other compilers.
3. The FOR statement could also be a function
Before implementing the FOR statement using our plugin system I would like to underline that the IF, WHILE, DO WHILE, and FOR statements could simply be functions. There is no real technical need to build these into language. The corresponding functions would look like this:
The statements embedded inside the IF, WHILE, DO WHILE, and FOR statements would simply be anonymous functions. For example:
This would be a more LISP-like implementation for the IF statement. This approach could be more palatable to programmers not used to using anonymous functions, aka, lambdas, if it did not cause so much syntactic overhead, mostly from the way in which anonymous functions are typically written:
The expression could also be simplified to something like:
Or something similar. The result would look like:
By doing this, however, we would really stray from current programming practices. We should probably not comtemplate turning the IF, WHILE, DO WHILE, and FOR statements into functions, as McCarthy did for LISP, because it would probably incur the wrath of the majority of the programmer community who are used to built-in language constructs for these statements.
The largest source for the LISP's unpopularity, however, is undoubtedly its unorthodox prefix notation:
What the virtual stack machine needs, is postfix. What we are used to, is infix. What LISP does, is prefix. Math at school is based on infix. Therefore, few people agree to write expressions in prefix or postfix notation. We must indeed stay compatible with all the reasonable and also unreasonable choices, and sometimes even all the horrors of the past. I think we will never adopt prefix notation.
Another reason for LISP's relative unpopularity is the way in which it strays from the C-style syntax even though LISP was there before C. Nowadays, the vast majority of programmers is used to languages with a C-style syntax. Straying from the C-style syntax seems to be one of the biggest no-go zones in programming language design. Python too, would be more popular if it had not strayed from the C syntax. That, together with Python's choice for significant whitespace is keeping it from attaining the place it would otherwise deserve. Insisting on the use of recursion instead of looping is also not making LISP any more popular. When given a choice most programmers would use a loop instead of a recursive function. I do not think recursion is inherently better than looping. It is more powerful but its use only makes sense when you actually need that power. Otherwise, it just makes the program more difficult to understand. The use of map/reduce (fold) functionals, however, is still very underrated.
So, for our demo compiler that means: no prefix instead of infix notation, no pushing recursion as a preferred looping approach, no significant whitespace, no straying from the C-style syntax, and no turning the builtin statements into functions.
None of that would help making the point but it would definitely send lots of mainstream programmers cringing. Moving in the direction of LISP must indeed be done with utmost precaution. Ten years ago, reaching the largest possible group of programmers would have meant implementing the compiler in Visual Basic with all the licensing issues associated. Today it probably means doing it in Php, luckily, without any annoying licensing issues attached.
Up till now the approximate distribution of our compiler construction efforts looks as following:
| 50% | infix to postfix translation |
| 30% | syntactic sugar |
| 20% | really needed |
As you can see, most of the effort in building a compiler and virtual machine goes into pleasing the programmer and his traditional conventions and not into pleasing the machine.
4. Building the plugin
4.1. Analysis of the FOR statement
In the FOR statement we first execute the init_statements. Next, we evaluate the for_condition. If the for_condition is true we proceed by executing the for_block statements. At the end of the for_block statements we execute the for_next_statements, evaluate the for_condition again and repeat the process by jumping back to the for_block.
The FOR statement could have no condition at all. Conventionally, it means that the condition is always true. Therefore, we inject a true value on the stack each time we enter the for_condition zone. In that way, in absence of a for_condition we impose the for_condition to be simply true. Of course, we generate the bytecode for the embedded expressions and statements in exactly the same order as the compiler manages to parse them. It requires emitting bytecode as following:
4.2. The semantic compiler plugin API
A semantic compiler plugin implements the following facilities:
- onGeneratingGrammar. Instead of creating one monolithical grammar definition file we will add the rules as we add the language constructs. The build-grammar command uses the grammar.template.y file to generate a new grammar.y file by adding the grammar rules one by one, as they have been specified in each semantic compiler plugin.
- beforeLexing. Prior to lexing we must add new keywords and possibly other lexicographical language elements.
- beforeParsing. Prior to parsing we must give a name to each rule to whose reduction we will attach a callback. Next, we need to implement the callbacks themselves.
The result looks like this:
<?php class _For implements ICompilerSemanticPlugin { function onGeneratingGrammar($generator) { $generator->addToken('FOR'); $generator->addGrammarRule('statement: for_statement'); $generator->addGrammarRule('for_statement: for_clause for_block'); $generator->addGrammarRule('for_clause: FOR BRACKET_LEFT for_init_statements SEMICOLON '. 'for_condition SEMICOLON for_next_statements BRACKET_RIGHT'); $generator->addGrammarRule('for_init_statements: for_statements'); $generator->addGrammarRule('for_next_statements: for_statements'); $generator->addGrammarRule('for_statements: for_statements COMMA expression'); $generator->addGrammarRule('for_statements: expression'); $generator->addGrammarRule('for_statements: '); $generator->addGrammarRule('for_condition: for_expression'); $generator->addGrammarRule('for_expression: expression'); $generator->addGrammarRule('for_expression:'); $generator->addGrammarRule('for_block: block'); } function beforeLexing($compiler) { $compiler->addLexerKeywordPatternRule(3,'FOR','for'); } function beforeParsing($compiler) { $compiler->associateRuleNumberForLHSSymbol('FOR_INIT_STATEMENTS','for_init_statements'); $compiler->associateRuleNumberForLHSSymbol('FOR_CONDITION','for_condition'); $compiler->associateRuleNumberForLHSSymbol('FOR_NEXT_STATEMENTS','for_next_statements'); $compiler->associateRuleNumberForLHSSymbol('FOR_BLOCK','for_block'); $compiler->registerNamedReduceCallback( 'FOR_INIT_STATEMENTS', function($me,$stackStates) { pushLabelNumber($me); emitJump($me,$stackStates,'FOR_BLOCK_START'); emitLabel($me,$stackStates,'FOR_CONDITION'); emitInjectedOperand($me,'TRUE','true'); } ); $compiler->registerNamedReduceCallback( 'FOR_CONDITION', function($me,$stackStates) { emitIfTrue($me,$stackStates,'FOR_BLOCK_START'); emitJump($me,$stackStates,'FOR_END'); emitLabel($me,$stackStates,'FOR_BLOCK_LOOP_NEXT'); } ); $compiler->registerNamedReduceCallback( 'FOR_NEXT_STATEMENTS', function($me,$stackStates) { emitJump($me,$stackStates,'FOR_CONDITION'); emitLabel($me,$stackStates,'FOR_BLOCK_START'); } ); $compiler->registerNamedReduceCallback( 'FOR_BLOCK', function($me,$stackStates) { emitJump($me,$stackStates,'FOR_BLOCK_LOOP_NEXT'); emitLabel($me,$stackStates,'FOR_END'); popLabelNumber($me); } ); } } ?>
4.3. Emitter Plugins
While the compiler is parsing a program text, the callbacks attached emit virtual stack machine instructions. The number of instructions needed to operate a virtual stack machine is very small. Up till now we only have: IfFalse, IfTrue, Label, Operand, Operator and BinaryOperator. Now there is one new Emitter plugin, InjectedOperand:
The new emitter plugin emits an instruction to push the operand specified on the stack.
As you can see, the extension functions for the compiler cannot be methods in the compiler class, since there is no way to add plugin methods on the fly without using inheritance. This would create an inheritance chain that is simply impossible to manage. In fact, I pretty much came to believe that inheritance is not just overrated, it is simply a useless facility in traditional OOP. There is nothing you can do with inheritance that you would not be able to do with MyClass::addMethod() or MyClass::replaceMethod(). Furthermore, since a class is just a list (or more precisely, a hash table) with functions and fields, the entire ceremonial class nomenclatura is simply superfluous. When executing the expression $a->f(x), the only thing the compiler needs to do, is to inject $a under the name $this inside the function closure. That is all there is to it. In that respect Javascript has got it entirely right again while C++, Java, and C# have built their entire business around a concept that does not really make much sense. Looking under the hood of things can indeed clarify a lot!
4.4. Mechanism plugins
In order to keep track of the current label number for the control constructs we need a label-tracking mechanism. For example:
Initially, I implemented a label tracking mechanism in the parse trees themselves. It is simpler, however, to keep track of the labels by using a separate label stack:
<?php function pushLabelNumber($compiler) { //increment label number $compiler->lastLabelNumber++; //push it on the label-number stack $compiler->labelNumberStack[]=$compiler->lastLabelNumber; } function lastLabelNumber($compiler) { //watch out for empty stack if(count($compiler->labelNumberStack)==0) $compiler->error('Illegal attempt to look up label number in empty stack'); //return last stack item return $compiler->labelNumberStack[count($compiler->labelNumberStack)-1]; } function popLabelNumber($compiler) { //watch out for empty stack if(count($compiler->labelNumberStack)==0) $compiler->error('Illegal attempt to pop label number from empty stack'); //return last stack item array_pop($compiler->labelNumberStack); } ?>
As you can see, the extension functions for the compiler are again not methods in the compiler class. In the prototype that you can download at the top of this blog post, I also applied the new label tracking mechanism to the IF, WHILE, and DO WHILE statements.
4.5. No new VM instruction plugins
There is no need for additional instruction plugins in the virtual stack machine. The existing RESET_STACK, JUMP, LABEL, IF_FALSE, IF_TRUE, PUSH_OPERAND, and EXEC_OPERATOR instructions are still capable of executing the entire language. We are almost reaching the power of typical procedural languages, such as Pascal or Basic, with just these simple instructions. The one major thing still missing, is function definitions.
5. Overall plugin structure
The internal APIs that seem to be emerging from our prototype compiler are much more interesting than the compiler itself. In the end, a good application is nothing more than its extension mechanism. Everything else should be replaceable. Therefore, the true results for the exercise are the details about the way in which the plugin types interlock with each other, that is, how the semantic plugins interlock with the emitter plugins and mechanism plugins. If a solid and stable system emerges, the exercise will have been succesful.
If we compare with other pluggable systems, such as Joomla, we see that Joomla is essentially a choreography of templates, components, modules, (interceptor) plugins, libraries, language packs extensions, and a small core library of mechanisms. The less logic ends up in the core mechanisms, the better. If possible, everything should be replaceable. Practice shows that everything will indeed be replaced. The only irreplaceable part of an application is its extension mechanism.
My prototype joomla-style pluggable compiler has led to the following plugin types:
| Compiler | Semantic | language definitions |
| Emitter | bytecode emission | |
| Mechanism | mechanical logic not classified elsewhere and resulting from a separation of concerns | |
| Virtual machine | Instruction | mirror of compiler instruction emitters; vm instruction executors |
| Operand | string, identifier, number, builtin constants: true, false, null | |
| BinaryOperator | vanilla binary operators: +, -, /, *, ^, ==, <, >, >= ... | |
| Operator | non-binary operators, or non-vanilla binary operators: assign, function call, unary minus, not | |
| Function | builtin functions; only println at the moment | |
| Mechanism | mechanical logic not classified elsewhere and resulting from a separation of concerns |
The more plugins we add, the better the plugin architecture should be able to absorb new functionality and the less it should be forced to yield to pressures for change. We will know if the plugin architecture really holds steadfast after we have implemented a complete language with all the regalia of function definitions and other still missing language constructs.
In answer to the numerous remarks I received concerning the "Of all languages in the world why Php" issue, the actual language in which we construct a plugin architecture is rather irrelevant, compared to the architecture itself. How much time would it take to re-do the architecture in a language of choice if it were really an issue? Furthermore, Php has been holding its beer fine, thank you. It is almost like when people start cringing when they hear that the by far largest-scale application in the world, Facebook, runs on Php. That rotten thing is supposed to be totally unscalable! Reality seems to be violating again their carefully-constructed true belief system!
In my next blog post, I will deal with implementing the BREAK and CONTINUE statements. It will require fixing the label numbering mechanism once more.



