Sankuru

Partially self-hosting the Google v8 javascript engine
Written by erik   
Wednesday, 11 November 2009 07:08

As I argued in one of my previous blog posts, the essential guidelines for choosing whether to implement a function in C or in a scripting language are:

  • The performance gain from (re-)implementing the function in C instead of scripting has to be worth the effort. The function must therefore be called in a deeply nested innermost loop. Otherwise, the performance impact will not be noticeable enough to justify the effort.
  • The more polymorphic the function, the less likely it is that a C implementation will be faster. If, as a developer, you are not more proficient at polymorphizing functions manually than the scripting engine is at doing it automatically, you will make the performance situation worse instead of better.

 

The general monomorphic (non-polymorphic) function call can be represented as:

monomorphic: type_{n+1} = f(type_1, type_2, ..., type_n)

The general polymorphic method call can be represented as:

polymorphic: object = (find_function(f, object))(object, object, ..., object)
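To make the difference concrete, here is a minimal sketch in C of what the polymorphic form implies at run time. Everything in it is an assumption made for illustration: the two-type `object`, the method name "length", and the helper names are hypothetical, and this is not how v8 resolves methods. The point is only that `find_function` has to inspect the receiver before a monomorphic implementation can be called.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tagged value with only two types, to keep the sketch short. */
typedef enum { T_INT, T_STR } type_tag;

typedef struct {
    type_tag tag;
    union { long i; const char *s; } as;
} object;

static object make_int(long i)        { object o; o.tag = T_INT; o.as.i = i; return o; }
static object make_str(const char *s) { object o; o.tag = T_STR; o.as.s = s; return o; }

/* Monomorphic implementations: one per concrete receiver type. */
static object length_of_int(object self)
{
    long n = self.as.i, digits = 1;
    if (n < 0) n = -n;                 /* sketch only: ignores LONG_MIN */
    while (n >= 10) { n /= 10; digits++; }
    return make_int(digits);           /* number of decimal digits */
}

static object length_of_str(object self)
{
    return make_int((long)strlen(self.as.s));
}

typedef object (*method_fn)(object);

/* find_function(f, object): resolve the method name against the runtime type. */
static method_fn find_function(const char *f, object receiver)
{
    if (strcmp(f, "length") != 0) return NULL;
    switch (receiver.tag) {
    case T_INT: return length_of_int;
    case T_STR: return length_of_str;
    }
    return NULL;
}

/* The polymorphic call: look up first, then make a monomorphic call. */
static object call_method(const char *f, object receiver)
{
    method_fn fn = find_function(f, receiver);
    assert(fn != NULL);
    return fn(receiver);
}
```

Calling `call_method("length", make_str("hello"))` goes through the lookup and lands in `length_of_str`, while calling `length_of_str` directly skips the lookup altogether. That lookup is the cost a scripting engine tries to optimize away.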

The optimal choice between C and scripting is determined by the nature of the function itself.

For example, concerning the function substr:

string = substr(string, int start [, int length])

Anybody would be hard pressed to find a meaningful polymorphic generalization of it. First, you need a context in which it would make sense to call either this function or a function that operates on another data type. What's more, any reasonable strategy will avoid easily avoidable polymorphization, instead of trying to introduce it.

Furthermore, given the function's inherently monomorphic nature, it will be as easy (or as hard) to implement it in C as in scripting, while the version in C will be substantially faster.
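As an illustration, a C version of substr could look like the sketch below. The exact semantics are my assumptions, since the signature above does not pin them down: offsets count bytes rather than UTF-8 characters, a negative length stands for the omitted optional argument ("up to the end"), out-of-range values are clamped, and the caller frees the result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns a newly allocated copy of at most `length` bytes of `s`,
 * starting at byte offset `start`. A negative `length` means "to the end".
 * Out-of-range values are clamped. Returns NULL if allocation fails.
 * The caller owns the result and must free() it.
 */
char *substr(const char *s, int start, int length)
{
    assert(s != NULL);

    size_t n = strlen(s);
    size_t from = start < 0 ? 0 : (size_t)start;
    if (from > n) from = n;

    size_t count = n - from;
    if (length >= 0 && (size_t)length < count) count = (size_t)length;

    char *out = malloc(count + 1);
    if (out == NULL) return NULL;

    memcpy(out, s + from, count);
    out[count] = '\0';
    return out;
}
```

There is no type lookup anywhere in this function. The argument types are fixed at compile time, which is exactly what makes it a good candidate for C.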

Obviously, one could write instead:

string = string->substr(int start [, int length])

Using syntactic sugar does not, however, change anything about the monomorphic nature of the substr function. The function is inherently monomorphic, and will remain so unless our interpretation of what a string is changes. That will not happen any time soon.

 

Re-implementing native C functions in scripting

The C binary interface of a native library exposes a collection of monomorphic functions:

type_{n+1} f(type_1, type_2, ..., type_n)

A pluggable re-implementation of a native C function in a scripting language requires exposing the same binary interface. For number types and strings, marshalling the function arguments and return value is relatively easy. The problem occurs when any of these types are object types. But then again, object types can be represented as hash tables. Therefore, marshalling data between C and a scripting language requires mapping each data type to one of the following types:

  • an integer type: int32, int64
  • a floating type: double
  • a string: a utf8 string, represented in C by (char *)
  • a hash table: the hash table type, which forms the basis of all the scripting language's object types

 

There are essentially 5 basic data types involved in scripting: int32, int64, double, string, and hash table. We could optionally accept type hinting. If the developer indicates that the function argument is int32, int64, double, or string, the native API should only accept the chosen type. Otherwise, the type is a hash table.
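In C, these five types could be carried across the boundary as a tagged union, roughly as in the sketch below. All names are hypothetical, and I model the type-hinting rule as a small check: a hinted argument must match its hint, while an argument without a hint (a NULL hint here) must be a hash table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct hash_table;  /* left opaque in this sketch */

typedef enum { V_INT32, V_INT64, V_DOUBLE, V_STRING, V_HASH } value_type;

/* One marshalled value: a type tag plus the payload for that type. */
typedef struct {
    value_type type;
    union {
        int32_t i32;
        int64_t i64;
        double f64;
        const char *str;            /* UTF-8, NUL-terminated */
        struct hash_table *hash;
    } as;
} value;

static value value_int32(int32_t v)   { value x; x.type = V_INT32;  x.as.i32 = v; return x; }
static value value_int64(int64_t v)   { value x; x.type = V_INT64;  x.as.i64 = v; return x; }
static value value_double(double v)   { value x; x.type = V_DOUBLE; x.as.f64 = v; return x; }
static value value_string(const char *s) { value x; x.type = V_STRING; x.as.str = s; return x; }
static value value_hash(struct hash_table *h) { value x; x.type = V_HASH; x.as.hash = h; return x; }

/*
 * Type-hint check for one argument.
 * hint == NULL models an undecorated argument, which must be a hash table.
 * Returns 1 if the value is acceptable, 0 otherwise.
 */
int argument_accepts(const value_type *hint, value v)
{
    value_type expected = hint ? *hint : V_HASH;
    return v.type == expected;
}
```

With this in place, the native API can reject a call early when, say, a double arrives where the developer hinted a string.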

The developer should also indicate the native name under which the function is exported. For example:

<int32> function f(a1<int32>, a2 <string>, a3) native coreapp_fn { ... }

  • The function argument a3 would be a hash table.
  • The function would be exported as coreapp_fn.
  • Only functions exported as native C functions would require this kind of decoration.

 

Saving the native code compiled from the script

In order to swap native C functions and scripting functions, it would make sense to have the ability to save the native code generated from the scripting language. This mostly requires packaging it in a supported executable format. For Linux, that is the ELF executable format. For Windows, it is the PE format.

Gaining the ability to replace native C code with scripting code requires duck punching the script's package into a native executable or platform library, and exposing the script's entry point or entry points (API) in the native C binary API. Since the Google v8 implementation generates native code instead of intermediate byte code, it should be possible to turn v8 into a compiler too.

 

A common binary API for both C and scripting

The ability to interchange C and scripting implementations requires a common native binary API. Therefore, it must be possible to expose Javascript through the existing C API. It is probably a good idea to re-use a simplified version of the name-mangling solution developed for C++. Since all object types in scripting are implemented in terms of the native hash table, the solution would work without degenerating, as it did in C++, into attempts to map every possible object type as yet another data type. From a native point of view, these objects are just hash tables. We really don't need more than that to interface between C and a scripting language. C struct pointers already end up as some kind of opaque userdata pointer type in most scripting languages. That solution is definitely good enough.
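A simplified mangling scheme could be as small as the sketch below. The scheme itself is invented for illustration and is not an existing convention: the exported symbol is the native name, two underscores, one letter for the return type, one underscore, and one letter per argument (i = int32, l = int64, d = double, s = string, h = hash table).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef enum { T_INT32, T_INT64, T_DOUBLE, T_STRING, T_HASH } native_type;

/* One letter per native type (hypothetical encoding). */
static char type_code(native_type t)
{
    static const char codes[] = "ildsh";
    assert((int)t >= 0 && (int)t <= (int)T_HASH);
    return codes[t];
}

/*
 * Writes "<name>__<ret>_<args...>" into out, which holds cap bytes.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int mangle(char *out, size_t cap, const char *name,
           native_type ret, const native_type *args, size_t nargs)
{
    int written = snprintf(out, cap, "%s__%c_", name, type_code(ret));
    if (written < 0 || (size_t)written >= cap) return -1;

    size_t pos = (size_t)written;
    if (pos + nargs >= cap) return -1;   /* need nargs letters plus the NUL */

    for (size_t k = 0; k < nargs; k++)
        out[pos + k] = type_code(args[k]);
    out[pos + nargs] = '\0';
    return 0;
}
```

Under this scheme, a function exported as coreapp_fn that returns an int32 and takes an int32, a string, and a hash table would get the symbol `coreapp_fn__i_ish`. Because every object is a hash table, five letters cover the whole alphabet, which is what keeps the scheme simple compared to C++.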

 

Re-implementing scripting functions in native C

One polymorphic method call with n function arguments, represented as:

object (find_function(f,object)) (object,object,...,object)

can be implemented as 5^(n+1) monomorphic function calls, represented as:

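type_{n+1} f(type_1, type_2, ..., type_n)

Here, 5 is the total number of native data types, that is, int32, int64, double, string, and hash table. Each of the n arguments and the return value can independently be any one of the 5 types, hence the exponent n+1.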
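To get a feeling for how fast that number grows, the count can be computed with a few lines of C. The function name is hypothetical, and the assertion only guards against overflowing a 64-bit counter.

```c
#include <assert.h>
#include <stdint.h>

#define NATIVE_TYPE_COUNT 5  /* int32, int64, double, string, hash table */

/*
 * Number of distinct monomorphic signatures behind one polymorphic call
 * with n arguments: each argument and the return value can be any of the
 * 5 native types, so the result is 5^(n+1).
 */
uint64_t monomorphic_variants(unsigned n)
{
    uint64_t count = 1;
    for (unsigned k = 0; k <= n; k++) {
        assert(count <= UINT64_MAX / NATIVE_TYPE_COUNT);  /* overflow guard */
        count *= NATIVE_TYPE_COUNT;
    }
    return count;
}
```

A function with no arguments has 5 variants, one with two arguments has 125, and one with three arguments already has 625. Writing these out by hand in C is only reasonable when most combinations can be ruled out, which is the case for inherently monomorphic functions.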

 

The native hash table

The hash table essentially stores (key,value) tuples.

  • key: int32, int64, or string
  • value: int32, int64, double, string, hash table

 

The hash table is an essential native data structure, but one which is highly polymorphic. We have 3 x 5 = 15 different types of hash tables. The Google v8 implementation solves the issue by implementing the hash table in C++ with templates. In objects.h:

template<typename Shape, typename Key>
class HashTable: public FixedArray { ... }

I am personally not all that happy with this solution, because it drags C++ and its template system into the game, while the original Ousterhout idea was to use either native C or scripting, without dragging a complicated, intermediary, statically-typed, object-native failure of a language such as C++ into the fray.

If we manage to solve the highly polymorphic nature of the hash table methods in C, we do not need to solve any other highly polymorphic issues in C, because we would from there on, be able to use scripting for that.
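One way to handle the 3 x 5 combinations in plain C, without templates, is to tag both keys and values and let a single table implementation switch on the tags. The sketch below is only that, a sketch under simplifying assumptions of mine: fixed capacity, linear probing, no deletion, no garbage collection, strings borrowed rather than copied, and keys of different types never compare equal. It is not how v8 implements its hash table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { K_INT32, K_INT64, K_STRING } key_type;
typedef enum { V_INT32, V_INT64, V_DOUBLE, V_STRING, V_HASH } value_type;

struct hash_table;

typedef struct {
    key_type type;
    union { int32_t i32; int64_t i64; const char *str; } as;
} ht_key;

typedef struct {
    value_type type;
    union { int32_t i32; int64_t i64; double f64;
            const char *str; struct hash_table *hash; } as;
} ht_value;

#define HT_CAPACITY 64  /* fixed size: this sketch never grows */

typedef struct hash_table {
    struct { int used; ht_key key; ht_value value; } slots[HT_CAPACITY];
    size_t count;
} hash_table;

/* FNV-1a over a byte range, continuing from hash state h. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 0x100000001b3ULL; }
    return h;
}

/* The key's type tag is hashed too, so int32 7 and int64 7 differ. */
static uint64_t hash_key(ht_key k)
{
    uint64_t h = fnv1a(0xcbf29ce484222325ULL, &k.type, sizeof k.type);
    switch (k.type) {
    case K_INT32:  return fnv1a(h, &k.as.i32, sizeof k.as.i32);
    case K_INT64:  return fnv1a(h, &k.as.i64, sizeof k.as.i64);
    case K_STRING: return fnv1a(h, k.as.str, strlen(k.as.str));
    }
    return h;
}

static int key_equal(ht_key a, ht_key b)
{
    if (a.type != b.type) return 0;
    switch (a.type) {
    case K_INT32:  return a.as.i32 == b.as.i32;
    case K_INT64:  return a.as.i64 == b.as.i64;
    case K_STRING: return strcmp(a.as.str, b.as.str) == 0;
    }
    return 0;
}

/* Inserts or overwrites. Returns 0 on success, -1 if the table is full. */
int ht_put(hash_table *t, ht_key k, ht_value v)
{
    size_t start = (size_t)(hash_key(k) % HT_CAPACITY);
    for (size_t i = 0; i < HT_CAPACITY; i++) {
        size_t s = (start + i) % HT_CAPACITY;
        if (!t->slots[s].used) {
            t->slots[s].used = 1;
            t->slots[s].key = k;
            t->slots[s].value = v;
            t->count++;
            return 0;
        }
        if (key_equal(t->slots[s].key, k)) { t->slots[s].value = v; return 0; }
    }
    return -1;
}

/* Returns a pointer to the stored value, or NULL if the key is absent. */
const ht_value *ht_get(const hash_table *t, ht_key k)
{
    size_t start = (size_t)(hash_key(k) % HT_CAPACITY);
    for (size_t i = 0; i < HT_CAPACITY; i++) {
        size_t s = (start + i) % HT_CAPACITY;
        if (!t->slots[s].used) return NULL;
        if (key_equal(t->slots[s].key, k)) return &t->slots[s].value;
    }
    return NULL;
}
```

A zero-initialized `hash_table` is an empty table. The price of this approach is a tag check on every key comparison, where the template version resolves the type at compile time. The gain is that only two switch statements know about key types, and everything above the table can be left to scripting.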

Another issue in the Google v8 implementation is the fact that the basic object structure is laced with garbage collection support. This is undoubtedly a valid performance hack. But then again, I would expect the source code structure to explicitly reflect separation of concerns, only very carefully overruled by performance hacks. It looks like designing the essential native hash table and dealing with garbage collection were intermingled from the start in the v8 source code. I am against such an approach.

Of course, there is also the hidden class performance optimization, which is absolutely commendable. It should be possible to carefully add this again, by splitting the method hash from the attribute hash. This is how v8 does it anyway. Interfacing with C only requires exposing the attribute hash of the objects involved.

 

Self-hosting the google v8 engine

The engine's source code already contains quite a few javascript scripts. However, they mostly seem to expose native functions. I suspect it should be possible to rewrite core parts of the engine in javascript itself, without noticeable performance loss. It would require, however, the ability to save the native code compiled from javascript and duck-punch it to look like native code. There is no need to ever completely self-host the engine, because for some functions a native implementation will always remain preferable.

Multiple script packaging options

Implementing a function in C or in scripting language should not be a strategic choice. The choice should be easily reversible.

It is already possible and relatively easy to expose native C functions in Javascript.

Unfortunately, v8 was written in C++ and not in C. Exposing C++ APIs to scripting languages is notoriously hard. It requires babysitting the marshalling of type-sensitive C++ objects, which may even be generated from a template, and for which it is relatively complicated to create simple binary C APIs. Dragging C++ into the fray was a regrettable decision.

Exposing javascript functions as native C functions is also very feasible. All object types should just be represented as native C hash tables. The google v8 engine would simply have to implement the option to save the machine code generated to ELF or PE files at the user's request.

Duck punching scripts into native executables, would indeed not be type-safe, but scripting objects are never type-safe anyway. Non-trivial, statically-typed objects are in practical terms also not type-safe. As I argued in a previous post, seeking type safety leads to exponentially growing complexity and is therefore an unattainable goal. As I wrote above, dragging C++ into the fray on grounds of type safety, makes no sense whatsoever. We already have two orthogonal mechanisms to choose from: C and scripting language. What value would C++ be adding, besides complicating everything?

The ability to mix and match C with a scripting language, by using the same native executable packaging and native API interface style, would allow us to easily move implementations -- at the function level -- from C to the scripting language and the other way around, without noticeable loss of performance.

A scripting language that can be duck-punched into looking like native executables and platform libraries will blow all other scripting language implementations out of the water.

 

Comparison with other languages

One of the major frustrations of developers using scripting languages is the perceived inferiority of the scripting language artifacts. They do not look or behave like native executables or libraries. Deploying them requires dragging a complete trashcan of runtimes, virtual machines, additional libraries, and whatnot into the fray.

For example, besides being an overly verbose and evolutionary dead-end language, Java makes you drag a complete trashcan (jre-6u17-windows-i586.exe) of a staggering 15.90 MB into the fray when you deploy the simplest "hello world" Java program in Windows.

Adding insult to injury, you must invoke the program, the jar, like a second-class citizen with some kind of strange non-native incantation such as java -jar myjar.jar. Next, the java jar will take forever -- and then some -- to start up, until it shows you the ugliest and most notoriously buggy user interface widgets ever devised in the history of mankind. Voilà. There you have your enterprise-grade experience!

The situation with the .NET framework is of course not better. After tiring you out with another overly complex java-style evolutionary dead-end language grammar, it adds a staggering 30.9 - 78.1 MB to your deployment, just to create a "hello world" program. What the hell does that trashcan contain? The launch sequence for the new space shuttle?

To some extent, Perl, Python and many other scripting languages suffer from the same disease. Their artifacts do not look like native platform executables or libraries, cannot be invoked or loaded like them, and require dragging a massive runtime trashcan behind your deployment. The excuse that every Linux machine has a Perl or Python engine installed already does not sound very convincing. Everybody seems to have an excuzilla ready to explain why you need to drag a massive runtime with lots of dependencies behind you. PHP is slightly more nimble and malleable, but not much.

 

Conclusion

Developers want to save time and reduce complexity by using scripting languages instead of C, when appropriate. However, developers still cherish the idea of deploying self-contained native executables and libraries. We definitely do not use scripting languages out of a desire to drag some kind of runtime trashcan behind our applications.

We should not be forced to choose between an interpreter and a compiler just because we chose a particular scripting language grammar. Sometimes we like the one deployment option (as scripts), and sometimes the other (as native executables). Why force-feed developers only one option, when they want both?

Any scripting language offering both options, will be an instant hit. From Google's v8 effort, something capable of doing this, could eventually emerge.

 

 

