|
There are no translations available.
The essence of map-reduce programming is to separate looping constructs out of the algorithm, that is, loop abstraction. For example:
foreach($items as $item) { if($item->country=='CN') { $results[]=processItem($item); } }
becomes:
$results=map( $items, array( CLOSURE_CONDITION=>'$element->country=="CN"', CLOSURE_ACTION=>'processItem($element)' ) );
You are probably already familiar with loop abstraction through the use of SQL. The equivalent statement in SQL would be:
select processItem(field) from items where country='CN';
The implicit looping construct is managed by the functional engine. This has multiple advantages:
- Parallel execution. This is the main reason why Google uses loop abstraction. It allows the functional engine to distribute the processing over a cluster, instead of executing it on one CPU.
- The use of indexes. It can speed up processing tremendously, if the elements to process are indexed. This is how relational databases typically benefit from loop abstraction.
- Lazy evaluation. It can preserve resources by only calculating a result just before it will be used.
- Caching. It can preserve resources to cache the results.
- Materialized views. Instead of re-calculating all results, it is possible to detect the changes in the input collections, and only process the changes, in order to get the new results. The technique is currently used only in a few relational database systems, but it has the potential of speeding up database queries tremendously.
Loop abstraction also allows for substantially shorter source code. I used the technique extensively in my Php LALR1 parser generator.
Map-reduce programming and SQL are very similar. By defining the function relJoin($collection1, $collection2, $joinCondition), it would be possible to generate map/reduce expressions for SQL expressions. For example:
select * from T1 join T2 on T1.key=T2.key and T1.field1='value1' and T2.field2='value2'
would become:
relJoin( select( $t1, array( CLOSURE_CONDITION=>'$element->field1=="value1"' ) ), select( $t2, array( CLOSURE_CONDITION=>'$element->field2=="value2"' ) ), '$collection1->key==$collection2->key' );
As you can see, a relational database system could be implemented by using:
- a (small) scripting engine
- a (small) map-reduce library
- a (small) library to load/save collections to disk
- an LALR1 parser to read the SQL queries and generate the equivalent map/reduce expressions
It would be possible to get very fast relational database systems up and running like this, by optimizing the looping constructs through map-reduce programming.
Especially, the techniques of extensively materializing results and processing changes only, could yield an interesting tradeoff between execution speed and storage space. Indeed, by massively storing redundant intermediate results ("materializing"), it should be possible to increase the execution speed of SQL queries.
Such tradeoff between speed and storage space, is sorely lacking in existing relational database management systems. Introducing it, would massively improve relational database scaleability.
So, on the one side, I think that Google is right. Map-reduce programming is the underlying way to build massively scaleable database systems. What's more, the proof is in the pudding. Google's systems prove indeed that they are massively scaleable.
On the other hand, however, since SQL queries can easily be translated to map-reduce expressions, there is no need to remove SQL from the picture and therefore affect backward compatibility. We can have our cake and eat it too.
MySQL is implemented in highly optimized C. In the greater light of things, this is both an advantage and a disadvantage. It makes the MySQL database engines fast and small in terms of program footprint.
However, without the flexibility of an underlying scripting engine, it will be hard, if not impossible, to re-base the database engine on a highly optimized map-reduce library.
At the same time, highly optimized C source code is difficult to read. It discourages attempts to experiment with better scaling strategies. Additional innovation gets stuck in the existing implementation, because few people can work on such source code.
|