Rainforest

Sankuru

Implementing, customizing, extending, and troubleshooting Joomla/Virtuemart

Views: 2571
SocialTwist Tell-a-Friend

Machine translation

English Arabic Chinese (Simplified) German Japanese Russian Spanish



Re-use open source

What you need, often exists already, and covers your requirements for 80%. We will add the remaining 20% for you.

Free quote

Request a free quote today.

The EAV database performance problem PDF Print E-mail
User Rating: / 1
PoorBest 
Written by erik   
Saturday, 30 May 2009 08:24

I think it should be possible to cut down on the source code bloat that plagues EAV applications, by using a correctly implemented rewrite rule.

The other typical problem associated with EAV databases, is their notorious performance problems. EAV databases are known to be slow, and even slower when they are under load.

Contrary to relational, non-EAV databases, EAV databases do not scale well. The larger the data set managed, and the larger the number of users, the slower the application becomes.

 

Database blocks and fields

The EAV performance problem occurs at the database engine level. Take, for example, the following query:

select c1,c2 from T where c3=v;

The relational engine will usually store table rows in one (or a few) contiguous blocks, and process results more or less as following:

  • use the index to locate the ids for rows satisfying the condition c3=v
  • locate the address for the blocks for each such row
  • retrieve the one (or few) blocks
  • parse the fields out of the blocks


In the case of EAV databases, each field in a row, is stored in its own row. The database engine will have to process the following query:

select V1.value, V2.value from values V1 left join (V2,V3) on V1.rowid=V2.rowid and V1.rowid=V3.rowid
where
    V1.table='T'
and V2.table='T'
and V3.table='T'
and V1.field='c1'
and V2.field='c2'
and V3.field='c3'
V3.value=v;

  • use an index to locate the ids for rows satisfying the condition V3.value=v
  • locate the address of the first field value
  • retrieve the first field value
  • locate the address of the second field value
  • retrieve the second field value
  • merge the fields into one query result set

 

The problem is not so much the fact that the database engine can read a relational, non-EAV table block by block, instead of reading it field by field, for the EAV database. This slows down the database, but not that dramatically, I guess. The real problem lies elsewhere. 

 

Indexing an EAV database

In a relational, non-EAV database, the database engine is able to apply the query condition, through the use of an index, to filter tables before implicating them into the query results. It spares the database engine the task of retrieving and evaluating lots and lots of rows, that the engine actually does not need to add to the query results.

In relational, non-EAV databases, you can speed up the database, by creating an index over column c3. The more distinct values there are in column c3, the more intensely the database engine can pre-filter the query results. Therefore, the use of indexes can often speed up query processing tremendously.

In an EAV database, however, you cannot create an index just over column c3. All field values are stored in a table similar to:

values(table,field,rowid,value)

The only index possible, that will speed up processing queries that contain the condition V.value=v, will consist of:

index_eav=(table,field,rowid,value)

This is the entire values table itself.

It may be possible for the database engine to use index_eav to speed up results. However, this is not necessarily the case. In order to dramatically speed up the processing, is also necessary that the database engine evaluates the query condition first, in order to cut down on rows to process. It must start by performing the following function:

{rows}(V3)=filterTable(values, V3.table='T' and V3.field='c3' V3.value=v)

If the database engine processes the query clauses in any other order, it will waste its resources in operations that perform complete table scans over table T. I somehow guess that a sufficiently intelligent database engine will be able to use index_eav to cut down on the number of rows to process.

 

Excessive joining in an EAV database

The next problem is more tricky, however. The database engine will stil have to join the EAV filter results:

{rows}(V1)=filterTable(values, V1.table='T' and V1.field='c1')
{rows}(V2)=filterTable(values, V2.table='T' and V2.field='c2')
{rows}(V3)=filterTable(values, V3.table='T' and V3.field='c3' V3.value=v)

The joining process requires the database engine to sort filter results by rowid first. Unless the index index_eav is sorted by rowid, it will need to produce the entire filter results first, and then sort them by rowid, before merging the results. Therefore, it may make sense to change the order of fields in index_eav:

index_eav=(rowid, field, table, value)

When the filter results are already in the correct sort order for joining over rowid, the database engine does not need to produce the entire set of filtered rows, before merging them. It is probably somehow possible to avoid this kind of sort operations.

Another problem, is the fact that the database engine must merge the rows over the V1 and V2 filter results. This amounts to effectively evaluate all rows for table T twice. This operation, therefore, amounts to doing a complete table scan over table T for each column in T involved in the query results. Any query involving fields from a large table stored in EAV style will therefore be slow.

The database engine would be able to avoid this problem, by producing the filter results for V3 only, and then look up the corresponding columns c1 and c2 in the values table, for each row in V3:

{rows}(V3)=filterTable(values, V3.table='T'and V3.field='c3' V3.value=v)

for each row in {rows}(V3)
    value1=lookupValue(table='T',field='c1',rowid=<rowid>)
    value2=lookupValue(table='T',field='c2',rowid=<rowid>)

So, the database engine should pre-filter the results for fields that have additional query conditions associated to them, while looking up fields that have no conditions associated to them.

I doubt that even a very intelligent database engine will choose this strategy automatically, in order to evaluate this query. From the point of view of the database engine, V1 and V2 do have conditions associated, that is, the requirement that table='T' and that field='c1' or 'c2'. The database engine cannot know that these conditions are just EAV conditions, and that they should not count towards the number of conditions associated with a particular column.

Therefore, if one insists on using EAV tables, one should limit the use of EAV to relatively small master data tables, such as products or product categories. The use of EAV for larger, and continuously growing, master data tables such as customers, will cause the application performance to automatically degrade over time; and of course, it is not a good idea at all, to use EAV for large sets of rapidly growing transaction data, such as orders or invoices.

 

Conclusion

I guess that it is somehow possible to speed up an EAV database, by indexing the EAV values table judiciously. This should make it possible to have somewhat decent processing speed for relatively simple queries.

But then again, if a query is a bit more complicated, the corresponding EAV query will be extremely complicated. It invariably requires the database engine to use several stages in merging query results. There will be a lot of scope for things to go wrong.

Therefore, getting acceptable performance out of an EAV-based application with performance problems, requires extensive database tuning, based on analyzing extensively the database query plans for the queries typically involved in producing data for the application.

EAV databases, therefore, create the need for database experts to do some serious database babysitting. At the same time, it is not even sure that the efforts in database tuning will produce the performance improvements desired.

It would be interesting to figure out how much of a performance improvement can be had, by database-tuning an EAV application such as Magento, instead of just throwing additional hardware at the problem.
 


blog comments powered by Disqus
 
 
Joomla 1.5 Templates by Joomlashack