Rainforest

Sankuru

Software services for your Joomla/Virtuemart projects

Views: 359
SocialTwist Tell-a-Friend

Machine translation

English Arabic Chinese (Simplified) German Japanese Russian Spanish



Re-use open source

What you need, often exists already, and covers your requirements for 80%. We will add the remaining 20% for you.

Warranty

3 months free

Quick contact

Fixing Joomla's site search PDF Print E-mail
User Rating: / 0
PoorBest 
Written by erik   
Saturday, 31 October 2009 11:13

 

The Joomla site search function will seach for words or phrases in some or all of the tables in the relational database sitting behind the Joomla site.

 

Users complain about the joomla site search

It is not just the Joomla site search that is known to be unsatisfactory. The Drupal, the Wordpress, and actually almost all CMS site search functions are the subject endless user complaints:

  • The site search doesn't return the results expected.
  • It doesn't find matching data for the search terms.
  • The search results are irrelevant.
  • And so on.

 

It is not just CMS applications which are known to implement unsatisfactory search algorithms. Approximately every application making use of an relational database is subject to the same kind of endless user complaints.

The problem with search functions in relational databases

Search methods will typically emit SQL queries of the following type:

  • select * from table where field like '%{searchterm}%'
  • select * from table where field regexp '{regex}'

 

This approach has its drawbacks:

  • It only searches in one table at a time
  • It is usually not able to use indexes. Therefore, it is usually slow.

 

In the future, we should look at how to allow for typos and synonyms in the search terms.

 

Alternative architecture

I am interested in developing an alternative approach to the strategy of searching through a like or regexp clause.

We already have quite a decent lexer algorithm that simultaneously matches a collection of words at the beginning of a character stream. Thanks to the work done by Kleene on his closures, we can match regular expressions too. The algorithm has the excellent property that it does not matter how many words or regular expressions we want to match at the beginning of the character stream. It goes as fast for one word as for one million.

So, we know how to match a collection of words or expressions at the beginning of a character stream.

Each database table field that we want to search represents a character stream. So, we already know how to search a collection of search terms at the beginning of one field in one row in a database table. However, we want to search:

  • anywhere in the field, not just at the beginning
  • simultaneously for all search terms or expressions
  • in the entire column for that table
  • in more than one column
  • in more than one table


So, now we need to extend the lexer search algorithm to search:

  • anywhere within a character stream, and not just at the beginning
  • in more than one character stream at a time


A traditional lexer state-transition table (STT) looks like:

stt(state_from, character, state_to, rule_matched)

The lexer STT represents the words and expressions we are searching for. The lexer algorithm knows how to search for a collection of expression at the beginning of a character stream S. For example:

S: There is a bridge that allows the road to cross the water.

Searching anywhere in the character stream amounts to searching simultaneously in the following character streams:

  • S0: There is a bridge that allows the road to cross the water.
  • S1: here is a bridge that allows the road to cross the water.
  • S2: ere is a bridge that allows the road to cross the water.
  • S3: re is a bridge that allows the road to cross the water.
  • S4: e is a bridge that allows the road to cross the water.
  • S5: is a bridge that allows the road to cross the water.
  • S6: is a bridge that allows the road to cross the water.
  • S7: s a bridge that allows the road to cross the water.
  • ...
  • S(end-3): er.
  • S(end-2): r.
  • S(end-1): .

 

If the character stream has m characters, searching anywhere in the character stream therefore amounts to searching m times simultaneously in the same character stream, starting from position 0, 1, 2, till m-1. Therefore, if we manage to extend the lexer STT algorithm to search in more than one character stream at a time, we have also fundamentally solved the problem of searching anywhere within the same character stream.

 

Search tables

We are going to need to following tables, in order to implement the alternative search method:

  • search_tables(id,tablename)
  • search_fields(id,tableid,fieldname)
  • search_letter_combinations(id,fieldid,numchars,rowid,position,charss)
  • search_index(id,position,allowable_chars_till_end)


The table search_letter_combinations registers every letter combination in every position in each of the table fields we want to search, from a chosen minimum of k1 up to configurable limit of k2 letters. It can become a massively large table. I somehow sense we have no other choice than maintaining such large index.

The table search_index stores which letters appear from a certain position, until the end of the field content. This table is much smaller. For example, if we consider the following field values:

  • I am tired
  • I like swimming
  • The bridge is large

 

The tables search_letter_combinations and search_index will have the following rows:


search_letter_combinations, numchars=1

  • (1,1,1,1,1,I)
  • (2,1,1,1,2, )
  • (3,1,1,1,3,a)
  • (4,1,1,1,4,m)
  • (5,1,1,1,5, )
  • (6,1,1,1,6,t)
  • (7,1,1,1,7,i)
  • (8,1,1,1,8,r)
  • (9,1,1,1,9,e)
  • (10,1,1,1,10,d)

  • (11,1,1,2,1,I)
  • (12,1,1,2,2, )
  • (13,1,1,2,3,l)
  • (14,1,1,2,4,i)
  • (15,1,1,2,5,k)
  • (16,1,1,2,6,e)
  • (17,1,1,2,7, )
  • (18,1,1,2,8,s)
  • (19,1,1,2,9,w)
  • (20,1,1,2,10,i)
  • (21,1,1,2,11,m)
  • s(22,1,1,2,12,m)
  • ...


search_letter_combinations, numchars=2

  • (45,1,2,1,1,I )
  • (46,1,2,1,2, a)
  • (47,1,2,1,3,am)
  • (48,1,2,1,4,m )
  • (49,1,2,1,5, t)
  • (50,1,2,1,6,ti)
  • (51,1,2,1,7,ir)
  • (52,1,2,1,8,re)
  • (53,1,2,1,9,ed)
  • (54,1,2,2,1,I )
  • (55,1,2,2,2, l)
  • (56,1,2,2,3,li)
  • (57,1,2,2,4,ik)
  • (58,1,2,2,5,ke)
  • (59,1,2,2,6,e )
  • (60,1,2,2,7, s)
  • (61,1,2,2,8,sw)
  • (62,1,2,2,9,wi)
  • (63,1,2,2,10,im)
  • (64,1,2,2,11,mm)
  • (65,1,2,2,12,mi)
  • ...


search_letter_combinations, numchars=3

  • (110,1,2,1,1,I a)
  • (111,1,2,1,2, am)
  • (112,1,2,1,3,am )
  • (113,1,2,1,4,m t)
  • (114,1,2,1,5, ti)
  • (115,1,2,1,6,tir)
  • (116,1,2,1,7,ire)
  • (117,1,2,1,8,red)
  • (118,1,2,2,1,I l)
  • (119,1,2,2,2, li)
  • (120,1,2,2,3,lik)
  • (121,1,2,2,4,ike)
  • (122,1,2,2,5,ke )
  • (123,1,2,2,6,e s)
  • (124,1,2,2,7, sw)
  • (125,1,2,2,8,swi)
  • (126,1,2,2,9,wim)
  • (127,1,2,2,10,imm)
  • (128,1,2,2,11,mmi)
  • (129,1,2,2,12,min)
  • ...

 

The search_letter_combinations table could potentially become a massively large table. It probably makes sense to impose a minimum size for the search term, for example, 3 characters.

The search_index table can be computed as following:

0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5  6  7  8
I     a  m     t  i  r  e  d
I     l  i  k  e     s  w  i  m  m  i  n  g
T  h  e     b  r  i  d  g  e     i  s     l  a  r  g  e

  • (0,'I amtiredlkswmngThb')
  • (1,' amtiredlkswnghb')
  • (2,' amtiredlkswngb')
  • (3,' amtiredlkswngb')
  • (4,' amtiredlkswngb')
  • (5,' amtiredlswng')
  • (6,' amiredlswng')
  • (7,' amiredlswng')
  • (8,' amiredlswng')
  • (9,' amiredlsng')
  • (10,' amirelsng')
  • (11,' amirelsng')
  • (12,' airelsng')
  • (13,' arelng')
  • (14,'arelg')
  • (15,'areg')
  • (16,'reg')
  • (17,'eg')
  • (18,'e')

 

For the purpose of easily maintaining the search_index table, we also need to store the number of times each character is registered in each position. For example:

search_index(14,'a(10343)r(43433)e(2322)l(11111)g(234)')

The search_index table's allowable_chars_till_end field can only become as large as the number of letters in use throughout the database. Imagine the database mostly consists of the characters a-zA-Z0-9, then the typically largest size of the field is 26+26+10=62 characters x 10 =620 characters. We multiply by 10 to take into account the letter-counts.

The number of rows in the table is the length of the largest field in the database. For example, if the largest field is an article record with 20 000 characters. In that case, the maximum size of the table is 20 000 x 620 = 12 400 000 bytes, which is 11.8MB.

No matter how many millions of articles or gazillions of other records you store in the database, the search_index table will only be around 5MB, probably not exceeding the 20MB. The size of the search_index table depends much more on the number of different scripts stored in the database (latin, cyrillic, hebrew, arabic, ...) than on the amount of data stored.

 

Search algorithm

STAGE 1

By looking at the search terms searched for in the database, we have the following typical (memory-based) lexer table available:

stt(state_from,character,state_to,rule_matched)

  • The states are numbered from 0 to the last state
  • A particular character read in state_from will cause the lexer to go state_to
  • If attaining state_to matches a rule, rule_matched will contain the reference for the rule matched

We will revisit the lexer algorithm in a paragraph below.

 

STAGE 2

In possession of the lexer table STT, we can start matching. We are not matching only one character stream at a time, but many character streams simultaneously. If we are matching m different characters streams, which each are l long, we are matching m x l streams simultaneously:

Position 0 (start position)

  • S10
  • S20
  • S30
  • ...
  • Sm0


Position 1

  • S11
  • S21
  • S31
  • ...
  • Sm1


Position k (some streams may already have reached their end)

  • S1k
  • S2k
  • S3k
  • ...
  • Smk

The lexer does not have one state, as in the traditional version of the algorithm, but a vector of states.

Imagine the letters allowed from position 0 till the end of any of the streams is:

search_index(0,'I amtiredlkswmngThb') (19 characters possible)

After reading the first character on all possible streams, the lexer can have up to 19 new states:

(state,history)

  • (S1,I)
  • (S2,a)
  • (S3,m)
  • (S4,t)
  • (S5,i)
  • (S6,r)
  • (S7,e)
  • (S8,d)
  • (S9,l)
  • (S10,k)
  • (S11,s)
  • (S12,w)
  • (S13,m)
  • (S14,n)
  • (S15,g)
  • (S16,T)
  • (S17,h)
  • (S18,b)

 

This assumes that all 19 characters are valid in state zero, and will lead to a new state. This will rarely be the case.

The tuples representing the simultaneous/combined lexer state have two fields:

  • Current state
  • History of characters read up till now

 

The lexer will read one additional character now. Say that the only valid states are:

  • (S1,I)
  • (S2,a)
  • (S3,e)

 

All other states are invalid in position 1. The search index is now:

search_index(1,' amti') (purposely simplified to only 5 characters)

These are the allowable characters after having read 1 character from any stream. Therefore, the lexer can now move to the following state:

  • (S1,I )
  • (S2,Ia)
  • (S3,Im)
  • (S4,It)
  • (S5,Ii)
  • (S6,a )
  • (S7,aa)
  • (S8,am)
  • (S9,at)
  • (S10,ai)
  • (S11,e )
  • (S12,ea)
  • (S13,em)
  • (S14,et)
  • (S15,ei)


If all allowable characters are valid in position 1 and lead to a valid state, the lexer is in the state above, in position 2.

This algorithm will eventually lead to a series of matches. For example, in position 5, the matches could be:

  • (S1,I eat)
  • (S2,A cat)

So, we know what the matches are, but we don't know in what (table,field,position) they match.

 

STAGE 3

For the example above, in which tuples (table,field,position) can we find the match "A cat"? We need to search the table search_letter_combinations in order to find out.

Searching for a combination of l < k characters, requires executing the following query:

select fieldid,position,rowid
from all_letters_combined
where numletters=5 and letters='A cat'


If the search letter blocksize k is smaller than the size of the search expression, the query will have to be cut into character blocks of maximum k characters each. Searching for a combination of k < l < 2 x k letters executing the following query:

select a1.fieldid,a2.position,a3.rowid
from search_letter_combinations a1
join search_letter_combinations a2
on
a1.fieldid=a2.fieldid and
a1.rowid=a2.rowid and
a2.position=a1.position+k
where
a1.numletters=k and a1.letters={first-k-letters}
a2.numletters=l-k and a2.letters={last-l-minus-k-letters}


Searching for a search result of 2 x k < l < 3 x k characters amounts to issuing the following query:

select a1.fieldid,a1.position,a1.rowid
from search_letter_combinations a1
join (search_letter_combinations a2,search_letter_combinations a3)
on
a1.fieldid=a2.fieldid and
a1.fieldid=a3.fieldid and
a1.rowid=a2.rowid and
a1.rowid=a3.rowid and
a2.position=a1.position+k and
a3.position=a2.position+k
where
a1.numletters=k and a1.letters={first-k-letters}
a2.numletters=k and a2.letters={next-k-letters}
a3.numletters=l-k and a3.letters={last-l-minus-k-letters}


For each additional block of k letters, we would have need to join one more time over the table search_letter_combinations.

The smaller k, the more joins there will be required in the query, and the slower it will be to search for the table fields and table rows matching a search expression. However, the larger k, the larger the table search_letter_combinations. It is a classical tradeoff between speed and index size.

Compiling a lexer state-transition table (STT)

Imagine we are searching for either one of the following three words:

  • water
  • waffle
  • road

 

In the following character stream:

There is a bridge that allows the road to cross the water.

How do we match these words?

The associated word rules are:

R1: w   a  t  e  r
-- 0  1  2  3  4  5M

R2: w   a  f  f  l  e
-- 0  1  2  3  4  5   6M

R3:
r  o  a  d  
-- 0  1  2  3  4M


We derive the simultaneous state transitions, starting in each rule in state zero:

  • (R1-0/R2-0/R3-0,w,R1-1/R2-1) followed by: a
  • (R1-1/R2-1,a,R1-2/R2-2) followed by: t,f
  • (R1-2/R2-2,t,R1-3) followed by: e
  • (R1-3,e,R1-4) followed by: r
  • (R1-4,r,R1-5M)
  • (R1-2/R2-2,f,R2-3) followed by: f
  • (R2-3,f,R2-4) followed by: l
  • (R2-4,l,R2-5) followed by: e
  • (R2-5,l,R2-6M)
  • (R1-0/R2-0/R3-0,r,R3-1) followed by: o
  • (R3-1,o,R3-2) followed by: a
  • (R3-2,a,R3-3) followed by: d
  • (R3-3,a,R3-4M)

 

For the sake of simplicity, we renumber the states:

  • 0: R1-0/R2-0/R3-0
  • 1: R1-1/R2-1
  • 2: R1-2/R2-2
  • 3: R1-3
  • 4: R1-4
  • 5: R1-5M (matches R1)
  • 6: R1-2/R2-2
  • 7: R2-3
  • 8: R2-4
  • 9: R2-5
  • 10: R2-6M (matches R2)
  • 11: R3-1
  • 12: R3-2
  • 13: R3-3
  • 14: R3-4M (matches R3)

 

In this way, we get the STT:

  • (0,w,1)
  • (1,a,2)
  • (2,t,3)
  • (3,e,4)
  • (4,r,5) (matches R1)
  • (6,f,7)
  • (7,f,8)
  • (8,l,9)
  • (9,l,10) (matches R2)
  • (0,r,11)
  • (11,o,12)
  • (12,a,13)
  • (13,a,14)  (matches R2)

 

Using reduction rules, we can even incorporate regular expressions into the STT, instead of just plain words.

 

Keeping the search tables synchronized from joomla

Whenever Joomla modifies data in its database, we need to update the following search tables:

  • search_letter_combinations(id,fieldid,numchars,rowid,position,chars)
  • search_index(id,position,allowable_chars_till_end)


We can use a similar technique as what Joomfish uses, and intercept the query() function in the Joomla Database class. It is a bit annoying that, after all these years of joomfishing, the Joomla core team still hasn't added an official hook to the Database.query() method.

The fact itself that you need to explicitly request a hook to the core Joomla team, is simply one of the major drawbacks of using the Php scripting language. If we could use server-side javascript instead, we could simply write something as:

oldQuery=Database.prototype.query;

beforeOldQuery=function() {...}
afterOldQuery=function() {...}

Database.prototype.query=function(sql)
{
beforeOldQuery();
oldQuery();
afterOldQuery();
}


It is massively complex to do a relatively simply thing like that in Php, but also in the so-called "enterprise-grade" languages Java, or C#.

The unshaved crowds touting these "enterprise-grade" languages for their elusive "enormous gains in productivity", never give any example of where those mythical gains could be obtained. At the same time, doing something as simple as demonstrated above, is excessively difficult, if possible at all, in those "enterprise-grade" languages.

It is only the "simplistic language" of javascript, which is able to solve the problem out of the box, blowing all competing, self-appointed "enterprise-grade" languages out of the water.

It is not just this capability that turns javascript into a productivity champion. Exposing class prototypes and staying clear from imposing a particular style of object-orientation is one of the many winner features of javascript. Moving to server-side javascript will rather sooner than later become a must.

 

Observing the database queries in order to maintain the indexes

SQL essentially has three statements to modify data: insert, update, and delete.

Inserting new records

When the database inserts a new record, it will send out an SQL statement in the style:

insert into table(id,{otherfield}) values(idvalue,{othervalue})

We can parse this statement with a simplistic regular expression.

If the table tp search, is registered in search_tables, we need to add records to the table search_letter_combinations, and update the search_index table for this record, for each otherfield registered in the table search_fields.  We could conceive a class SearchTables that would manage this for us:

searchTables.insertRecord(tablename,rowid,{fieldname:value})

calling into:

searchTables.insert(tablename,rowid,fieldname,value)


Updating existing records

We can simplify the problem of updates to existing records, by considering an update operation equivalent to a delete followed by an insert. We could limit the SQL supported to updates to single table updates. We can parse such statement with a simplistic regular expression:

update table set {f=v} [ where {clause} ]

if any field f is registered as a search field, we should select all records affected in the table, delete them from the search tables and enter them again:

rows = select * from table [ where {clause} ]


for each row in the rows selected, we would execute:

searchTables.delete(tablename,rowid)
searchTables.insertRecord(tablename,rowid, {f,v})


Deleting records

delete from table [ where {clause} ]

We can parse a delete statement with a simplistic regular expression:

rows = select * from table [ where {clause} ]


for each row in the rows selected, we would execute:

delete from search_letter_combinations where fieldid=<id> and rowid=<rowid>

In the table search_index, we need to decrement the count for the letters allowed.

 

Search results

The search result records would look like this:

(tablename,rowid,fieldname,search_result,position)

The joomla search plugins would have to register tables and fields to search. There would also be a need to implement for each table a search handler plugin, that will format the search result html, and this, for each table row in which search results have been found. It would have to call the plugin for each tuple below:

(tablename,rowid, {fieldname,search_result,position})

 

Conclusion

The search system described above would allow to search quickly for search terms or regular expressions, across the database, much faster than is possible today with the current architecture for the joomla site search. The search results would become available almost instantaneously.

It would also allow to search by multiple search terms and regular expressions simultaneously.

It would, however, substantially slow down database updates for updates that update table fields registered with the search system. This may be an issue for sites with a large number of visitors. In such case, it would also be possible to queue the update of the search tables, and proceed only as fast as the remaining server resources allow.

It will take some development effort to complete this Joomla library and plugin, but I think it will be worth it. I first still need to finish the Php lexer-parser generator library, of which I have started the development already, and test drive it. Next, I can re-use the Php lexer-parser generator library for test driving the improvements to the Joomla site search, that I have proposed in this blog post.

 


blog comments powered by Disqus
 

Virtuemart

Use Virtuemart to expose and sell your products online.

Joomfish

Use Joomfish to publish in foreign languages and reach clients abroad.

Joomla 1.5 Templates by Joomlashack