Wednesday 6 January 2010

To write a spider, you'll need some kind of batch script that reads the tables and creates the entries in your keywords table. Generally, this is pretty simple, along the lines of (sorry about the horrible pseudo-code, mixture of 4gl and java, sorry!)....

for each customer
split_into_words(customer.name, 100)
split_into_words(customer.address1, 10)
split_into_words(customer.address2, 10)
....etc....
end foreach

for each invoice
store(invoice.number,100)
store(invoice.customer_reference_number,20)
end foreach

.... etc .....

function split_into_words(string, significance)
string[] str = split(string)
for each str
if exists in [keywords table]

update it, adding significance
else
insert into [keywords table]
end foreach
end function

function store(string, significance)
if exists in [keywords table]
update it, adding significance
else
insert into [keywords table]
end function



---------

This then needs to be able to be called from just about anywhere, either respidering a particular customer, or a particular subset of accounts, or a particular subset of invoices, etc.

The section field within the keywords table is an indicator to whatever is reading this table as to where to redirect. eg. If an invoice, put an "I" in the section, the function that reads the keywords table should know to to produce a link to an invoice. If a customer, put a "C" in the section, the reader of the keywords table should produce a link to a customer. Etc, etc, etc.

The difficult parts are

1) assigning significance. If you have customer id 32800 and invoice id 32800, which is more important? Is "Chester Street Motors" more important than a company that's on "Chester Street"? Significance tuning is an art, but if what the user wants is in the top 10, it'll be ok. Important: Watch what users search for, and code your spider accordingly.
2) When and what to respider? Ideally, the spider runs every time something new gets created on the system. You need to be able to respider from anwhere, from any system, whether it be the modern Java or C# or from your legacy cobol systems. Make sure, however you write your spider, you can call it with multiple options (respider this account, these invoices, etc), from anywhere, at any time.
3) Creating the keyword index is an intensive process. You MUST not affect the running of the live system. In practice, when running a full respider, this means creating a temporary table that is then renamed and replaced over the live table (renaming a table, once built, is inexpensive for a database). And, how do you cope with additions during the re-indexing process? I leave that answer to you, or possibly, to google :-)
4) Finally, how do you cope with mis-spellings? For example: Refrigeration Supplies Ltd. Is that spelt right? Are you sure it's not Refridgeration Supplies Ltd? So, it might be spelt wrong in your database, or in the users search. How do you deal with that. (I suggest further reading on Soundex, and Double-Metaphone, I don't really have a great solution for you!!!)