Searching a name database

ammoQ

Currently my company faces a problem that seems easy but eventually turned out to be a problem.

The task is to search through a database of ~6 million persons; the most typical search is obviously by name.

Caveats: a) must be fast (in the range of 1-3 seconds)

b) results shall be sorted by name

c) there is only one search field for first name and last name; the search should go over both (searching for "harrison" returns both George Harrison and Harrison Ford)

d) should support partial search like "ro*" to find "Robertson" and "Rodriges" alike.

e) should support searches like "gates NOT bill" to find all poor members of the Gates clan.

Our current implementation uses Oracle Text, but as it turns out, requesting sorted output takes a lot of time (e.g. for a search returning ~20000 records, Oracle Text returns the first results quickly, but takes one whole minute to return the whole result set; oviously, sorting takes place after the whole result set has been selected.

Any suggestions how to implement that faster?

asuffield

If you want an algorithm, you're looking for a form of trie. This isn't an easy problem, but it is a solved problem (about half of Knuth vol. 3 discusses how to deal with it).

dhromed

@asuffield said:

This isn't an easy problem, but it is a solved problem

This also means a third-party solution will do the job better than an in-house experiment.

asuffield

@dhromed said:

@asuffield said:
This isn't an easy problem, but it is a solved problem
This also means a third-party solution will do the job better than an in-house experiment.

Only if you can find somebody selling one. Surprisingly few solved problems in computer science have implementations available on the market.

PSWorx

@asuffield said:

@dhromed said:
@asuffield said:
This isn't an easy problem, but it is a solved problem
This also means a third-party solution will do the job better than an in-house experiment.
Only if you can find somebody selling one. Surprisingly few solved problems in computer science have implementations available on the market.

How about this one? ;)

Seriously though, in general you may be right but I can't imagine no third-party solutions for this particular problem. After all, it's one of the core components of every modern search engine.

asuffield

@PSWorx said:

@asuffield said:
@dhromed said:
@asuffield said:
This isn't an easy problem, but it is a solved problem
This also means a third-party solution will do the job better than an in-house experiment.
Only if you can find somebody selling one. Surprisingly few solved problems in computer science have implementations available on the market.
How about this one? ;)

Google's search products are quite slow by comparison. Their focus is on intelligence rather than performance - taking a vague query and producing the results you wanted sorted by "best" match, at the cost of search time. They throw vast amounts of hardware at the search problem because they have to handle vast numbers of queries; they don't try to get any individual search to run much faster than a large fraction of a second.

Seriously though, in general you may be right but I can't imagine no third-party solutions for this particular problem. After all, it's one of the core components of every modern search engine.

Everybody wants to build flashy bloatware that sells well to managers and yuppies. Nobody wants to build simple, fast solutions to common problems. It's all part of the big problems with the industry.

If home hardware supplies were produced along similar lines, then we'd have multitools stacked up on every shelf, but it would be completely impossible to buy a bag of nails.

If you want it fast you can use distributed computing to keep all parts of your database in memory at all times and just dole out sections from a server.

This uses an approach like this (obviously they don't keep it in memory, but they do use a surprising number of servers to handle each request).

lpope187

Is this on a fully loaded server or is there some other bottle neck? My laptop can perform that type of query much faster even when doing a full table scan. Granted I'm using MSSQL 2005 and it's not heavily loaded. The query I'm issuing is something like the following:

SELECT Id, FirstName, LastName FROM Names WHERE FirstName like 'Smith%' OR LastName like 'Smith%' ORDER BY LastName, FirstName.

I get the sorted result set back in about 6 seconds.

When I use Full Text indexing and change to Contains, it takes around .5 seconds. Somehow, I don't think Oracle's Full Text implementation would suck that badly compared to Sql Server's, so I'm inclined to believe something else might be the issue like server load, tablespace config, indexing, etc.

ammoQ

@lpope187 said:

Is this on a fully loaded server or is there some other bottle neck? My laptop can perform that type of query much faster even when doing a full table scan. Granted I'm using MSSQL 2005 and it's not heavily loaded. The query I'm issuing is something like the following:

SELECT Id, FirstName, LastName FROM Names WHERE FirstName like 'Smith%' OR LastName like 'Smith%' ORDER BY LastName, FirstName.

I get the sorted result set back in about 6 seconds.

When I use Full Text indexing and change to Contains, it takes around .5 seconds. Somehow, I don't think Oracle's Full Text implementation would suck that badly compared to Sql Server's, so I'm inclined to believe something else might be the issue like server load, tablespace config, indexing, etc.

Well, the test server (where I measured the performance) definitely sucks. On the other hand, even if the production server is 10x as fast, it still would not be fast enough. I think I will give MS SQL Server2005 a try, even if is not likely that I can use it for this project. But maybe for the next. Does the express edition (or whats-it-called) of SQL Server 2005 include that full text indexing feature too? If yes, how would I create and query a table that uses it? (I know, I should RTFM...)

lpope187

Yes, the express edition includes full text indexing (or atleast the docs says it does). It's fairly easy to setup - the trick is to enable full text on database creation. If you forget, you have to enable it via SQL like so (I can't find a gui option once the DB is created):

EXEC [MyDB].[dbo].[sp_fulltext_database] @action = 'enable'

Then in Sql Management Studio, just right click on the table and go to the Full Text Indexing option and define the index. That'll start the wizard to create the catalog, define the columns and setup the tracking / population schedule. After the catalog is defined, it'll show up in the storage folder in the object browser of the database if you need to redefine anything.

Once you have the Full Text Index, you can use the full text clauses Contains, Freetext, ContainsTable and FreetextTable. You might want to look at the docs for those clauses and compare to Oracle. Chances are it's not quite a full-featured.

My only hesitation would be to use the Developer's edition of SQL rather than Express as I don't know if Express will have the same performance charateristics. It's only around $50 US and runs more like the server versions (in fact it's equivalent to the enterprise edition).

db2

@lpope187 said:

My only hesitation would be to use the Developer's edition of SQL rather than Express as I don't know if Express will have the same performance charateristics. It's only around $50 US and runs more like the server versions (in fact it's equivalent to the enterprise edition).

Bear in mind it's not licensed for production use. For development setups, though, go nuts (I've got about 3 copies installed on my workstation).

lpope187

@db2 said:

@lpope187 said:
My only hesitation would be to use the Developer's edition of SQL rather than Express as I don't know if Express will have the same performance charateristics. It's only around $50 US and runs more like the server versions (in fact it's equivalent to the enterprise edition).

Bear in mind it's not licensed for production use. For development setups, though, go nuts (I've got about 3 copies installed on my workstation).

My comment about it being equivalent to enterprise was kind of misleading. What I meant to say was that it has all the features of enterprise and can be a source of frustration if you don't understand the differences. You have unfettered access to all the features when developing, but you need to make sure that the features on production are still available otherwise stuff will fail. Got burned a few times during the Beta / Go Live! timeframe when featuresets either weren't documented or still under development.

ammoQ: Have you considered a q-gram based search algorithm? Even with 6 million records, it should provide the speed you require.

ammoQ

@lpope187 said:

ammoQ: Have you considered a q-gram based search algorithm? Even with 6 million records, it should provide the speed you require.

Can you provide a link to a description of that algorithm?

asuffield

@ammoQ said:

@lpope187 said:
ammoQ: Have you considered a q-gram based search algorithm? Even with 6 million records, it should provide the speed you require.
Can you provide a link to a description of that algorithm?

It's just "any algorithm that operates on length-q substrings". In practice, anything that talks about a q-gram is a form of trie.

TGV

This requires specialist software. I used to work for a company that makes (plug-in) software precisely for these purposes. They're called Polderland (www.polderland.nl). Apart from most of the requirements above, it can also search for names that are spelled slightly differently (e.g. ending in -ez instead of -es). It's quite advanced. The only thing it didn't have at the time was the NOT operator, but I'm sure it's quite easy to add on top.

lpope187

As ausffield indicated, there isn't just one algroithm but it is a family of methods using q length substrings. For speed purposes it requires a separate index that needs to be preprocessed. So if you are constantly changing data in the source table, you'll need to account for that.

Q-gram based algorithms are very popular in fuzzy matching and data cleansing. Some algorithms are domain specific with person identification and dna sequencing being some of the more popular ones. I'd imagine that Google's "Did you mean" is based on a q-gram algorithm.

A good place as any to start is with Microsoft Research papers here: http://research.microsoft.com/dmx/DataCleaning/. Two years ago I used a bi-gram based algorithm for data cleansing based on on what some census bureau used. Good key words are fuzzy matching, q-gram, and data cleansing.

If you go this route, the searching will most likely need to be different but also more natural for the end user. You don't use wildcards since you define a threshold of similarity 0 (no match) - 1 (exact match). The other nice thing (at least in comparison to MSSQL's full text searching and soundex) is you don't need the beginning of the string to be correct.

Example:

Source set contains

1, Clements, Dave

2, Smith, Bob

Searching for "lements, David" should give you something like

1, Clements, Dave, 0.90

2, Smith, Bob, 0.35

Qwerty

@ammoQ said:

Well, the test server (where I measured the performance) definitely sucks. On the other hand, even if the production server is 10x as fast, it still would not be fast enough.

A properly-configured production server (buffer sizes etc) could easily be 1000 times faster than a default-configured prototype.

Is it possible to sort the data in the database? That way the first items fetched will already be the top of the alphabet. Of course full-text indexing doesn't work this way and that is an important clue: full-text is not the correct kind of index.

Full-text is supposed to help you find words, phrases and concepts within large blocks of text, such as webpages. If you're only searching first-name/last-name fields then normal indexing should be sufficient. That's why you can't buy any plug-in products that do this. This is a core function of any database. If you have Oracle then you already have the best tool to solve this problem.

Oracle even has functions like Soundex, which allows you to search for names that sound alike. A more generalised search for mispelled names takes a bit more effort but I have found that a Levenshtein Distance of 3 or less is perfect when searching on full name (first + last.)

rdrunner

Hello...

I think its important to know the requirements more exactly.... Will there be a way to search for words ENDING with a specific string? like '%ser' or will it be strictly 'whatever%' ?

If you only have the 2nd case, then you can get away with using a simple covering index ofer the 2 search fields, and you might consider adding a 2nd index for the 1st name. This will give you blazing fast results, since you would only do index scans on a small portion of the index. You could futher optimize it by making the index (Lastname,Firstname) also the clustered index on your table. This would lead to a reduced disk access, since all related data would be placed close together on the disk.

Another option would be to include a precomputed field in your data, which allows for a phonetic matching via a "soundex" (or related algorithms). But this would only araise when you have the need for a phonetic matching. Without that requirement a good covering index will do the trick nicely.

But you NEED a covering index for this. Many ppl seem to think that there is no difference between a table which hold 2 distinct indexes (one for lastname and one for firstname), and a table which holds one index (one covering (lastname,Firstname)).

In the 1st example you can search for "Joe, Smith" and you will have all "Joe,Smith" (say 100 Records) neatly below a single node in that index. In the 2nd example you will get returned ALL Joes (say 5000) and ALL "Smith's" (say 100.000) from the querry, and you need to check every single Joe if he exists in the "Smith List". And this is very expensive.

Hope this makes sense ;)