Matching Inquiry
There are two main ways that search can be used.
Inquiry: You want to find information about something
Match: You have a single document or record in mind
Search engines are typically used for inquiry search: you specify a keyword which should retrieve the information you require. Desktop search is almost always match search – you know a document exists and you want to find it. Yellow pages search may be either – you remember part of the name of the company you want to find (match), or you want to find companies that provide a type of product or service (inquiry).
Many data quality tasks can be seen as a type of match search. Suppose you have a database of contact details and want to merge it with another. The same person may be represented differently in each database, but these records should be matched to avoid duplicates. This is achieved by taking each record in the first database and searching for it in the second.
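To make the merge scenario concrete, here is a minimal sketch in Python. It assumes contact records are plain name strings and uses the standard library's difflib.SequenceMatcher as the similarity measure. The helper names (normalize, best_match), the sample names and the 0.8 threshold are illustrative assumptions, not a recommended configuration.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())

def best_match(record: str, candidates: list[str], threshold: float = 0.8):
    """Return (candidate, score) for the most similar candidate,
    or None when nothing reaches the (assumed) threshold."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, normalize(record), normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return (best, best_score) if best_score >= threshold else None

# Hypothetical sample data: the same people written in different ways.
first_db = ["Kermit T. Frog", "Miss Piggy"]
second_db = ["kermit t frog", "Fozzie Bear", "Ms. Piggy"]

# Search for each record of the first database in the second one.
matches = {record: best_match(record, second_db) for record in first_db}
```

Comparing every record against every candidate is quadratic, so a real merge would normally narrow the candidates first (for example by postcode or first letter) before scoring them.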
When doing inquiry search, it's usually easy to find results; the main problem is ranking them well. There may be hundreds, thousands or more potential results, but the user will only look at the first handful, maybe a few dozen if they are really serious, so it's really important to find a way to bring the best results to the top. A large part of Google's success is down to its PageRank algorithm. Sometimes the query may be misspelled, but a did-you-mean spell checker, as seen on Google, is a good approach to solving this problem.
Page Rank is the method that Google uses to determine the relevance and strength of a page to satisfy a search term or keyword. Everyone knows about "Page Rank". But What no one knows about is how Google change "Page Rank" every so often and the algorithms behind it. This alone enables Google to stay ahead of the competition and provide the "best search result" for the surfers.
...
Google Page Rank algorithms is a secret that they guard with their lives. Not only that, it is the most influential single factor in building a $160billion empier<sic> called Google.
You will not find more than three people within Google's world who know the whole working of "Page Rank", and rightly so, if Page Rank to remain a winning technical strategy that fits with the overall corporate strategy.
When performing a match search there is either a single target or perhaps a small set of targets. What matters here is making sure that the query's search terms are found in the target, even if the query and target do not match exactly. Just as with inquiry searching, very frequent terms such as 'the' in text or 'street' in addresses are uninteresting. But in match search, rare terms become especially important as distinguishing features that uniquely identify a target.
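One common way to express "rare terms matter more" is an inverse document frequency (IDF) weight: a term that appears in few records scores high, and a term that appears in every record scores zero. The sketch below is only an illustration of that idea. The function names, the sample addresses and the plain log(N / df) formula are assumptions made here, not something this post prescribes.

```python
import math
from collections import Counter

def idf_weights(records: list[str]) -> dict[str, float]:
    """Weight each term by log(N / df), where df is the number of records containing it."""
    doc_freq = Counter()
    for record in records:
        doc_freq.update(set(record.lower().split()))
    total = len(records)
    return {term: math.log(total / count) for term, count in doc_freq.items()}

def match_score(query: str, record: str, weights: dict[str, float]) -> float:
    """Sum the weights of the terms shared by query and record.
    Terms never seen in the records contribute nothing."""
    shared = set(query.lower().split()) & set(record.lower().split())
    return sum(weights.get(term, 0.0) for term in shared)

# Hypothetical address list: 'street' is frequent, 'kermit' is rare.
addresses = ["12 high street", "7 station street", "3 mill street", "9 kermit lane"]
weights = idf_weights(addresses)

query = "kermit street"
best = max(addresses, key=lambda rec: match_score(query, rec, weights))
```

Here the query shares 'street' with three records and 'kermit' with only one, and the rare term wins: the record containing 'kermit' gets the highest score even though it does not contain 'street' at all.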
Approximation or fault tolerance is an essential feature of match search. When we look for matches we will typically consider three types of similarity:
Lexicographical (look the same)
Phonetic (sound the same)
Semantic (mean the same)
For example:
Kermit the Frog / Krmit the Frag – share most of their characters and look similar
Kermit the Frog / Cirmit the Frog – could be pronounced the same way
Kermit the Frog / Kermit the Toad – look and sound different, but have a similar meaning
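The first two kinds of similarity can be sketched with two small, well-known building blocks: Levenshtein edit distance for "look the same" and a phonetic key for "sound the same". Both functions below are illustrative assumptions rather than a definitive method. In particular, phonetic_key is only Soundex-like: unlike standard Soundex it also encodes the first letter by sound class (so K and C collide), treats h, w and y like vowels, and does not truncate the code. Semantic similarity needs outside knowledge such as a thesaurus and is not covered here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

# Consonant classes borrowed from Soundex; vowels, h, w and y get no code.
SOUND_CLASSES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        SOUND_CLASSES[letter] = digit

def phonetic_key(phrase: str) -> str:
    """Soundex-like key per word (assumed variant, see the note above)."""
    keys = []
    for word in phrase.lower().split():
        digits, last = [], ""
        for ch in word:
            code = SOUND_CLASSES.get(ch, "")
            if code and code != last:
                digits.append(code)
            last = code  # an uncoded letter resets the duplicate check
        keys.append("".join(digits))
    return " ".join(keys)

lexical = edit_distance("Kermit the Frog", "Krmit the Frag")
same_sound = phonetic_key("Kermit the Frog") == phonetic_key("Cirmit the Frog")
different_sound = phonetic_key("Kermit the Frog") != phonetic_key("Kermit the Toad")
```

With these definitions the first example pair is two edits apart (one deleted 'e', one 'o' replaced by 'a'), the second pair shares a phonetic key, and the third pair differs on both measures, which is why it can only be caught by a semantic comparison.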
In my next post I'll give some details on how fault-tolerant matching can be achieved.