Soundex, sownd-ecks, zoudics, soundticks, xuwmtex,…


Soundex is one of those old-school techniques that has been made largely irrelevant by other tools and techniques that have been developed over the years. However, I still find the occasional use for Soundex.

What is Soundex? It is a simple technique for finding words, particularly names, that sound alike (Schmidt, Schmitt, Schmid, Schmitz). It is very handy in cases when users have to find people by name, but are not sure of the exact spelling of the name (e.g. customer service reps dealing with customers over the phone or other cases where the user must guess at how something is spelled).

The basic idea is to create a numeric code that represents the major sounds in the word and then use this value as an index. If two words sound the same, they should generate the same index, so asking for Christie will find anyone named Christie, Christy, or Krystee. Similarly, even a gross misspelling like Iynstine will successfully retrieve the entry for Einstein.

The basic algorithm is as follows (I will use the fictional name Horrorsmell, as it shows most of the rules being applied):

  1. Trim the word, normalize the case, and remove all non-alphanumeric characters. This includes removing any diacriticals (accents, umlauts, etc.).   Horrorsmell => HORRORSMELL
  2. Remove every H and W except when it is the initial letter. HORRORSMELL => HORRORSMELL
    Note: I prefer to remove initial Hs and Ws (see the notes for the next step). HORRORSMELL => ORRORSMELL
  3. The first letter of the name becomes the first letter of the Soundex code. HORRORSMELL=> H
    Note: I prefer not to do this step since it means that users must correctly guess the 1st letter of the desired word – which is not easy with names like Xavier [does that start with an s or x or z?], or places like Djibouti [j or g? What!  D?  Are you kidding me?] or words like mnemonic [nuw-what?]. Instead, I let the subsequent rules apply to the first letter and end up with a fully numeric Soundex code. Additionally, my method results in an integer key, which results in faster lookup times. ORRORSMELL remains ORRORSMELL
  4. Replace each subsequent letter with the corresponding numeric code from below:
          A, E, I, O, U, Y = 0
                B, F, P, V = 1
    C, G, J, K, Q, S, X, Z = 2
                      D, T = 3
                         L = 4
                      M, N = 5
                         R = 6

    HORRORSMELL => H0660625044, or ORRORSMELL => 0660625044

  5. Combine all consecutively repeated numbers into one.  H0660625044 => H06062504, or 0660625044 => 06062504
  6. If the 1st number is the same as the code number for the initial letter, delete that number (for this H and W are the same as 0). H06062504 => H6062504
    Note: step 5 takes care of this automatically if you are using my variation (all numeric) so we can ignore this step. 06062504 stays 06062504
  7. Delete the zeroes. H6062504 => H66254, or 06062504 => 66254
    Note: it is important to not delete the zeroes before this point since doing so can cause step 5 to remove important syllable distinctions.
  8. Retain the initial letter and the first 3 numbers. Pad with trailing zeroes to obtain a 4-character code. H66254 => H662
    Note: I see no reason why you need to limit yourself to just 4 characters. I have often found that 5- and 6-character codes work well as indices and make it easier to differentiate between longer names. You can experiment to find what works best in your situation – but don’t feel limited to just 4 characters or digits. 66254 remains 66254
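The steps above can be sketched in Python as follows. This is only an illustration, not a reference implementation: the function names (`normalize`, `collapse`, `soundex_numeric`, `soundex_classic`) are my own, and I am assuming that step 1 keeps only the letters A to Z, since digits have no entry in the code table.

```python
import unicodedata

# Letter-to-digit table from step 4. H and W have no entry because they are
# removed before encoding (step 2).
CODES = {}
for letters, digit in [("AEIOUY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def normalize(word):
    """Step 1: strip diacritics, uppercase, keep only A-Z (an assumption)."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed.upper() if "A" <= ch <= "Z")


def collapse(digits):
    """Step 5: merge each run of the same digit into a single digit."""
    out = []
    for d in digits:
        if not out or out[-1] != d:
            out.append(d)
    return "".join(out)


def soundex_numeric(word, length=None):
    """All-numeric variation: no leading letter, optional length cap."""
    # Step 2, applied to the initial letter as well.
    letters = normalize(word).replace("H", "").replace("W", "")
    digits = collapse(CODES[ch] for ch in letters)  # steps 4 and 5
    digits = digits.replace("0", "")                # step 7
    return digits if length is None else digits[:length]


def soundex_classic(word):
    """Letter-plus-three-digits form, following steps 1 to 8 as listed."""
    letters = normalize(word)
    if not letters:
        return ""
    first = letters[0]                                     # step 3
    rest = letters[1:].replace("H", "").replace("W", "")   # step 2
    digits = collapse(CODES[ch] for ch in rest)            # steps 4 and 5
    # Step 6: drop the first digit if it equals the initial letter's code
    # (an initial H or W counts as 0).
    if digits and digits[0] == CODES.get(first, "0"):
        digits = digits[1:]
    digits = digits.replace("0", "")                       # step 7
    return first + digits[:3].ljust(3, "0")                # step 8


print(soundex_classic("Horrorsmell"))  # H662
print(soundex_numeric("Horrorsmell"))  # 66254
```

Note that `soundex_numeric` returns a string here; wrap it in `int()` if you want the integer key mentioned in the note on step 3.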

So now, the name John Horrorsmell would be stored with a last name index value of H662 or 66254. If the phone operator mishears John over the phone and tries to look up John Hurarznel, the system will convert Hurarznel to a Soundex value of H662, or 66254, which brings up a list of matching customers that will include John Horrorsmell.
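To make that lookup concrete, here is a rough sketch of a last-name index built on the all-numeric variation. Everything in it is illustrative: `name_key`, `index`, and `lookup` are hypothetical names, the customer list is made up apart from John Horrorsmell, and the helper assumes plain ASCII names (no diacritic handling).

```python
from collections import defaultdict

# Step 4 table. H and W are absent because this variation drops them everywhere.
CODES = {ch: d
         for d, letters in {"0": "AEIOUY", "1": "BFPV", "2": "CGJKQSXZ",
                            "3": "DT", "4": "L", "5": "MN", "6": "R"}.items()
         for ch in letters}


def name_key(name):
    """All-numeric Soundex key as an int (ASCII names assumed)."""
    digits = []
    for ch in name.upper():
        d = CODES.get(ch)  # None for H, W and non-letters, which are skipped
        if d is not None and (not digits or digits[-1] != d):
            digits.append(d)  # consecutive repeats are collapsed
    # Zeroes are removed only after the repeats have been collapsed.
    return int("".join(d for d in digits if d != "0") or 0)


# Hypothetical customer list; only Horrorsmell comes from the example above.
customers = ["John Horrorsmell", "Jane Smith", "Ann Jones"]

# Map each last-name key to the customers that share it.
index = defaultdict(list)
for full_name in customers:
    last_name = full_name.split()[-1]
    index[name_key(last_name)].append(full_name)


def lookup(last_name):
    """Return every customer whose last name shares the typed name's key."""
    return index.get(name_key(last_name), [])


print(lookup("Hurarznel"))  # ['John Horrorsmell']
```

In a real system the key would more likely live in an indexed database column than in an in-memory dictionary, but the idea is the same: compute the key once when the name is stored, and again when the user types a search term.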

There you go – a simple way to see if two words sound similar.

3 thoughts on “Soundex, sownd-ecks, zoudics, soundticks, xuwmtex,…”

  1. [Translated from French] Hello, I must say that all of this is particularly interesting, but don't you think it could be complicated for most people?

    1. Unfortunately, my high-school French, Google’s translator, BabelFish and other online translators aren’t good enough to fully understand your question.

      Could you please reply again with different wording. Perhaps I will be able to get a better translation.

    2. I think Google has finally improved to the point whereby I can understand your question: Don’t I think this would be confusing to the average user?

      The idea is not for the user to perform the soundex conversion. Rather it would occur behind the scenes.
      Let’s suppose we have a customer lookup system that allows users to enter a first and last name to find customers.
      And let’s suppose we have three similarly named customers:
      “Sean Bryan”
      “Shawn Brien”
      “Shaun Brain”

      Now, suppose a user is on the phone and a customer calls and says his name is “Sean Bryan”. Now, the user is not familiar with UK names and so he hears and types “Shawn Brian” into the system to find the customer.

      With a typical system, looking up FirstName = “Shawn” and LastName = “Brian” would fail – there is no such customer. Such a system then, typically, requires the customer to provide an unfriendly, impersonal account number in order to be identified.

      With a Soundex-enabled system the user entry would be run through the Soundex algorithm and come up with values of 25 and 165 (using the all-numeric variation, with the zeroes deleted). A search for FirstNameIndex = 25 AND LastNameIndex = 165 would pull up
      “Sean Bryan”
      “Shawn Brien”
      “Shaun Brain”
      and some others such as “Jan Forum”. But at least it would find the customer and give the user the opportunity to make sure that they are pulling up the right account by verifying the address or phone number.
