Soundex is one of those old-school techniques that has been made largely irrelevant by other tools and techniques that have developed over the years. However, I still find the occasional use for Soundex.
What is Soundex? It is a simple technique for finding words, particularly names, that sound alike (Schmidt, Schmitt, Schmid, Schmitz). It is very handy in cases when users have to find people by name, but are not sure of the exact spelling of the name (e.g. customer service reps dealing with customers over the phone or other cases where the user must guess at how something is spelled).
The basic idea is to create a numeric code that represents the major sounds in the word and then use this value as an index. If two words sound the same, then they should generate the same index: so asking for Christie will find anyone named Christie, Christy, or Krystee. Similarly, even a gross misspelling like Iynstine will successfully retrieve the entry for Einstein.
The basic algorithm is as follows (I will use the fictional name Horrorsmell, as it shows most of the rules being applied):
- Trim the word, normalize the case, and remove all non-alphanumeric characters. This includes removing any diacriticals (accents, umlauts, etc.). Horrorsmell => HORRORSMELL
- Remove every H and W except an initial one. HORRORSMELL => HORRORSMELL
Note: I prefer to remove initial Hs and Ws (see the notes for the next step). HORRORSMELL => ORRORSMELL
- The first letter of the name becomes the first letter of the Soundex code. HORRORSMELL => H
Note: I prefer not to do this step, since it means that users must correctly guess the 1st letter of the desired word, which is not easy with names like Xavier [does that start with an s or x or z?], places like Djibouti [j or g? What! D? Are you kidding me?], or words like mnemonic [nuw-what?]. Instead, I let the subsequent rules apply to the first letter and end up with a fully numeric Soundex code. As a bonus, my method produces an integer key, which generally makes for faster lookups. ORRORSMELL remains ORRORSMELL
- Replace each subsequent letter with the corresponding numeric code from below:
  A, E, I, O, U, Y = 0
  B, F, P, V = 1
  C, G, J, K, Q, S, X, Z = 2
  D, T = 3
  L = 4
  M, N = 5
  R = 6
H + ORRORSMELL => H0660625044, or ORRORSMELL => 0660625044
- Combine all consecutively repeated numbers into one. H0660625044 => H06062504, or 0660625044 => 06062504
- If the 1st number is the same as the code number for the initial letter, delete that number (for this purpose, H and W count as 0). H06062504 => H6062504
Note: if you are using my all-numeric variation, step 5 (combining repeated numbers) already takes care of this, so this step can be skipped. 06062504 stays 06062504
- Delete the zeroes. H6062504 => H66254, or 06062504 => 66254
Note: it is important not to delete the zeroes before this point. The zeroes mark the vowels between consonants, so removing them early can cause step 5 (combining repeated numbers) to merge numbers that belong to different syllables.
- Retain the initial letter and the first 3 numbers. Pad with trailing zeroes to obtain a 4-character code. H66254 => H662
Note: I see no reason why you need to limit yourself to just 4 characters. I have often found that 5- and 6-character codes make great indices, because they make it easier to tell longer names apart. You can experiment to find what works best in your situation, but don't feel limited to just 4 characters or digits. 66254 remains 66254
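The letter-prefixed steps above can be sketched in Python roughly as follows. This is only an illustration of the steps as listed here, not a reference implementation of any standard. The names normalize and soundex_lettered and the digits parameter are made up for the example, and the accent stripping assumes that Unicode NFKD decomposition is good enough for your data (letters it cannot decompose are simply dropped).

```python
import unicodedata

# Letter-to-number table from the list above. H and W have no entry;
# they count as 0 only when compared as the initial letter.
CODES = {}
for letters, digit in [("AEIOUY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def normalize(word):
    """Strip accents, uppercase, and keep only the letters A-Z."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed.upper() if "A" <= ch <= "Z")


def soundex_lettered(word, digits=3):
    """Return a letter-prefixed code such as H662 (hypothetical helper)."""
    name = normalize(word)
    if not name:
        return ""
    first = name[0]
    # Remove every H and W after the initial letter, then code the rest.
    coded = [CODES[ch] for ch in name[1:] if ch not in "HW"]
    # Combine consecutively repeated numbers into one.
    collapsed = []
    for d in coded:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Drop the first number if it equals the initial letter's own code.
    first_code = CODES.get(first, "0")  # initial H or W counts as 0
    if collapsed and collapsed[0] == first_code:
        collapsed.pop(0)
    # Delete the zeroes, then truncate and pad with trailing zeroes.
    kept = "".join(d for d in collapsed if d != "0")
    return first + kept[:digits].ljust(digits, "0")


print(soundex_lettered("Horrorsmell"))  # H662
```

Raising digits gives the longer codes mentioned in the last note, for example soundex_lettered("Horrorsmell", digits=5) keeps all five numbers.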
So now, the name John Horrorsmell would be stored with a last name index value of H662 or 66254. If the phone operator mishears John over the phone and tries to look up John Hurarznel, the system will convert Hurarznel to a Soundex value of H662, or 66254, which brings up a list of matching customers that will include John Horrorsmell.
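Here is a rough sketch of that lookup using the all-numeric variation, with an in-memory dictionary standing in for the database index. Everything in it is an assumption made for illustration: the names soundex_numeric and lookup, the max_digits default of 6, and the sample customer records are invented, and the input is assumed to be plain ASCII (no accent stripping).

```python
from collections import defaultdict

# Letter-to-number table; H and W are deliberately absent so they are
# dropped everywhere, including the initial position.
GROUPS = [("AEIOUY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
          ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]
CODES = {ch: digit for letters, digit in GROUPS for ch in letters}


def soundex_numeric(word, max_digits=6):
    """Return an integer key such as 66254 (hypothetical helper)."""
    digits = []
    for ch in word.upper():
        if ch not in CODES:  # skips H, W, and anything that is not a letter
            continue
        d = CODES[ch]
        # Combine consecutively repeated numbers into one.
        if not digits or digits[-1] != d:
            digits.append(d)
    # Delete the zeroes only after combining, then truncate.
    key = "".join(d for d in digits if d != "0")[:max_digits]
    return int(key) if key else 0


# Build the index: Soundex key of the last name -> matching customers.
# These records are invented sample data.
index = defaultdict(list)
for customer in ["John Horrorsmell", "Jane Smith", "Albert Einstein"]:
    last_name = customer.split()[-1]
    index[soundex_numeric(last_name)].append(customer)


def lookup(last_name):
    """Return every stored customer whose last name sounds similar."""
    return index.get(soundex_numeric(last_name), [])


print(lookup("Hurarznel"))  # ['John Horrorsmell']
print(lookup("Iynstine"))   # ['Albert Einstein']
```

In a real system the key would more likely live in an indexed integer column next to the name, but the idea is the same: compute the code once when the record is stored, and again for whatever the user types.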
There you go – a simple way to see if two words sound similar.