Heuristic String Comparison with Levenshtein Distance

 

The Levenshtein Distance algorithm provides a practical example of dynamic programming.  It allows us to determine the minimum number of operations required to transform one string into another using the operations: Add, Delete, Alter, or Keep.

 

Compare strings "kitten" and "sitting"



s

i

t

t

i

n

g


0

1

2

3

4

5

6

7

k

1

1

2

3

4

5

6

7

i

2

2

1

2

3

4

5

6

t

3

3

2

1

2

3

4

5

t

4

4

3

2

1

2

3

4

e

5

5

4

3

2

2

3

4

n

6

6

5

4

3

3

2

3

 

In the top row we have how many operations it would take to change the substring (up to the column) to a null string.  This would require a delete for every character of the source string.  Therefore the top row is the count of the characters in the source string and the horizontal move across the matrix represents the Delete operation.

In the far left column we have how many operations it would take to change a null source string into the target substring (up to the row).   This would require an addition for every character of the source string.  Therefore the left column is the count of the characters in the target string and the vertical move down the matrix represents the Add operation.

We can trace the path of the minimal number of operations to transform source to target starting at the bottom right cell or from the upper left cell.  Let's start from the upper left as it should be more intuitive.  s != k so we Alter.  Result = "kitting."  i, t, t all match in both strings so we Keep.  Note then that a Diagonal move down and to the right constitutes both the Keep and Alter operations.  If the characters match, we have the choice to do either.  Keep is always superior to Alter as it requires zero operations though.  If the characters do not match the Diagonal move is limited to Alter and costs one operation.

Moving forward with our string "kitting," i != e so we Alter.  Result = "kitteng."  n == n so we Keep.  Result = "kitteng."  g != n so we Delete.  Result "kitten," and we're finished.  Total operation count, 3.

 

Here's the Java implementation of the algorithm:

```

private int editDistance(String src, String tgt) {
       
int len1 = src.length();
        int
len2 = tgt.length();

        int
dp[][] = new int[2][len1 + 1];
        for
(int i = 0; i <= len1; i++) {
            dp[
0][i] = i; // when 2nd string empty we remove all / add all
           
dp[1][i] = 0;
       
}

       
for (int i = 1; i <= len2; i++) {
           
for (int j = 0; j <= len1; j++) {
               
if (j == 0) //src empty
                   
dp[i % 2][j] = i;
                else if
(src.charAt(j - 1) == tgt.charAt(i - 1)) { // chars match, carry fwd prior result
                   
dp[i % 2][j] = dp[(i - 1) % 2][j - 1];
               
} else { // 1+ min(add, remove, replace)
                   
dp[i % 2][j] = 1 + Math.min(dp[(i - 1) % 2][j],
                       
Math.min(dp[i % 2][j - 1],
                           
dp[(i - 1) % 2][j - 1]));
               
}
            }
        }

       
// if the len2  even - 0th row else 1st row
       
return dp[len2 % 2][len1];
}

```

Now for the practical use of the algorithm.  I've built a DSL for comparing and enriching data across 1:N documents.  It's generally useful for automating "stare and compare" type operations such as those performed by accountants to assure data accuracy across documents or reconciliations between upstream and downstream data sources.  One business use case is comparing a customer's employer's name.

Let's say we have a customer that claims to be employed by "Lario and Muigi's Pizza" restaurant and we have the need to verify this against some "source of truth."  We get the customer's employer information via some 3rd party like a credit bureau and want to validate it.  We'd prefer the process be done automatically.  The bureau's data says, "Lario and Muigi's Pizza LLC."  A pure equals/not-equals comparison won't suffice.  They're clearly the same but off by a few characters.

Here we can use Levenshtein distance to measure how far off they are.  There's an extra 4 characters in the source-of-truth.  If we divide this by the length of the source string, we get a normalized measure for how off the source and target are.

Levenshtein Distance / String Length

In our example we have 4/23 = 17% difference.  From here we can establish a cutoff heuristic beyond which we escalate the comparison to a human.  If in our case the cutoff is 20%, we accept the employers as "close enough."

Similarly, the algorithm can be used for things like typo detection or in combination with a type-ahead lookup for finding likely suggestions for misspellings.


Wikipedia Levenstein Distance

Similar algorithm: Gestalt Pattern Matching

Comments

Popular posts from this blog

Engineering Truisms

The Telescoping Constructor (Anti-Pattern)

Software Capex: The Cost of Flexibility