Levenshtien distance

The Levenshtien distance or edit distance $L(s_{1},s_{2})$ between two strings $s_{1},s_{2}$ (not necessarily of equal length!) is the number of swaps, insertions, and deletions required to turn one string into another.

The following algorithm computes the Levenshtein distance, and also serves as it's specification.

:{
lev :: (Eq a) => [a] -> [a] -> Int
lev [] b = length b
lev a [] = length a
lev (a:as) (b:bs) | a == b = lev as bs
                  | otherwise = 1 + minimum [lev as (b:bs), lev (a:as) bs, lev as bs]
:}

Examples

$L(\mathtt{flaw},\mathtt{lawn})=2$
$L(\mathtt{kitten},\mathtt{sitting})=3$
$L(\mathtt{uniformed},\mathtt{uninformed})=1$

Properties

If the two strings have the same length, then the Hamming distance serves as an upper bound on the Levenshtein distance.

The Levenshtein also satisfies the Triangle inequality. Another way of saying this is that Levenshtein distance forms a Locally Graded Category. In fact, Levenshtein distance is a canonical example of why we need to consider local gradition in Constructive Mathematics; in general, we cannot compute the levenshtein distance unless the underlying type of a string has Decidable Equality; however, the distance is still well defined even if we don't have decidable equality!

Algorithms

One natural approach to computing Levenshtein distances is via dynamic programming. This is essentially the same approach used in Levenshtein distance counts edits, but where we memoize the results in a table.

We can do this nicely in Haskell with a bit of knot-tying magic.

:{
import Data.Functor

-- We start with a type of a bounded grid equipped with a focus.
data Grid a = Grid { grid :: [[a]], focus :: (Int, Int), bounds :: (Int, Int) }
  deriving (Show, Functor)

-- We can refocus the grid
refocus :: Grid a -> Int -> Int -> Grid a
refocus g i j = g { focus = (i, j) }

-- and we can extract the element at the current focus.
extract :: Grid a -> a
extract (Grid { grid = grid, focus = (i, j) }) = grid !! i !! j

-- Composing these together gives us access to the "subproblem" at (i, j)
gsubproblem :: Grid a -> Int -> Int -> a
gsubproblem g i j = extract (refocus g i j)

-- 'Grid a' forms a comonad, where 'extract f g' applies f to 'g' refocused
-- at every position.
extend :: (Grid a -> b) -> Grid a -> Grid b
extend k (g@Grid { bounds = (x, y) }) =
  let grid =
        [0..x - 1] <&> \i ->
        [0..y - 1] <&> \j ->
        k (refocus g i j)
  in g { grid = grid }

-- Take a fixpoint of a function that is paremeterised over an 'x' by 'y' grid,
-- where the focus is set to the coordinates '(x-1, y-1)'.
gfix :: Int -> Int -> (Grid a -> a) -> Grid a
gfix x y f =
  let gfix = extend f g
      g = Grid { focus = (x-1, y-1), bounds = (x, y), grid = grid gfix }
  in g

-- A dynamic programming version of Levenshtein distance.
dlev :: (Eq a) => [a] -> [a] -> (Int -> Int -> Int) -> Int -> Int -> Int
dlev as bs subproblem i 0 = i
dlev as bs subproblem 0 j = j
dlev as bs subproblem i j | as !! i == bs !! j = subproblem (i - 1) (j - 1)
                          | otherwise = 1 + minimum [subproblem (i - 1) j, subproblem i (j - 1), subproblem (i - 1) (j - 1)]

-- A bit of 'gfix' lets us tie the knot!
dist :: (Eq a) => [a] -> [a] -> Int
dist as bs =
extract $ gfix (length as) (length bs) $ \(g@(Grid {focus = (i, j)})) ->
     dlev as bs (gsubproblem g) i j
:}

dist "kitten" "sitting"

Note that this is (essentially) the following algorithm, but where we simply count the number of non-ok edits at the end.

:{
import Data.Foldable
import Data.Function

data Edit a = Keep a | Subst a a | Insert a | Delete a
  deriving (Show)

cost :: Edit a -> Int
cost (Keep _) = 0
cost _ = 1

edits :: (Eq a) => [a] -> [a] -> [Edit a]
edits [] bs = fmap Insert bs
edits as [] = fmap Delete as
edits (a:as) (b:bs) | a == b = Keep a : edits as bs
                      | otherwise = minimumBy (compare `on` (sum . fmap cost)) $ [Delete a : edits as (b:bs), Insert b : edits (a:as) bs, Subst a b : edits as bs]

:}

edits "kitten" "sitting"

This in turn does a search on the following edit tree for the shortest series of edits.

:{
data Tree a = Node [(a, Tree a)]
  deriving (Show)

tree :: [a] -> Tree a
tree = foldr (\a as -> Node [(a, as)]) (Node [])

editTree :: (Eq a) => [a] -> [a] -> Tree (Edit a)
editTree [] bs = tree $ fmap Insert bs
editTree as [] = tree $ fmap Delete as
editTree (a:as) (b:bs) | a == b = Node [(Keep a, editTree as bs)]
                    | otherwise = Node [(Delete a, editTree as (b:bs)), (Insert b, editTree (a:as) bs), (Subst a b, editTree as bs)]

:}

editTree "ab" "b"

We can test our edit producing code like so:

:{
edit :: [Edit a] -> [a] -> [a]
edit [] as = as
edit (Keep _ : es) (a : as) = a : edit es as
edit (Subst _ b : es) (_ : as) = b : edit es as
edit (Insert a : es) as = a : edit es as
edit (Delete _ : es) (_ : as) = edit es as
:}

edit [Subst 'k' 's',Keep 'i',Keep 't',Keep 't',Subst 'e' 'i',Keep 'n',Insert 'g'] "kitten"
edit [Insert 'b', Delete 'a', Delete 'b'] "ab"