Approximate pattern matching

Approximate pattern matching is a computer science problem where we try to find all occurances of a pattern string p [ 1 . . m ] p[1..m] in a text t [ 1 . . n ] t[1..n] within some "distance" d 𝑑 d , where we measure distance using some distance metric.

Some common distance metrics:

APM with Hamming distance

Consider the pattern p = ipsi 𝑝 ipsi p=\texttt{ipsi} in the word t = missippi 𝑑 missippi t=\texttt{missippi} . The Hamming distance H ⁒ ( p , t ⁒ [ 2..5 ] ) = H ⁒ ( πš’πš™πšœπš’ , πš’πšœπšœπš’ ) = 1 𝐻 𝑝 𝑑 delimited-[] 2..5 𝐻 πš’πš™πšœπš’ πš’πšœπšœπš’ 1 H(p,t[2..5])=H(\mathtt{ipsi},\mathtt{issi})=1 , so one substitution yields a match. Pattern matching with the Hamming distance can be done in O ⁒ ( n ⁒ log ⁑ m ⁒ min ⁑ ( ∣ Ξ£ ∣ , m / log ⁑ ( m ) ) ) 𝑂 𝑛 π‘š delimited-∣∣ Ξ£ π‘š π‘š O(n\log m\min(\mid\Sigma\mid,\sqrt{m/\log(m)})) time, where n 𝑛 n is the length of the string, m π‘š m the length of the pattern, and ∣ Ξ£ ∣ delimited-∣∣ Ξ£ \mid\Sigma\mid the size of the alphabet.

References

APM with Levenshtien distance

Couple of possible approaches here:

Dynamic programming is the easiest and most flexible, but slow: requires O ⁒ ( m ⁒ n ) 𝑂 π‘š 𝑛 O(mn) time.

:{
import Control.Monad
import Data.Function
import Data.Functor

-- | The dynamic programming pattern for APM and Levenshtien distance.
dmatch :: (Eq a) => [a] -> [a] -> (Int -> Int -> Int) -> Int -> Int -> Int
dmatch needle haystack subproblem 0 j = 0
dmatch needle haystack subproblem i 0 = i
dmatch needle haystack subproblem i j =
  if (needle !! (i - 1) == haystack !! (j - 1)) then
    subproblem (i - 1) (j - 1)
  else
    1 + minimum [subproblem (i - 1) j, subproblem i (j - 1), subproblem (i - 1) (j - 1)]

-- | 'approxMatch dist needle haystack' will find all occurances of 'needle' in 'haystack' that are
-- at most 'dist' edit distance apart.
approxMatch :: (Eq a) => Int -> [a] -> [a] -> [(Int, [a])]
approxMatch dist needle haystack = do
  let m = length needle
  let n = length haystack
  j <- [m..n]
  let c = fix (dmatch needle haystack) m j
  guard (c <= dist)
  pure (c, drop (j - m) $ take j haystack)
:}

approxMatch 1 "annual" "annealingannual"