Wiki Home

Levenshtein Algorithm

(Updated: 2006.09.29 03:37:18 AM)
Namespace: VFP
What is LevenshteinAlgorithm?

* ------------------------------------------------------------------------
* Levenshtein.prg
* ------------------------------------------------------------------------

** < summary>
** A base class for fancier techniques using levenshtein edit distance
** algorithm.
** < /summary>
define class FuzzyMatch as custom

   ** < summary>
   ** Computes the edit distance between two strings
   ** using the Levenshtein algorithm. The edit distance is how many
   ** changes need to be made to make them identical. The edit distance
   ** for identical strings is 0.0, the edit distance for completely
   ** different strings depends on what letters need to change or
   ** be added in order to make them the same.
   ** < /summary>
   function EditDistance ( ;
         sSource as string, ;
         sTarget as string ) as decimal
      local ;
         iLenSource as integer, ;
         iLenTarget as integer, ;
         iRow as integer, ;
         iCol as integer
      iLenSource = len( sSource )
      iLenTarget = len( sTarget )
      if iLenSource == 0.0
         return iLenTarget
      endif
      if iLenTarget == 0.0
         return iLenSource
      endif
      dimension Levenshtein( iLenTarget+1, iLenSource+1 )
      for iRow = 1 to iLenTarget + 1
         Levenshtein[iRow,1] = iRow -1
      endfor
      for iCol = 1 to iLenSource + 1
         Levenshtein[1,iCol] = iCol -1
      endfor
      for iCol = 2 to iLenSource + 1
         for iRow = 2 to iLenTarget + 1
            local ;
               dCost as decimal, ;
               cColChar as string, ;
               cRowChar as string
            dCost = 0.0
            cColChar = substr( sSource, iCol-1, 1 )
            cRowChar = substr( sTarget, iRow-1, 1 )
            if cColChar != cRowChar
               dCost = 1.0
            endif
            Levenshtein[iRow,iCol] = ;
               min( ;
               Levenshtein[iRow-1,iCol] + 1, ;
               Levenshtein[iRow,iCol-1] + 1, ;
               Levenshtein[iRow-1,iCol-1] + dCost )
         endfor
      endfor
      return Levenshtein[iLenTarget+1,iLenSource+1]
   endfunc

   ** < summary>
   ** Compares two strings and returns the % match using
   ** the EditDistance function.
   ** < /summary>
   function PercentMatch( ;
         sSource as string, ;
         sTarget as string ) as decimal
      local ;
         iLenSource as integer, ;
         iLenTarget as integer, ;
         dResult as decimal, ;
         dEditDistance as decimal
      iLenSource = len( sSource )
      iLenTarget = len( sTarget )
      if iLenSource + iLenTarget == 0
         return 0.0
      else
         dEditDistance = this.EditDistance( sSource, sTarget)
         dResult = (1.0-(dEditDistance/max(iLenSource,iLenTarget))) * 100.0
         return dResult
      endif
   endfunc

enddefine

** < summary>
** This class builds on FuzzyMatch to provide a nifty logic
** to come up with weighted % match between human names. So, if you
** want to search for a certain percentage match as a filter
** of some result set based on a given name as an input, you'd
** create a FuzzyName with those inputs, then compare to your
** result set by calling the FuzzyMatch.PercentMatchTo( fname,
** mname, lname, gen ) method.
** < code>
**    oFuzzyName = createobject("FuzzyName", "Robert", "B.", "Calco", "")
**    * 
**        ? oFuzzyName.PercentMatchTo(sFirstName,sMiddleName,sLastname,sGeneration )
**        ? oFuzzyName.IsWeakMatchTo(sFirstName,sMiddleName,sLastname,sGeneration )
**        ? oFuzzyName.IsStrongMatchTo(sFirstName,sMiddleName,sLastname,sGeneration )
**    * 
** < /code>
** It also lets you set your own weight (defaults: 20%,20%,50%,10%, respectively).
** You can also redefine a "strong" match (default: 90%) or a "weak" match
** (default 70%) as a two-tier cutoff system.
** < /summary>
define class FuzzyName as FuzzyMatch

   ** First name component of the FuzzyName
   protected sFirstName as string
   ** Middle initial/name component of the FuzzyName
   protected sMiddleInitial as string
   ** LastName component of the FuzzyName
   protected sLastName as string
   ** Generation (Jr., Sr., III, etc.) component of the FuzzyName
   protected sGeneration as string
   ** Percentage weight to give to the first name
   protected dFNWeight as decimal
   ** Percentage weight to give to the middle initial/name
   protected dMIWeight as decimal
   ** Percentage weight to give to the last name
   protected dLNWeight as decimal
   ** Percentage weight to give to the generation
   protected dGWeight as decimal
   ** Defines the quality of a "strong" match to another name
   protected dStrongMatchPct as decimal
   ** Defines the quality of a "weak" match to another name. Can
   ** be used to "cutoff" nonsense matches. Define to your requirements.
   protected dWeakMatchPct as decimal

   ** < summary>
   ** Constructor; sets the name components based on input and
   ** establishes the default weights and definition of strong vs.
   ** weak matches based on my experience. Your mileage may vary.
   ** < /summary>
   procedure init( ;
         sFirstName as string, ;
         sMiddleInitial as string, ;
         sLastName as string, ;
         sGeneration as string )

      * set the pieces and parts of this name
      this.sFirstName = sFirstName
      this.sMiddleInitial = sMiddleInitial
      this.sLastName = sLastName
      this.sGeneration = sGeneration

      * default weights
      this.dFNWeight = 0.2        && First name worth 20%
      this.dMIWeight = 0.2        && Middle initial worth 20%
      this.dLNWeight = 0.5        && Last name worth 50%
      this.dGWeight  = 0.1        && Generation worth a measly 10%

      * default Strong/Weak match pcts
      this.dStrongMatchPct = 90.0 && 90% match or greater
      this.dWeakMatchPct = 70.0    && 70% match or greater
   endproc

   ** < summary>
   ** A way to change the default weights. Call this before
   ** making any comparisons if you don't like mine.
   ** < /summary>
   procedure SetWeights( ;
         dFirstName as decimal, ;
         dMiddleInitial as decimal, ;
         dLastName as decimal, ;
         dGeneration as decimal )
      this.dFNWeight = dFirstName
      this.dMIWeight = dMiddleInitial
      this.dLNWeight = dLastName
      this.dGWeight  = dGeneration
      * TODO throw error if totals != 1.0
   endproc

   ** < summary>
   ** Sets the definition of a strong match if you think 90%
   ** is bogus.
   ** < /summary>
   procedure SetStrongMatchPct( dStrongMatchPct as decimal )
      this.dStrongMatchPct = dStrongMatchPct
   endproc

   ** < summary>
   ** Sets the definition of a weak match if you think 70%
   ** is bogus.
   ** < /summary>
   procedure SetWeakMatchPct( dWeakMatchPct as decimal )
      this.dWeakMatchPct = dWeakMatchPct
   endproc

   ** < summary>
   ** Returns boolean based on whether or not this FuzzyName
   ** is at least a weak match to the supplied name component inputs.
   ** < /summary>
   function IsWeakMatchTo( ;
         sFirstName as string, ;
         sMiddleInitial as string, ;
         sLastName as string, ;
         sGeneration as string ) as Boolean
      return this.PercentMatchTo( ;
         sFirstName, ;
         sMiddleInitial, ;
         sLastName, ;
         sGeneration ) >= this.dWeakMatchPct
   endfunc

   ** < summary>
   ** Returns a boolean based on whetehr or not this FuzzyName
   ** is at least a strong match to the supplied name component inputs.
   ** < /summary>
   function IsStrongMatchTo( ;
         sFirstName as string, ;
         sMiddleInitial as string, ;
         sLastName as string, ;
         sGeneration as string ) as Boolean
      return this.PercentMatchTo( ;
         sFirstName, ;
         sMiddleInitial, ;
         sLastName, ;
         sGeneration ) >= this.dStrongMatchPct
   endfunc

   ** < summary>
   ** This is the workhorse function that takes the components of
   ** another name and applies some heuristics to come up with an overall
   ** weighted match percentage. Should be obvious except for the additional
   ** logic to catch a particular Special Case; namely, when the only
   ** difference between the two names is a mismatch between the first
   ** and middle names. For instance "Bob B. Calco" and "B. Bob Calco".
   ** This was a common enough issue for us that we decided to go ahead and
   ** implement it. We basically force the percent match to be 100%-1% or 99%.
   ** If you see a 99% match, 99.99% of the time it will be for this reason.
   ** < /summary>
   function PercentMatchTo( ;
         sFirstName as string, ;
         sMiddleInitial as string, ;
         sLastName as string, ;
         sGeneration as string ) as decimal

      local ;
         dFirstNamePct as decimal, ;
         dMiddleInitialPct as decimal, ;
         dLastNamePct as decimal, ;
         dGenerationPct as decimal, ;
         dResult as decimal, ;
         dFudgeFactor as decimal

      * Get the % match on the first names. Picking the max of the
      * match to the first and middle names.
      dFirstNamePct = max( ;
         this.PercentMatch( this.sFirstName, sFirstName ), ;
         this.PercentMatch( this.sFirstName, sMiddleInitial ) )

      * Get the % match on the middle names. This is tricky only because
      * of the Special Case.
      if this.sMiddleInitial == "" and sMiddleInitial == ""
         dMiddleInitialPct = max( ;
            100.0, ;
            this.PercentMatch( this.sMiddleInitial, sFirstName ) )
      else
         dMiddleInitialPct = max( ;
            this.PercentMatch( this.sMiddleInitial, sMiddleInitial ), ;
            this.PercentMatch( this.sMiddleInitial, sFirstName ) )
      endif

      * Get the % match on the last names
      dLastNamePct = this.PercentMatch( this.sLastName, sLastName );

      * Get the % match on Generations. the only special consideration
      * here is that if they are both blank, rather than penalize the
      * whole name and lose 10% of an otherwise perfect match, we'll call
      * it 100% match---since it's true: both names have no supplied generation.
      if this.sGeneration == "" and sGeneration == ""
         dGenerationPct = 100.0
      else
         dGenerationPct = this.PercentMatch( this.sGeneration, sGeneration)
      endif

      * Return the weighted %, with adjustment for Special Case
      if this.sFirstName == sMiddleInitial and this.sMiddleInitial == sFirstName
         dFudgeFactor = 1.0
      else
         dFudgeFactor = 0.0
      endif

      dResult = ;
         (this.dFNWeight * dFirstNamePct) + ;
         (this.dMIWeight * dMiddleInitialPct) + ;
         (this.dLNWeight * dLastNamePct ) + ;
         (this.dGWeight  * dGenerationPct ) - dFudgeFactor
      return dResult
   endfunc

enddefine




It would seem that the VFP implementation of this algorithm could be better. Below is what I come up with. I've opted for case-insensitive. And, it should be noted that when used against a dictionary (such as in a spellchecker), I would change the function below so that value sent in to tcSource would already be uppercase and a third parameter would be added so the Len + 1 of tcSource could also be sent in... both of these values (uppercase of tcSource and len + 1 of tcSource) would be held outside the scan (or whatever was calling the function) in variables as they would never changed during the pass through the dictionary table. -- Craig SBoyd
**********************************
FUNCTION LevenshteinDistance(tcTarget, tcSource)
  **********************************
  LOCAL lnLenTarget, lnLenSource, lnRow, lnCol, lnTempRow, lnTempCol, lnCost
  lnLenTarget = LEN(tcTarget) + 1
  lnLenSource = LEN(tcSource) + 1
  DIMENSION aryTable(lnLenTarget, lnLenSource)
  STORE 0 TO aryTable

  *!* Remove the following two lines of code
  *!* if you prefer a case-sensitive distance
  tcTarget = UPPER(tcTarget)
  tcSource = UPPER(tcSource)

  FOR lnCol = 1 TO lnLenSource
    aryTable[1, lnCol] = lnCol - 1
  ENDFOR

  FOR lnRow = 2 TO lnLenTarget
    lnTempRow = lnRow - 1
    aryTable[lnRow, 1] = lnTempRow
    FOR lnCol = 2 TO lnLenSource
      lnTempCol = lnCol - 1
      IF SUBSTR(tcTarget, lnTempRow, 1) == SUBSTR(tcSource, lnTempCol, 1)
        lnCost = 0
      ELSE
        lnCost = 1
      ENDIF
      aryTable[lnRow, lnCol] = MIN(aryTable[lnTempRow, lnCol] + 1, ;
        aryTable[lnRow, lnTempCol] + 1, ;
        aryTable[lnTempRow, lnTempCol] + lnCost)
    ENDFOR
  ENDFOR
  RETURN aryTable[lnLenTarget, lnLenSource]
ENDFUNC

I had to compare some fields on middle sized tables, and found the performance of both versions somewhat lacking and decided to code a bare bones version starting from Craig's code, which calculates the percentage difference as well. This is not totally in line with "a function should do one thing only", but in my case the similarity is best described with both values and minimizes overhead (just one line, as anything else is already there). Any string manipulation, like the Upper() in Craig's code, should be done outside the function in my opinion. A previous call to Alltrim() is highly reccommended if you are working from table fields. A few more cycles could be eliminated by previous mapping of tcSource a dedicated array, but that would add nearly a quarter of the code for savings in the 2% range. The code is shorter, and - at least for me - easier to visualize, as the base variables are the loop counters and the different rows and columns are easy to spot. Depending on the Data I throw at it, this is between 30-40% faster than Craig's implementation and takes about half the time of the first implementation. But if I have to use this regularly or on large tables I probably will build a fll, even if the "tightlooped" code is now only one rather long line. I've also added the DamerauLevenShteinDistance, which allows for transposition. This is nearly identical code, but as most of the difference is inside the inner loop and therefore nearly always called, I've chosen the more ugly code duplication to get utmost speed for working on potentially large tables instead of a more elegant solution.
What is DamerauLevenshteinAlgorithm? In short: the distance "Levenshtein" vs. "Levensthein" as measured in Levenshtein is 2, whereas in DamerauLevenshtein it is 1, as a transposition is counted as 1 exchange instead of the 2 changes in the traditional Levenshtein. ThomasGanss

#define C_NDISTANCE	aryTable[m.lnLenRows, m.lnLenCols]
**********************************
FUNCTION LevenshteinDist_Perc(tcTarget, tcSource, tnOptPercDiff)
  **********************************
  LOCAL lnDistance, lnLenTarget, lnLenSource, lcRowChar, lnNxtRow, lnRow, lnCol
  lnLenRows = LEN(m.tcTarget) + 1
  lnLenCols = LEN(m.tcSource) + 1
  DIMENSION aryTable(m.lnLenRows, m.lnLenCols)
  STORE 0 TO aryTable

  FOR lnCol = 1 TO m.lnLenCols
    aryTable[1, m.lnCol] = m.lnCol - 1
  ENDFOR

  FOR lnRow = 1 TO m.lnLenRows - 1
    lnNxtRow = m.lnRow + 1
    aryTable[m.lnNxtRow, 1] = m.lnRow
    lcRowChar = SUBSTR(m.tcTarget, m.lnRow, 1)
    FOR lnCol = 1 TO m.lnLenCols - 1
	aryTable[m.lnNxtRow, m.lnCol+1] = MIN( ;
	   aryTable[m.lnRow, m.lnCol] + IIF( m.lcRowChar==SUBSTR(m.tcSource,m.lnCol,1), 0, 1), ;
	   aryTable[m.lnRow, m.lnCol+1] + 1, ;
	   aryTable[m.lnNxtRow, m.lnCol] + 1 )
    ENDFOR
  ENDFOR

  tnOptPercDiff = IIF(m.lnLenRows + m.lnLenCols = 2, 0.0, ;
   		100.0 - 100.0 * C_NDISTANCE / (max(m.lnLenRows, m.lnLenCols)-1) )
  RETURN C_NDISTANCE	
ENDFUNC

**********************************
FUNCTION DamerauLevenshteinDist_Perc(tcTarget, tcSource, tnOptPercDiff)
  **********************************
  LOCAL lnDistance, lnLenTarget, lnLenSource, lcLastRowChar, lcRowChar, lnNxtRow, lnRow, lnCol
  lnLenRows = LEN(m.tcTarget) + 1
  lnLenCols = LEN(m.tcSource) + 1
  DIMENSION aryTable(m.lnLenRows, m.lnLenCols)
  STORE 0 TO aryTable

  FOR lnCol = 1 TO m.lnLenCols
    aryTable[1, m.lnCol] = m.lnCol - 1
  ENDFOR

  lcRowChar = ""
  FOR lnRow = 1 TO m.lnLenRows - 1
    lnNxtRow = m.lnRow + 1
    aryTable[m.lnNxtRow, 1] = m.lnRow
    lcLastRowChar = m.lcRowChar
    lcRowChar = SUBSTR(m.tcTarget, m.lnRow, 1)
    FOR lnCol = 1 TO m.lnLenCols - 1
	aryTable[m.lnNxtRow, m.lnCol+1] = MIN( ;
	   aryTable[m.lnRow, m.lnCol] + IIF( m.lcRowChar==SUBSTR(m.tcSource,m.lnCol,1), 0, 1), ;
	   aryTable[m.lnRow, m.lnCol+1] + 1, ;
	   aryTable[m.lnNxtRow, m.lnCol] + 1 )
			
        IF  m.lcLastRowChar == SUBSTR(m.tcSource,m.lnCol,1) ;
            and m.lcRowChar == SUBSTR(m.tcSource,m.lnCol-1,1)

           aryTable[m.lnNxtRow, m.lnCol+1] = min( aryTable[m.lnNxtRow, m.lnCol+1], ;
                                         aryTable[m.lnRow-1,  m.lnCol-1] + 1)  && // transposition
        endif
    ENDFOR
  ENDFOR

  tnOptPercDiff = IIF(m.lnLenRows + m.lnLenCols = 2, 0.0, ;
   		100.0 - 100.0 * C_NDISTANCE / (max(m.lnLenRows, m.lnLenCols)-1) )
  RETURN C_NDISTANCE	
ENDFUNC

Quite interesting idea this, oddly someone asked me if foxpro had a "percentage similarity of strings" function the other day.
I looked up this article and saw your take on it, but isn't all that array manipulation hard work?
Have I missed something in my simplification?

Sorry, I broke Hungarian Notation (I only use FPW here and I never really saw the use of having the variable type in the prefix.)

**********************************
FUNCTION levens2
**********************************
PARAMETERS m.pstring1, m.pstring2, m.ptype
PRIVATE ALL LIKE l*
* return Levenshtein distance between 2 strings
* ie number of characters difference
* pstring1 - a string
* pstring2 - another string
* ptype    - 1 for number of characters
*            2 for percentage similarity

*!* Remove the following two lines of code
*!* if you prefer a case-sensitive distance

m.pstring1 = UPPER(m.pstring1)

m.pstring2 = UPPER(m.pstring2)

m.lminlen=MIN(LEN(m.pstring1), LEN(m.pstring2))
m.lmaxlen=MAX(LEN(m.pstring1), LEN(m.pstring2))

m.ldist=0
FOR m.lchar=1 TO m.lminlen
	m.ldist=m.ldist+IIF(SUBSTR(m.pstring1, m.lchar, 1)==SUBSTR(m.pstring2, m.lchar, 1), 0, 1)
NEXT m.lchar
m.ldist=m.ldist+(m.lmaxlen-m.lminlen)

RETURN IIF(m.ptype=1, m.ldist, 100*(1.0-(m.ldist/m.lmaxlen)))


and a quick Wiki question.. How come in the "pre formatted text" tags above, it's suppressing adjacent spaces in my comments as per standard HTML?

Yes, you missed something :-). The Levenshtein-algorithm counts the changes needed in order to change string1 to string2. There are 3 possible operations: change one letter, insert a letter and remove one letter. The Levenshtein-distance of "colorful"/"colourful" is 1 because the only operation needed is inserting a 'u' - while your algorithm returns 5. Btw. I would keep your function in this wikipage because one sees better what Levenshtein is and what not.
-- Kurt (glad to see someone else in the fox-community also doubts hungarian notation)

Category Code Samples Category Algorithms