(Updated: 2006.09.29 03:37:18 AM)
| |
What is LevenshteinAlgorithm?
* ------------------------------------------------------------------------
* Levenshtein.prg
* ------------------------------------------------------------------------
** < summary>
** A base class for fancier techniques using levenshtein edit distance
** algorithm.
** < /summary>
define class FuzzyMatch as custom
** < summary>
** Computes the edit distance between two strings
** using the Levenshtein algorithm. The edit distance is how many
** changes need to be made to make them identical. The edit distance
** for identical strings is 0.0, the edit distance for completely
** different strings depends on what letters need to change or
** be added in order to make them the same.
** < /summary>
function EditDistance ( ;
sSource as string, ;
sTarget as string ) as decimal
local ;
iLenSource as integer, ;
iLenTarget as integer, ;
iRow as integer, ;
iCol as integer
iLenSource = len( sSource )
iLenTarget = len( sTarget )
if iLenSource == 0.0
return iLenTarget
endif
if iLenTarget == 0.0
return iLenSource
endif
dimension Levenshtein( iLenTarget+1, iLenSource+1 )
for iRow = 1 to iLenTarget + 1
Levenshtein[iRow,1] = iRow -1
endfor
for iCol = 1 to iLenSource + 1
Levenshtein[1,iCol] = iCol -1
endfor
for iCol = 2 to iLenSource + 1
for iRow = 2 to iLenTarget + 1
local ;
dCost as decimal, ;
cColChar as string, ;
cRowChar as string
dCost = 0.0
cColChar = substr( sSource, iCol-1, 1 )
cRowChar = substr( sTarget, iRow-1, 1 )
if cColChar != cRowChar
dCost = 1.0
endif
Levenshtein[iRow,iCol] = ;
min( ;
Levenshtein[iRow-1,iCol] + 1, ;
Levenshtein[iRow,iCol-1] + 1, ;
Levenshtein[iRow-1,iCol-1] + dCost )
endfor
endfor
return Levenshtein[iLenTarget+1,iLenSource+1]
endfunc
** < summary>
** Compares two strings and returns the % match using
** the EditDistance function.
** < /summary>
function PercentMatch( ;
sSource as string, ;
sTarget as string ) as decimal
local ;
iLenSource as integer, ;
iLenTarget as integer, ;
dResult as decimal, ;
dEditDistance as decimal
iLenSource = len( sSource )
iLenTarget = len( sTarget )
if iLenSource + iLenTarget == 0
return 0.0
else
dEditDistance = this.EditDistance( sSource, sTarget)
dResult = (1.0-(dEditDistance/max(iLenSource,iLenTarget))) * 100.0
return dResult
endif
endfunc
enddefine
** < summary>
** This class builds on FuzzyMatch to provide a nifty logic
** to come up with weighted % match between human names. So, if you
** want to search for a certain percentage match as a filter
** of some result set based on a given name as an input, you'd
** create a FuzzyName with those inputs, then compare to your
** result set by calling the FuzzyMatch.PercentMatchTo( fname,
** mname, lname, gen ) method.
** < code>
** oFuzzyName = createobject("FuzzyName", "Robert", "B.", "Calco", "")
** *
** ? oFuzzyName.PercentMatchTo(sFirstName,sMiddleName,sLastname,sGeneration )
** ? oFuzzyName.IsWeakMatchTo(sFirstName,sMiddleName,sLastname,sGeneration )
** ? oFuzzyName.IsStrongMatchTo(sFirstName,sMiddleName,sLastname,sGeneration )
** *
** < /code>
** It also lets you set your own weight (defaults: 20%,20%,50%,10%, respectively).
** You can also redefine a "strong" match (default: 90%) or a "weak" match
** (default 70%) as a two-tier cutoff system.
** < /summary>
define class FuzzyName as FuzzyMatch
** First name component of the FuzzyName
protected sFirstName as string
** Middle initial/name component of the FuzzyName
protected sMiddleInitial as string
** LastName component of the FuzzyName
protected sLastName as string
** Generation (Jr., Sr., III, etc.) component of the FuzzyName
protected sGeneration as string
** Percentage weight to give to the first name
protected dFNWeight as decimal
** Percentage weight to give to the middle initial/name
protected dMIWeight as decimal
** Percentage weight to give to the last name
protected dLNWeight as decimal
** Percentage weight to give to the generation
protected dGWeight as decimal
** Defines the quality of a "strong" match to another name
protected dStrongMatchPct as decimal
** Defines the quality of a "weak" match to another name. Can
** be used to "cutoff" nonsense matches. Define to your requirements.
protected dWeakMatchPct as decimal
** < summary>
** Constructor; sets the name components based on input and
** establishes the default weights and definition of strong vs.
** weak matches based on my experience. Your mileage may vary.
** < /summary>
procedure init( ;
sFirstName as string, ;
sMiddleInitial as string, ;
sLastName as string, ;
sGeneration as string )
* set the pieces and parts of this name
this.sFirstName = sFirstName
this.sMiddleInitial = sMiddleInitial
this.sLastName = sLastName
this.sGeneration = sGeneration
* default weights
this.dFNWeight = 0.2 && First name worth 20%
this.dMIWeight = 0.2 && Middle initial worth 20%
this.dLNWeight = 0.5 && Last name worth 50%
this.dGWeight = 0.1 && Generation worth a measly 10%
* default Strong/Weak match pcts
this.dStrongMatchPct = 90.0 && 90% match or greater
this.dWeakMatchPct = 70.0 && 70% match or greater
endproc
** < summary>
** A way to change the default weights. Call this before
** making any comparisons if you don't like mine.
** < /summary>
procedure SetWeights( ;
dFirstName as decimal, ;
dMiddleInitial as decimal, ;
dLastName as decimal, ;
dGeneration as decimal )
this.dFNWeight = dFirstName
this.dMIWeight = dMiddleInitial
this.dLNWeight = dLastName
this.dGWeight = dGeneration
* TODO throw error if totals != 1.0
endproc
** < summary>
** Sets the definition of a strong match if you think 90%
** is bogus.
** < /summary>
procedure SetStrongMatchPct( dStrongMatchPct as decimal )
this.dStrongMatchPct = dStrongMatchPct
endproc
** < summary>
** Sets the definition of a weak match if you think 70%
** is bogus.
** < /summary>
procedure SetWeakMatchPct( dWeakMatchPct as decimal )
this.dWeakMatchPct = dWeakMatchPct
endproc
** < summary>
** Returns boolean based on whether or not this FuzzyName
** is at least a weak match to the supplied name component inputs.
** < /summary>
function IsWeakMatchTo( ;
sFirstName as string, ;
sMiddleInitial as string, ;
sLastName as string, ;
sGeneration as string ) as Boolean
return this.PercentMatchTo( ;
sFirstName, ;
sMiddleInitial, ;
sLastName, ;
sGeneration ) >= this.dWeakMatchPct
endfunc
** < summary>
** Returns a boolean based on whetehr or not this FuzzyName
** is at least a strong match to the supplied name component inputs.
** < /summary>
function IsStrongMatchTo( ;
sFirstName as string, ;
sMiddleInitial as string, ;
sLastName as string, ;
sGeneration as string ) as Boolean
return this.PercentMatchTo( ;
sFirstName, ;
sMiddleInitial, ;
sLastName, ;
sGeneration ) >= this.dStrongMatchPct
endfunc
** < summary>
** This is the workhorse function that takes the components of
** another name and applies some heuristics to come up with an overall
** weighted match percentage. Should be obvious except for the additional
** logic to catch a particular Special Case; namely, when the only
** difference between the two names is a mismatch between the first
** and middle names. For instance "Bob B. Calco" and "B. Bob Calco".
** This was a common enough issue for us that we decided to go ahead and
** implement it. We basically force the percent match to be 100%-1% or 99%.
** If you see a 99% match, 99.99% of the time it will be for this reason.
** < /summary>
function PercentMatchTo( ;
sFirstName as string, ;
sMiddleInitial as string, ;
sLastName as string, ;
sGeneration as string ) as decimal
local ;
dFirstNamePct as decimal, ;
dMiddleInitialPct as decimal, ;
dLastNamePct as decimal, ;
dGenerationPct as decimal, ;
dResult as decimal, ;
dFudgeFactor as decimal
* Get the % match on the first names. Picking the max of the
* match to the first and middle names.
dFirstNamePct = max( ;
this.PercentMatch( this.sFirstName, sFirstName ), ;
this.PercentMatch( this.sFirstName, sMiddleInitial ) )
* Get the % match on the middle names. This is tricky only because
* of the Special Case.
if this.sMiddleInitial == "" and sMiddleInitial == ""
dMiddleInitialPct = max( ;
100.0, ;
this.PercentMatch( this.sMiddleInitial, sFirstName ) )
else
dMiddleInitialPct = max( ;
this.PercentMatch( this.sMiddleInitial, sMiddleInitial ), ;
this.PercentMatch( this.sMiddleInitial, sFirstName ) )
endif
* Get the % match on the last names
dLastNamePct = this.PercentMatch( this.sLastName, sLastName );
* Get the % match on Generations. the only special consideration
* here is that if they are both blank, rather than penalize the
* whole name and lose 10% of an otherwise perfect match, we'll call
* it 100% match---since it's true: both names have no supplied generation.
if this.sGeneration == "" and sGeneration == ""
dGenerationPct = 100.0
else
dGenerationPct = this.PercentMatch( this.sGeneration, sGeneration)
endif
* Return the weighted %, with adjustment for Special Case
if this.sFirstName == sMiddleInitial and this.sMiddleInitial == sFirstName
dFudgeFactor = 1.0
else
dFudgeFactor = 0.0
endif
dResult = ;
(this.dFNWeight * dFirstNamePct) + ;
(this.dMIWeight * dMiddleInitialPct) + ;
(this.dLNWeight * dLastNamePct ) + ;
(this.dGWeight * dGenerationPct ) - dFudgeFactor
return dResult
endfunc
enddefine

It would seem that the VFP implementation of this algorithm could be better. Below is what I come up with. I've opted for case-insensitive. And, it should be noted that when used against a dictionary (such as in a spellchecker), I would change the function below so that value sent in to tcSource would already be uppercase and a third parameter would be added so the Len + 1 of tcSource could also be sent in... both of these values (uppercase of tcSource and len + 1 of tcSource) would be held outside the scan (or whatever was calling the function) in variables as they would never changed during the pass through the dictionary table. --
Craig SBoyd
**********************************
FUNCTION LevenshteinDistance(tcTarget, tcSource)
**********************************
LOCAL lnLenTarget, lnLenSource, lnRow, lnCol, lnTempRow, lnTempCol, lnCost
lnLenTarget = LEN(tcTarget) + 1
lnLenSource = LEN(tcSource) + 1
DIMENSION aryTable(lnLenTarget, lnLenSource)
STORE 0 TO aryTable
*!* Remove the following two lines of code
*!* if you prefer a case-sensitive distance
tcTarget = UPPER(tcTarget)
tcSource = UPPER(tcSource)
FOR lnCol = 1 TO lnLenSource
aryTable[1, lnCol] = lnCol - 1
ENDFOR
FOR lnRow = 2 TO lnLenTarget
lnTempRow = lnRow - 1
aryTable[lnRow, 1] = lnTempRow
FOR lnCol = 2 TO lnLenSource
lnTempCol = lnCol - 1
IF SUBSTR(tcTarget, lnTempRow, 1) == SUBSTR(tcSource, lnTempCol, 1)
lnCost = 0
ELSE
lnCost = 1
ENDIF
aryTable[lnRow, lnCol] = MIN(aryTable[lnTempRow, lnCol] + 1, ;
aryTable[lnRow, lnTempCol] + 1, ;
aryTable[lnTempRow, lnTempCol] + lnCost)
ENDFOR
ENDFOR
RETURN aryTable[lnLenTarget, lnLenSource]
ENDFUNC
I had to compare some fields on middle sized tables, and found the performance of both versions somewhat lacking and decided to code a bare bones version starting from Craig's code, which calculates the percentage difference as well. This is not totally in line with "a function should do one thing only", but in my case the similarity is best described with both values and minimizes overhead (just one line, as anything else is already there). Any string manipulation, like the Upper() in Craig's code, should be done outside the function in my opinion. A previous call to Alltrim() is highly reccommended if you are working from table fields. A few more cycles could be eliminated by previous mapping of tcSource a dedicated array, but that would add nearly a quarter of the code for savings in the 2% range. The code is shorter, and - at least for me - easier to visualize, as the base variables are the loop counters and the different rows and columns are easy to spot. Depending on the Data I throw at it, this is between 30-40% faster than Craig's implementation and takes about half the time of the first implementation. But if I have to use this regularly or on large tables I probably will build a fll, even if the "tightlooped" code is now only one rather long line. I've also added the
DamerauLevenShteinDistance, which allows for transposition. This is nearly identical code, but as most of the difference is inside the inner loop and therefore nearly always called, I've chosen the more ugly code duplication to get utmost speed for working on potentially large tables instead of a more elegant solution.
What is DamerauLevenshteinAlgorithm? In short: the distance "Levenshtein" vs. "Levensthein" as measured in Levenshtein is 2, whereas in
DamerauLevenshtein it is 1, as a transposition is counted as 1 exchange instead of the 2 changes in the traditional Levenshtein.
ThomasGanss
#define C_NDISTANCE aryTable[m.lnLenRows, m.lnLenCols]
**********************************
FUNCTION LevenshteinDist_Perc(tcTarget, tcSource, tnOptPercDiff)
**********************************
LOCAL lnDistance, lnLenTarget, lnLenSource, lcRowChar, lnNxtRow, lnRow, lnCol
lnLenRows = LEN(m.tcTarget) + 1
lnLenCols = LEN(m.tcSource) + 1
DIMENSION aryTable(m.lnLenRows, m.lnLenCols)
STORE 0 TO aryTable
FOR lnCol = 1 TO m.lnLenCols
aryTable[1, m.lnCol] = m.lnCol - 1
ENDFOR
FOR lnRow = 1 TO m.lnLenRows - 1
lnNxtRow = m.lnRow + 1
aryTable[m.lnNxtRow, 1] = m.lnRow
lcRowChar = SUBSTR(m.tcTarget, m.lnRow, 1)
FOR lnCol = 1 TO m.lnLenCols - 1
aryTable[m.lnNxtRow, m.lnCol+1] = MIN( ;
aryTable[m.lnRow, m.lnCol] + IIF( m.lcRowChar==SUBSTR(m.tcSource,m.lnCol,1), 0, 1), ;
aryTable[m.lnRow, m.lnCol+1] + 1, ;
aryTable[m.lnNxtRow, m.lnCol] + 1 )
ENDFOR
ENDFOR
tnOptPercDiff = IIF(m.lnLenRows + m.lnLenCols = 2, 0.0, ;
100.0 - 100.0 * C_NDISTANCE / (max(m.lnLenRows, m.lnLenCols)-1) )
RETURN C_NDISTANCE
ENDFUNC
**********************************
FUNCTION DamerauLevenshteinDist_Perc(tcTarget, tcSource, tnOptPercDiff)
**********************************
LOCAL lnDistance, lnLenTarget, lnLenSource, lcLastRowChar, lcRowChar, lnNxtRow, lnRow, lnCol
lnLenRows = LEN(m.tcTarget) + 1
lnLenCols = LEN(m.tcSource) + 1
DIMENSION aryTable(m.lnLenRows, m.lnLenCols)
STORE 0 TO aryTable
FOR lnCol = 1 TO m.lnLenCols
aryTable[1, m.lnCol] = m.lnCol - 1
ENDFOR
lcRowChar = ""
FOR lnRow = 1 TO m.lnLenRows - 1
lnNxtRow = m.lnRow + 1
aryTable[m.lnNxtRow, 1] = m.lnRow
lcLastRowChar = m.lcRowChar
lcRowChar = SUBSTR(m.tcTarget, m.lnRow, 1)
FOR lnCol = 1 TO m.lnLenCols - 1
aryTable[m.lnNxtRow, m.lnCol+1] = MIN( ;
aryTable[m.lnRow, m.lnCol] + IIF( m.lcRowChar==SUBSTR(m.tcSource,m.lnCol,1), 0, 1), ;
aryTable[m.lnRow, m.lnCol+1] + 1, ;
aryTable[m.lnNxtRow, m.lnCol] + 1 )
IF m.lcLastRowChar == SUBSTR(m.tcSource,m.lnCol,1) ;
and m.lcRowChar == SUBSTR(m.tcSource,m.lnCol-1,1)
aryTable[m.lnNxtRow, m.lnCol+1] = min( aryTable[m.lnNxtRow, m.lnCol+1], ;
aryTable[m.lnRow-1, m.lnCol-1] + 1) && // transposition
endif
ENDFOR
ENDFOR
tnOptPercDiff = IIF(m.lnLenRows + m.lnLenCols = 2, 0.0, ;
100.0 - 100.0 * C_NDISTANCE / (max(m.lnLenRows, m.lnLenCols)-1) )
RETURN C_NDISTANCE
ENDFUNC
Quite interesting idea this, oddly someone asked me if foxpro had a "percentage similarity of strings" function the other day.
I looked up this article and saw your take on it, but isn't all that array manipulation hard work?
Have I missed something in my simplification?
Sorry, I broke Hungarian Notation (I only use FPW here and I never really saw the use of having the variable type in the prefix.)
**********************************
FUNCTION levens2
**********************************
PARAMETERS m.pstring1, m.pstring2, m.ptype
PRIVATE ALL LIKE l*
* return Levenshtein distance between 2 strings
* ie number of characters difference
* pstring1 - a string
* pstring2 - another string
* ptype - 1 for number of characters
* 2 for percentage similarity
*!* Remove the following two lines of code
*!* if you prefer a case-sensitive distance
m.pstring1 = UPPER(m.pstring1)
m.pstring2 = UPPER(m.pstring2)
m.lminlen=MIN(LEN(m.pstring1), LEN(m.pstring2))
m.lmaxlen=MAX(LEN(m.pstring1), LEN(m.pstring2))
m.ldist=0
FOR m.lchar=1 TO m.lminlen
m.ldist=m.ldist+IIF(SUBSTR(m.pstring1, m.lchar, 1)==SUBSTR(m.pstring2, m.lchar, 1), 0, 1)
NEXT m.lchar
m.ldist=m.ldist+(m.lmaxlen-m.lminlen)
RETURN IIF(m.ptype=1, m.ldist, 100*(1.0-(m.ldist/m.lmaxlen)))
and a quick Wiki question.. How come in the "pre formatted text" tags above, it's suppressing adjacent spaces in my comments as per standard HTML?
Yes, you missed something :-). The Levenshtein-algorithm counts the changes needed in order to change string1 to string2. There are 3 possible operations: change one letter, insert a letter and remove one letter. The Levenshtein-distance of "colorful"/"colourful" is 1 because the only operation needed is inserting a 'u' - while your algorithm returns 5. Btw. I would keep your function in this wikipage because one sees better what Levenshtein is and what not.
-- Kurt (glad to see someone else in the fox-community also doubts hungarian notation)
Category Code Samples Category Algorithms