When to use which fuzz function to compare 2 strings

Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31806695/

Time: 2020-08-19 10:33:54. Source: igfitidea

python string-comparison fuzzywuzzy

Asked by Pot

I am learning fuzzywuzzy in Python.

I understand the concept of fuzz.ratio, fuzz.partial_ratio, fuzz.token_sort_ratio and fuzz.token_set_ratio. My question is: when should I use which function?

  • Should I check the 2 strings' lengths first and, if they are not similar, rule out fuzz.partial_ratio?
  • If the 2 strings' lengths are similar, should I use fuzz.token_sort_ratio?
  • Should I always use fuzz.token_set_ratio?

Does anyone know what criteria SeatGeek uses?

I am trying to build a real estate website and am thinking of using fuzzywuzzy to compare addresses.

Accepted answer by Rick Hanlon II

Great question.

I'm an engineer at SeatGeek, so I think I can help here. We have a great blog post that explains the differences quite well, but I can summarize and offer some insight into how we use the different types.

Overview

Under the hood, each of the four methods calculates the edit distance between some ordering of the tokens in both input strings. This is done using the difflib.ratio function (SequenceMatcher.ratio in Python's difflib module), which will:

Return a measure of the sequences' similarity (float in [0,1]).

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1 if the sequences are identical, and 0 if they have nothing in common.

The four fuzzywuzzy methods call difflib.ratio on different combinations of the input strings.

fuzz.ratio

Simple. Just calls difflib.ratio on the two input strings.

fuzz.ratio("NEW YORK METS", "NEW YORK MEATS")
> 96
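As a rough illustration of the 2.0*M / T measure, here is a minimal sketch that uses only the standard library's difflib.SequenceMatcher. The helper name simple_ratio is made up for this example, and rounding to an integer is an assumption; fuzzywuzzy itself may preprocess the strings or use a different backend, so its scores can differ slightly.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two strings as an integer from 0 to 100 (hypothetical helper)."""
    # SequenceMatcher.ratio() returns 2.0 * M / T as a float in [0, 1].
    return round(100 * SequenceMatcher(None, a, b).ratio())


# M = 13 matching characters, T = 13 + 14 = 27, so 2 * 13 / 27 is about 0.963.
print(simple_ratio("NEW YORK METS", "NEW YORK MEATS"))  # 96
```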

fuzz.partial_ratio

Attempts to account for partial string matches better. Calls ratio using the shorter string (length n) against all n-length substrings of the longer string and returns the highest score.

Notice here that "YANKEES" is the shorter string (length 7), and we run ratio with "YANKEES" against all substrings of length 7 of "NEW YORK YANKEES" (which includes checking against "YANKEES", a 100% match):

fuzz.ratio("YANKEES", "NEW YORK YANKEES")
> 60
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES")
> 100
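The sliding-window idea described above can be sketched with the standard library alone. This is a brute-force version written for illustration, not the library's implementation (which may choose its candidate windows differently); the names simple_ratio and brute_force_partial_ratio are hypothetical, and returning 0 for an empty input is an assumption.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity based on SequenceMatcher (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def brute_force_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against every same-length window of the longer one."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0  # assumption: an empty input counts as "no match"
    n = len(shorter)
    return max(
        simple_ratio(shorter, longer[i:i + n])
        for i in range(len(longer) - n + 1)
    )


# One of the 7-character windows of the longer string is exactly "YANKEES".
print(brute_force_partial_ratio("YANKEES", "NEW YORK YANKEES"))  # 100
```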

fuzz.token_sort_ratio

Attempts to account for similar strings that are out of order. Calls ratio on both strings after sorting the tokens in each string. Notice here that fuzz.ratio and fuzz.partial_ratio both fail, but once you sort the tokens it's a 100% match:

fuzz.ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 45
fuzz.partial_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 45
fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 100
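A minimal sketch of the sort-then-compare idea, again using only difflib. The helper names are hypothetical, and the tokenization (lowercase, split on whitespace) is a simplifying assumption; the library's own preprocessing may treat punctuation and case differently.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity based on SequenceMatcher (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sort_tokens(s: str) -> str:
    """Lowercase, split on whitespace, sort the tokens and re-join them."""
    return " ".join(sorted(s.lower().split()))


def simple_token_sort_ratio(a: str, b: str) -> int:
    """Compare the two strings after putting their tokens in a canonical order."""
    return simple_ratio(sort_tokens(a), sort_tokens(b))


a = "New York Mets vs Atlanta Braves"
b = "Atlanta Braves vs New York Mets"
print(sort_tokens(a))                 # atlanta braves mets new vs york
print(simple_token_sort_ratio(a, b))  # 100
```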

fuzz.token_set_ratio

Attempts to rule out differences in the strings. Calls ratio on three particular substring sets and returns the max:

  1. the intersection alone vs. the intersection plus the remainder of string one
  2. the intersection alone vs. the intersection plus the remainder of string two
  3. the intersection plus the remainder of string one vs. the intersection plus the remainder of string two

Notice that by splitting up the intersection and remainders of the two strings, we're accounting for both how similar and how different the two strings are:

fuzz.ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 36
fuzz.partial_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 61
fuzz.token_sort_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 51
fuzz.token_set_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 91
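The three comparisons can be sketched as follows. This is an illustrative reading of the description, built on difflib only; the function names are hypothetical and the lowercase/whitespace tokenization is an assumption, so results on other inputs need not match the library's.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity based on SequenceMatcher (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_token_set_ratio(a: str, b: str) -> int:
    """Max of the three pairwise comparisons built from the token intersection."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))   # intersection only
    only_a = " ".join(sorted(tokens_a - tokens_b))   # remainder of string one
    only_b = " ".join(sorted(tokens_b - tokens_a))   # remainder of string two
    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()
    return max(
        simple_ratio(common, combined_a),
        simple_ratio(common, combined_b),
        simple_ratio(combined_a, combined_b),
    )


# "angels mariners" vs "angels mariners vs" is the best of the three pairs.
print(simple_token_set_ratio(
    "mariners vs angels",
    "los angeles angels of anaheim at seattle mariners",
))  # 91
```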

Application

This is where the magic happens. At SeatGeek, essentially we create a vector score with each ratio for each data point (venue, event name, etc.) and use that to inform programmatic decisions of similarity that are specific to our problem domain.
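To make the "vector score" idea concrete, here is one possible shape it could take. This is purely a sketch under assumptions: the answer does not show SeatGeek's code, so the record fields, the SCORERS table and the helper names below are all hypothetical, and the two stand-in scorers are built on difflib rather than on fuzzywuzzy.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity based on SequenceMatcher (hypothetical stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_token_sort_ratio(a: str, b: str) -> int:
    """Compare after sorting lowercase whitespace-separated tokens (stand-in scorer)."""
    def canon(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return simple_ratio(canon(a), canon(b))


# Hypothetical table of scorers; a real setup might list the fuzz.* functions here.
SCORERS = {"ratio": simple_ratio, "token_sort_ratio": simple_token_sort_ratio}


def score_vector(a: str, b: str) -> dict:
    """One score per scorer for a single pair of strings."""
    return {name: scorer(a, b) for name, scorer in SCORERS.items()}


def compare_records(rec_a: dict, rec_b: dict, fields: list) -> dict:
    """One score vector per field (e.g. venue, event name)."""
    return {field: score_vector(rec_a[field], rec_b[field]) for field in fields}


# Made-up example records.
left = {"venue": "Citi Field", "event": "Mets vs Braves"}
right = {"venue": "Citi Field", "event": "Braves vs Mets"}
print(compare_records(left, right, ["venue", "event"]))
```

How the resulting vectors are turned into a match decision (thresholds, weights, a trained model) is domain specific and is not covered by the answer.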

That being said, truth be told, it doesn't sound like FuzzyWuzzy is useful for your use case. It will be tremendously bad at determining whether two addresses are similar. Consider two possible addresses for SeatGeek HQ: "235 Park Ave Floor 12" and "235 Park Ave S. Floor 12":

fuzz.ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 93
fuzz.partial_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 85
fuzz.token_sort_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 95
fuzz.token_set_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 100

FuzzyWuzzy gives these strings a high match score, but one address is our actual office near Union Square and the other is on the other side of Grand Central.

For your problem you would be better off using the Google Geocoding API.

Answer by Dennis Golomazov

As of June 2017, fuzzywuzzy also includes some other comparison functions. Here is an overview of the ones missing from the accepted answer (taken from the source code):

fuzz.partial_token_sort_ratio

Same algorithm as in token_sort_ratio, but instead of applying ratio after sorting the tokens, uses partial_ratio.

fuzz.token_sort_ratio("New York Mets vs Braves", "Atlanta Braves vs New York Mets")
> 85
fuzz.partial_token_sort_ratio("New York Mets vs Braves", "Atlanta Braves vs New York Mets")
> 100    
fuzz.token_sort_ratio("React.js framework", "React.js")
> 62
fuzz.partial_token_sort_ratio("React.js framework", "React.js")
> 100
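Combining the two ideas, a sketch of "sort the tokens, then take the best window" might look like this. The helpers are hypothetical and built on difflib; the brute-force window search and the lowercase/whitespace tokenization are assumptions, so scores for other inputs may differ from the library's.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity based on SequenceMatcher (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def brute_force_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against each same-length window of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0  # assumption: an empty input counts as "no match"
    n = len(shorter)
    return max(simple_ratio(shorter, longer[i:i + n])
               for i in range(len(longer) - n + 1))


def sort_tokens(s: str) -> str:
    """Lowercase, split on whitespace, sort the tokens and re-join them."""
    return " ".join(sorted(s.lower().split()))


def simple_partial_token_sort_ratio(a: str, b: str) -> int:
    """Sort the tokens of both strings, then compare with the partial scorer."""
    return brute_force_partial_ratio(sort_tokens(a), sort_tokens(b))


# "braves mets new vs york" is a substring of "atlanta braves mets new vs york".
print(simple_partial_token_sort_ratio(
    "New York Mets vs Braves", "Atlanta Braves vs New York Mets"))  # 100
```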

fuzz.partial_token_set_ratio

Same algorithm as in token_set_ratio, but instead of applying ratio to the sets of tokens, uses partial_ratio.

fuzz.token_set_ratio("New York Mets vs Braves", "Atlanta vs New York Mets")
> 82
fuzz.partial_token_set_ratio("New York Mets vs Braves", "Atlanta vs New York Mets")
> 100    
fuzz.token_set_ratio("React.js framework", "Reactjs")
> 40
fuzz.partial_token_set_ratio("React.js framework", "Reactjs")
> 71   
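The same substitution can be sketched for the token-set variant. Under this reading, any shared token makes the intersection string a prefix of both combined strings, which is why a partial comparison reaches 100 so easily. The helpers are hypothetical, difflib-based, and use a simplified tokenization, so scores for strings with punctuation (such as the React.js example) may not match the library's.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity based on SequenceMatcher (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def brute_force_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against each same-length window of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0  # assumption: an empty input counts as "no match"
    n = len(shorter)
    return max(simple_ratio(shorter, longer[i:i + n])
               for i in range(len(longer) - n + 1))


def simple_partial_token_set_ratio(a: str, b: str) -> int:
    """Token-set comparison where each pair is scored with the partial scorer."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    combined_a = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    combined_b = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(
        brute_force_partial_ratio(common, combined_a),
        brute_force_partial_ratio(common, combined_b),
        brute_force_partial_ratio(combined_a, combined_b),
    )


# The shared tokens "mets new vs york" are a prefix of both combined strings.
print(simple_partial_token_set_ratio(
    "New York Mets vs Braves", "Atlanta vs New York Mets"))  # 100
```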

fuzz.QRatio, fuzz.UQRatio

Just wrappers around fuzz.ratio with some validation and short-circuiting, included here for completeness. UQRatio is a unicode version of QRatio.

fuzz.WRatio

An attempt to weight (the name stands for 'Weighted Ratio') results from different algorithms to calculate the 'best' score. Description from the source code:

1. Take the ratio of the two processed strings (fuzz.ratio)
2. Run checks to compare the length of the strings
    * If one of the strings is more than 1.5 times as long as the other
      use partial_ratio comparisons - scale partial results by 0.9
      (this makes sure only full results can return 100)
    * If one of the strings is over 8 times as long as the other
      instead scale by 0.6
3. Run the other ratio functions
    * if using partial ratio functions call partial_ratio,
      partial_token_sort_ratio and partial_token_set_ratio
      scale all of these by the ratio based on length
    * otherwise call token_sort_ratio and token_set_ratio
    * all token based comparisons are scaled by 0.95
      (on top of any partial scalars)
4. Take the highest value from these results
   round it and return it as an integer.
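The steps above can be sketched roughly as follows. This follows the quoted description rather than the library's source, so boundary cases (for example a length ratio of exactly 1.5) and preprocessing are assumptions; all function names are hypothetical and the scorers are difflib-based stand-ins.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """0-100 similarity; empty input scores 0 (assumption)."""
    if not a or not b:
        return 0.0
    return 100 * SequenceMatcher(None, a, b).ratio()


def partial_ratio(a: str, b: str) -> float:
    """Best score of the shorter string against each same-length window of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0.0
    n = len(shorter)
    return max(ratio(shorter, longer[i:i + n]) for i in range(len(longer) - n + 1))


def token_sort(a: str, b: str, scorer) -> float:
    """Score the two strings after sorting their whitespace-separated tokens."""
    return scorer(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def token_set(a: str, b: str, scorer) -> float:
    """Max of the three intersection/remainder comparisons."""
    ta, tb = set(a.split()), set(b.split())
    common = " ".join(sorted(ta & tb))
    combined_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    combined_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(scorer(common, combined_a), scorer(common, combined_b),
               scorer(combined_a, combined_b))


def weighted_ratio_sketch(a: str, b: str) -> int:
    a, b = a.lower().strip(), b.lower().strip()  # simplified preprocessing
    if not a or not b:
        return 0
    base = ratio(a, b)                                       # step 1
    length_ratio = max(len(a), len(b)) / min(len(a), len(b))  # step 2
    if length_ratio > 1.5:
        scale = 0.6 if length_ratio > 8 else 0.9
        candidates = [                                        # step 3, partial branch
            base,
            scale * partial_ratio(a, b),
            scale * 0.95 * token_sort(a, b, partial_ratio),
            scale * 0.95 * token_set(a, b, partial_ratio),
        ]
    else:
        candidates = [                                        # step 3, full branch
            base,
            0.95 * token_sort(a, b, ratio),
            0.95 * token_set(a, b, ratio),
        ]
    return round(max(candidates))                             # step 4


print(weighted_ratio_sketch("new york mets", "new york mets"))  # 100
print(weighted_ratio_sketch("YANKEES", "NEW YORK YANKEES"))     # 90
```

The second call shows the effect of the 0.9 scaling: a perfect partial match is capped at 90, so only a full match can return 100.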

fuzz.UWRatio

Unicode version of WRatio.