Python: find a similarity metric between two strings

Question

Asked by tenstar

How do I get the probability of a string being similar to another string in Python?

I want to get a decimal value like 0.9 (meaning 90%). Preferably using only Python's standard library.

For example:

similar("Apple", "Appel")  # would have a high probability
similar("Apple", "Mango")  # would have a lower probability

Answer 1

Accepted answer by Inbar Rose

There is a built-in for this: SequenceMatcher from the standard library's difflib module.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
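
For reference, the difflib documentation defines ratio() as 2.0*M/T, where T is the total number of elements in both sequences and M is the number of matches. A minimal sketch that recomputes the value by hand from get_matching_blocks() (the helper name manual_ratio is made up for illustration):

```python
from difflib import SequenceMatcher

def manual_ratio(a, b):
    """Recompute SequenceMatcher.ratio() as 2*M/T (hypothetical helper)."""
    matcher = SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the final block is a size-0 sentinel)
    matches = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # difflib returns 1.0 when both sequences are empty
    return 2.0 * matches / total if total else 1.0

print(manual_ratio("Apple", "Appel"))  # 0.8: 4 matched characters out of 10
```

This makes it easier to see why "Apple" and "Mango" score 0.0: the comparison is case-sensitive, so "A" and "a" do not match.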

Answer 2

Answer by Saullo G. P. Castro

You can create a function like:

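def similar(w1, w2):
    # Pad the shorter word with spaces so both have the same length
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    # Fraction of positions holding the same character
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))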
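
Note that this compares characters position by position, so one inserted character shifts everything after it. A close variant, sketched with itertools.zip_longest to make that behaviour easy to try (the name positional_similarity and the empty-string convention are assumptions for illustration):

```python
from itertools import zip_longest

def positional_similarity(w1, w2):
    """Fraction of positions at which both strings hold the same character."""
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    # zip_longest pads the shorter string with None, which never equals a character
    same = sum(1 for c1, c2 in zip_longest(w1, w2) if c1 == c2)
    return same / longest

print(positional_similarity("Apple", "Appel"))   # 0.6
print(positional_similarity("Mango", "xMango"))  # 0.0, every position is shifted
```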

Answer 3

Answer by hbprotoss

I think maybe you are looking for an algorithm describing the distance between strings. Here are some you may refer to:
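
As one concrete instance of such a distance, here is a minimal pure-Python sketch of Levenshtein (edit) distance, plus one common way to scale it into a 0-to-1 similarity. The function names and the normalisation by the longer string's length are choices made for this illustration, not part of any particular library:

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Scale the distance into [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

print(levenshtein("Apple", "Appel"))             # 2
print(levenshtein_similarity("Apple", "Appel"))  # 0.6
```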

Answer 4

Answer by BLT

Fuzzy Wuzzy is a package that implements Levenshtein distance in Python, with helper functions for situations where you may want two distinct strings to be considered identical. For example:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
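
The idea behind a token-sort comparison can be approximated with the standard library alone: lowercase both strings, sort their words, and compare the results. This is only a rough stand-in built on difflib, not Fuzzy Wuzzy's implementation; the function name is hypothetical and the score is on a 0-to-1 scale rather than 0 to 100:

```python
from difflib import SequenceMatcher

def token_sort_similarity(s1, s2):
    """Compare strings after lowercasing and sorting their whitespace-separated tokens."""
    def normalize(s):
        return " ".join(sorted(s.lower().split()))
    return SequenceMatcher(None, normalize(s1), normalize(s2)).ratio()

print(token_sort_similarity("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 1.0
```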

Answer 5

Answer by Enrique Pérez Herrero

The distance package includes Levenshtein distance:

import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3

Answer 6

Answer by Iman Mirzadeh

Solution #1: Python built-in

Use SequenceMatcher from difflib.

Pros: part of the Python standard library, so no extra package is needed.
Cons: limited; many other good string-similarity algorithms exist.

Example:

>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75

Solution #2: jellyfish library

It's a very good library with good coverage and few issues. It supports:
- Levenshtein Distance
- Damerau-Levenshtein Distance
- Jaro Distance
- Jaro-Winkler Distance
- Match Rating Approach Comparison
- Hamming Distance

Pros: easy to use, wide range of supported algorithms, well tested.
Cons: not part of the standard library.

Example:

>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> # newer jellyfish releases name this function jaro_similarity
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1
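
To show how the listed measures differ, here is a small pure-Python sketch of Hamming distance, the simplest of them. It is an illustration only: this version rejects strings of unequal length, which a library may handle differently.

```python
def hamming_distance(s1, s2):
    """Count positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("this sketch only handles equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("jellyfish", "jellyfihs"))  # 2
```

The swapped "sh"/"hs" counts as two differing positions here, whereas Damerau-Levenshtein treats it as a single transposition, which is why the example above reports 1.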

Answer 7

Answer by damio

The built-in SequenceMatcher is very slow on large inputs. Here's how it can be done with diff-match-patch:

from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0  # 0 disables the timeout, so the diff runs to completion
    diff = dmp.diff_main(text1, text2, False)

    # Similarity: characters in unchanged (op == 0) chunks over the longer input's length
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    sim = common_text / text_length

    return sim, diff

Answer 8

Answer by Chris_Rands

Note that difflib.SequenceMatcher only finds the longest contiguous matching subsequence, which is often not what is desired. For example:

>>> from difflib import SequenceMatcher
>>> a1 = "Apple"
>>> a2 = "Appel"
>>> a1 *= 50
>>> a2 *= 50
>>> SequenceMatcher(None, a1, a2).ratio()
0.012  # very low
>>> SequenceMatcher(None, a1, a2).get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=250, b=250, size=0)]  # only the first block is recorded
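
The very low ratio in this example is consistent with difflib's autojunk heuristic: when the second sequence has at least 200 items, any item whose duplicates make up more than 1% of it is treated as junk, and in these 250-character strings every character qualifies. A short sketch comparing the default with autojunk=False (variable names are arbitrary; turning the heuristic off can make large comparisons slower):

```python
from difflib import SequenceMatcher

a1 = "Apple" * 50
a2 = "Appel" * 50

# Default: autojunk=True, so the frequent characters in these long strings are ignored
default_ratio = SequenceMatcher(None, a1, a2).ratio()
# Same comparison with the heuristic switched off
full_ratio = SequenceMatcher(None, a1, a2, autojunk=False).ratio()

print(default_ratio, full_ratio)
```

With the heuristic disabled the ratio should come out far higher, in line with the score for a single "Apple"/"Appel" pair.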

Finding the similarity between two strings is closely related to pairwise sequence alignment in bioinformatics. There are many dedicated libraries for this, including Biopython. This example uses the Needleman-Wunsch algorithm:

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.score(a1, a2)
200.0
>>> aligner.algorithm
'Needleman-Wunsch'
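
For readers without Biopython, the scoring part of Needleman-Wunsch fits in a few lines of dynamic programming. The sketch below is not Biopython's code: the function name is hypothetical, it uses a simple linear gap penalty, and the default scores (match 1, mismatch 0, gap 0) are an assumption chosen to line up with the output shown above.

```python
def global_alignment_score(s1, s2, match=1.0, mismatch=0.0, gap=0.0):
    """Needleman-Wunsch score with a linear gap penalty (score only, no traceback)."""
    # previous[j] is the best score for aligning the processed prefix of s1 with s2[:j]
    previous = [j * gap for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [i * gap]
        for j, c2 in enumerate(s2, start=1):
            diagonal = previous[j - 1] + (match if c1 == c2 else mismatch)
            current.append(max(diagonal,               # align c1 with c2
                               previous[j] + gap,      # gap in s2
                               current[j - 1] + gap))  # gap in s1
        previous = current
    return previous[-1]

print(global_alignment_score("Apple", "Appel"))  # 4.0
```

With these particular scores the result equals the length of the longest common subsequence; other scoring schemes penalise mismatches and gaps.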

Using Biopython or another bioinformatics package is more flexible than any part of the Python standard library, since many different scoring schemes and algorithms are available. You can also retrieve the aligned sequences to visualise what is happening:

>>> alignment = next(aligner.align(a1, a2))
>>> alignment.score
200.0
>>> print(alignment)
Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-
|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-
App-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-el

Answer 9

Answer by Mike

You can find most text similarity methods, and how they are calculated, at this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity

Here are some examples:

Normalized, metric, similarity and distance
(Normalized) similarity and distance
Metric distances
Shingles (n-gram) based similarity and distance
Levenshtein
Normalized Levenshtein
Weighted Levenshtein
Damerau-Levenshtein
Optimal String Alignment
Jaro-Winkler
Longest Common Subsequence
Metric Longest Common Subsequence
N-Gram
Shingle(n-gram) based algorithms
Q-Gram
Cosine similarity
Jaccard index
Sorensen-Dice coefficient
Overlap coefficient (i.e., Szymkiewicz-Simpson)
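
To give a feel for the shingle-based measures in this list, here is a minimal set-based sketch of the Jaccard index and the Sorensen-Dice coefficient over character bigrams. It is not the linked library's code; the function names, the default k=2, and the handling of strings too short to have shingles are assumptions made for illustration.

```python
def shingles(text, k=2):
    """Set of overlapping character k-grams (shingles) in text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard_index(s1, s2, k=2):
    """Size of the intersection over size of the union of the two shingle sets."""
    a, b = shingles(s1, k), shingles(s2, k)
    if not a and not b:
        return 1.0  # assumption: treat two shingle-free strings as identical
    return len(a & b) / len(a | b)

def sorensen_dice(s1, s2, k=2):
    """Twice the intersection size over the sum of the two set sizes."""
    a, b = shingles(s1, k), shingles(s2, k)
    if not a and not b:
        return 1.0  # same assumption as above
    return 2 * len(a & b) / (len(a) + len(b))

print(jaccard_index("Apple", "Appel"))  # 2 shared bigrams out of 6 distinct ones
print(sorensen_dice("Apple", "Appel"))  # 0.5
```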
