Disclaimer: This page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/46975929/

Date: 2020-08-19 17:57:45  Source: igfitidea

How can I calculate the Jaccard Similarity of two lists containing strings in Python?

Tags: python, python-3.x, similarity

Asked by Aventinus

I have two lists with usernames and I want to calculate the Jaccard similarity. Is it possible?

This thread shows how to calculate the Jaccard similarity between two strings; however, I want to apply it to two lists, where each element is one word (e.g., a username).

Answered by Aventinus

I ended up writing my own solution after all:

def jaccard_similarity(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(list1) + len(list2)) - intersection
    return float(intersection) / union

Answered by iamlcc

@aventinus I don't have enough reputation to comment on your answer, but just to make things clearer: your solution measures the jaccard_similarity, but the function is misnamed as jaccard_distance, which is actually 1 - jaccard_similarity.
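
To make the relationship between the two quantities concrete, here is a minimal sketch. The function bodies, the name jaccard_distance, and the sample usernames are illustrative assumptions, not code from the answers in this thread:

```python
def jaccard_similarity(list1, list2):
    """Size of the set intersection divided by the size of the set union."""
    s1, s2 = set(list1), set(list2)
    return len(s1 & s2) / len(s1 | s2)


def jaccard_distance(list1, list2):
    """Jaccard distance is the complement of the similarity."""
    return 1 - jaccard_similarity(list1, list2)


# Hypothetical usernames, used only for illustration
users_a = ['alice', 'bob', 'carol']
users_b = ['bob', 'carol', 'dave']

print(jaccard_similarity(users_a, users_b))  # 0.5 (2 shared names out of 4 distinct)
print(jaccard_distance(users_a, users_b))    # 0.5
```

Identical inputs give a similarity of 1.0 and a distance of 0.0, which is a quick way to check which of the two a given function returns.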

Answered by w4bo

For Python 3:

def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return len(s1.intersection(s2)) / len(s1.union(s2))

list1 = ['dog', 'cat', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
jaccard_similarity(list1, list2)  # returns 0.5

For Python 2, use return len(s1.intersection(s2)) / float(len(s1.union(s2))) instead.
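
One edge case the set-based version does not cover is two empty lists: the union is empty, so the division raises ZeroDivisionError. A possible guard is sketched below. The name jaccard_similarity_safe and the empty_value parameter are hypothetical, and returning 1.0 for two empty inputs is only a convention assumed here, not something stated in the answers:

```python
def jaccard_similarity_safe(list1, list2, empty_value=1.0):
    """Set-based Jaccard similarity that avoids dividing by zero.

    empty_value is an assumed convention for the 0/0 case
    (both inputs empty); pick whatever suits your application.
    """
    s1, s2 = set(list1), set(list2)
    union = s1 | s2
    if not union:
        # Both inputs are empty, so the ratio is undefined
        return empty_value
    return len(s1 & s2) / len(union)


print(jaccard_similarity_safe([], []))  # 1.0 (the assumed convention)
print(jaccard_similarity_safe(['dog', 'cat', 'cat', 'rat'],
                              ['dog', 'cat', 'mouse']))  # 0.5
```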

Answered by klaus

Assuming your usernames don't repeat, you can use the same idea:

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

list1 = ['dog', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
# The intersection is ['dog', 'cat']
# The union is ['dog', 'cat', 'rat', 'mouse']
words1 = set(list1)
words2 = set(list2)
jaccard(words1, words2)  # returns 0.5

Answered by LaSul

You can use the Distance library:

#pip install Distance

import distance

distance.jaccard("decide", "resize")

# Returns 0.7142857142857143 (a distance, not a similarity)
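
The value shown above equals 1 - 2/7, which is consistent with a Jaccard distance computed over the sets of characters in the two strings ("decide" and "resize" share 2 of 7 distinct characters). The sketch below reproduces that number with the standard library only. It assumes the character-set interpretation, and char_jaccard_distance is an illustrative name, not part of the Distance package:

```python
def char_jaccard_distance(s1, s2):
    """Jaccard distance between the sets of characters of two strings."""
    a, b = set(s1), set(s2)
    return 1 - len(a & b) / len(a | b)


# Shared characters: 'e' and 'i'; distinct characters overall: 7
print(char_jaccard_distance("decide", "resize"))  # approximately 0.714
```

To compare lists of usernames rather than characters, pass the lists to a set-based function so that each username is treated as one element.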

Answered by Erwin Scholtens

@Aventinus (I also cannot comment): Note that Jaccard similarity is an operation on sets, so the denominator should also use sets (instead of lists). For example, jaccard_similarity('aa', 'ab') should result in 0.5.

def jaccard_similarity(list1, list2):
    intersection = len(set(list1).intersection(list2))
    union = len(set(list1)) + len(set(list2)) - intersection

    return intersection / union

Note that the intersection does not need to be cast to a list first. Also, the cast to float is not needed in Python 3.
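
The effect of the denominator can be checked with a small side-by-side sketch. The two function names below are illustrative: the first builds the union size from list lengths, as in the first answer, and the second from set sizes, as in this one:

```python
def jaccard_list_lengths(list1, list2):
    """Union size derived from list lengths; duplicates inflate it."""
    intersection = len(set(list1).intersection(list2))
    return intersection / (len(list1) + len(list2) - intersection)


def jaccard_set_lengths(list1, list2):
    """Union size derived from set sizes; duplicates are ignored."""
    intersection = len(set(list1).intersection(list2))
    return intersection / (len(set(list1)) + len(set(list2)) - intersection)


# 'aa' contains a repeated element, so the two versions disagree
print(jaccard_list_lengths('aa', 'ab'))  # 1 / (2 + 2 - 1), about 0.333
print(jaccard_set_lengths('aa', 'ab'))   # 1 / (1 + 2 - 1) = 0.5
```

When neither input contains repeated elements, both functions return the same value.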

Answered by kd88

If you'd like to include repeated elements, you can use Counter, which I would imagine is relatively quick since it's just an extended dict under the hood:

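from collections import Counter

def jaccard_repeats(a, b):
    """Jaccard similarity measure between input iterables,
    allowing repeated elements"""
    _a = Counter(a)
    _b = Counter(b)
    # Multiset symmetric difference: items, with counts, found in only one input
    c = (_a - _b) + (_b - _a)
    n = sum(c.values())
    # Note: identical inputs give n == 0, so this returns 0 rather than 1
    return n/(len(a) + len(b) - n)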

list1 = ['dog', 'cat', 'rat', 'cat']
list2 = ['dog', 'cat', 'rat']
list3 = ['dog', 'cat', 'mouse']

jaccard_repeats(list1, list3)  # returns 0.75

jaccard_repeats(list1, list2)  # returns 0.16666666666666666

jaccard_repeats(list2, list3)  # returns 0.5
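
One common way to define Jaccard similarity for inputs with repeated elements is to use the multiset intersection (minimum count per element) and the multiset union (maximum count per element), which Counter exposes through the & and | operators. The sketch below is an alternative formulation, not a rewrite of jaccard_repeats, and the name jaccard_multiset is illustrative. Unlike a measure built on the symmetric difference, it returns 1.0 for two identical lists:

```python
from collections import Counter


def jaccard_multiset(a, b):
    """Multiset Jaccard: sum of minimum counts over sum of maximum counts."""
    count_a, count_b = Counter(a), Counter(b)
    intersection = sum((count_a & count_b).values())  # element-wise minimum
    union = sum((count_a | count_b).values())         # element-wise maximum
    return intersection / union


list1 = ['dog', 'cat', 'rat', 'cat']
list2 = ['dog', 'cat', 'rat']
list3 = ['dog', 'cat', 'mouse']

print(jaccard_multiset(list1, list1))  # 1.0
print(jaccard_multiset(list1, list2))  # 0.75 (3 shared items out of 4)
print(jaccard_multiset(list1, list3))  # 0.4  (2 shared items out of 5)
```

As with the set-based versions, two empty inputs would divide by zero, so add a guard if that case can occur.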