How do I calculate the Jaccard similarity of two lists containing strings in Python?

Question

Asked by Aventinus

I have two lists with usernames and I want to calculate the Jaccard similarity. Is it possible?

This thread shows how to calculate the Jaccard similarity between two strings; however, I want to apply it to two lists, where each element is a single word (e.g., a username).

Answer 1

Answered by Aventinus

I ended up writing my own solution after all:

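def jaccard_similarity(list1, list2):
    # Number of distinct elements that appear in both lists
    intersection = len(list(set(list1).intersection(list2)))
    # |A| + |B| - |A ∩ B|; this equals the size of the union only if neither list has duplicates
    union = (len(list1) + len(list2)) - intersection
    return float(intersection) / union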

Answer 2

Answered by iamlcc

@aventinus I don't have enough reputation to comment on your answer, but to make things clearer: your solution measures jaccard_similarity, but the function is misnamed as jaccard_distance, which is actually 1 - jaccard_similarity.
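
To make the distinction concrete, here is a minimal sketch that defines both quantities side by side. The usernames are made up for illustration, and computing the similarity on sets is an assumption about how duplicates should be treated. It relies on Python 3 division.

```python
def jaccard_similarity(list1, list2):
    """Size of the intersection divided by the size of the union (on sets)."""
    s1, s2 = set(list1), set(list2)
    return len(s1 & s2) / len(s1 | s2)


def jaccard_distance(list1, list2):
    """The distance is the complement of the similarity."""
    return 1 - jaccard_similarity(list1, list2)


# Hypothetical usernames, for illustration only
users_a = ['alice', 'bob', 'carol']
users_b = ['bob', 'dave']

print(jaccard_similarity(users_a, users_b))  # 0.25 (1 shared name out of 4 distinct)
print(jaccard_distance(users_a, users_b))    # 0.75
```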

Answer 3

Answered by w4bo

For Python 3:

def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return len(s1.intersection(s2)) / len(s1.union(s2))

list1 = ['dog', 'cat', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
print(jaccard_similarity(list1, list2))  # 0.5

For Python 2, use return len(s1.intersection(s2)) / float(len(s1.union(s2))) so that the division is not truncated to an integer.
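
One edge case the set-based version does not cover is two empty lists, where the union is empty and the division raises ZeroDivisionError. The sketch below adds a guard; the function name is hypothetical, and returning 1.0 for two empty inputs is only a convention assumed here (returning 0.0 or raising an error may suit other uses better).

```python
def jaccard_similarity_safe(list1, list2):
    s1, s2 = set(list1), set(list2)
    union = s1 | s2
    if not union:
        # Both inputs are empty. Treating them as identical (1.0) is an
        # assumption made for this sketch, not a universal rule.
        return 1.0
    return len(s1 & s2) / len(union)


print(jaccard_similarity_safe([], []))  # 1.0
print(jaccard_similarity_safe(['dog', 'cat', 'cat', 'rat'],
                              ['dog', 'cat', 'mouse']))  # 0.5
```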

Answer 4

Answered by klaus

Assuming your usernames don't repeat, you can use the same idea:

def jaccard(a, b):
    # a and b must be sets
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

list1 = ['dog', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
# The intersection is ['dog', 'cat']
# The union is ['dog', 'cat', 'rat', 'mouse']
words1 = set(list1)
words2 = set(list2)
print(jaccard(words1, words2))  # 0.5

Answer 5

Answered by LaSul

You can use the Distance library:

# pip install Distance

import distance

distance.jaccard("decide", "resize")

# Returns (note: this value is a Jaccard distance, i.e. 1 - similarity)
0.7142857142857143
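
The value above can be checked by hand with plain sets, which also shows that it is a distance rather than a similarity. The sketch below does not use the Distance library; the helper name is hypothetical, and it assumes the two strings are compared as sets of characters, which is consistent with the result shown.

```python
def char_jaccard_distance(word1, word2):
    # Compare the two strings as sets of characters
    chars1, chars2 = set(word1), set(word2)
    shared = chars1 & chars2
    combined = chars1 | chars2
    return 1 - len(shared) / len(combined)


print(sorted(set("decide") & set("resize")))  # ['e', 'i']
print(len(set("decide") | set("resize")))     # 7
print(round(char_jaccard_distance("decide", "resize"), 4))  # 0.7143
```

The similarity for this pair is 2/7 (about 0.286), so a result of about 0.714 corresponds to 1 - 2/7.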

Answer 6

Answered by Erwin Scholtens

@Aventinus (I also cannot comment): Note that Jaccard similarity is an operation on sets, so the denominator should also use sets (instead of lists). For example, jaccard_similarity('aa', 'ab') should result in 0.5.

def jaccard_similarity(list1, list2):
    intersection = len(set(list1).intersection(list2))
    union = len(set(list1)) + len(set(list2)) - intersection

    return intersection / union

Note that there is no need to cast the intersection to a list before taking its length. The cast to float is also unnecessary in Python 3.
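
The difference between the two denominators can be seen on the 'aa' / 'ab' example. This is a small comparison sketch; the two function names are made up here to tell the variants apart.

```python
def jaccard_with_list_lengths(list1, list2):
    # Denominator built from list lengths: duplicates are counted repeatedly
    intersection = len(set(list1).intersection(list2))
    union = len(list1) + len(list2) - intersection
    return intersection / union


def jaccard_with_sets(list1, list2):
    # Denominator built from set sizes: each distinct element counts once
    intersection = len(set(list1).intersection(list2))
    union = len(set(list1)) + len(set(list2)) - intersection
    return intersection / union


print(jaccard_with_list_lengths('aa', 'ab'))  # 0.3333333333333333
print(jaccard_with_sets('aa', 'ab'))          # 0.5
```

With 'aa', the repeated 'a' inflates the list-based denominator to 3, while the set-based version sees the union {'a', 'b'} of size 2.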

Answer 7

Answered by kd88

If you'd like to include repeated elements, you can use Counter, which I would imagine is relatively quick since it's just an extended dict under the hood:

from collections import Counter

def jaccard_repeats(a, b):
    """Jaccard similarity measure between input iterables,
    allowing repeated elements"""
    _a = Counter(a)
    _b = Counter(b)
    c = (_a - _b) + (_b - _a)
    n = sum(c.values())
    return n/(len(a) + len(b) - n)

list1 = ['dog', 'cat', 'rat', 'cat']
list2 = ['dog', 'cat', 'rat']
list3 = ['dog', 'cat', 'mouse']

print(jaccard_repeats(list1, list3))  # 0.75
print(jaccard_repeats(list1, list2))  # 0.16666666666666666
print(jaccard_repeats(list2, list3))  # 0.5
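
Note that jaccard_repeats as written returns a larger value for the less similar pair (0.75 for list1 and list3) than for the nearly identical pair (about 0.17 for list1 and list2), so it behaves more like a dissimilarity. One alternative, sketched here as an assumption and not as the answer author's method, is the multiset form of the Jaccard index: sum the minimum counts for the intersection and the maximum counts for the union, which Counter exposes through its & and | operators. The function name is hypothetical.

```python
from collections import Counter


def jaccard_multiset(a, b):
    """Multiset Jaccard similarity: repeated elements contribute by count."""
    count_a, count_b = Counter(a), Counter(b)
    intersection = sum((count_a & count_b).values())  # element-wise minimum
    union = sum((count_a | count_b).values())         # element-wise maximum
    return intersection / union


list1 = ['dog', 'cat', 'rat', 'cat']
list2 = ['dog', 'cat', 'rat']
list3 = ['dog', 'cat', 'mouse']

print(jaccard_multiset(list1, list2))  # 0.75
print(jaccard_multiset(list1, list3))  # 0.4
print(jaccard_multiset(list2, list3))  # 0.5
```

Here the nearly identical lists score highest, and when neither input has repeats the result matches the set-based versions above.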
