Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/35639571/

Python Pandas Distance matrix using jaccard similarity

python pandas matrix scipy

Asked by J-H

I have implemented a function to construct a distance matrix using the jaccard similarity:

import pandas as pd
entries = [
    {'id':'1', 'category1':'100', 'category2': '0', 'category3':'100'},
    {'id':'2', 'category1':'100', 'category2': '0', 'category3':'100'},
    {'id':'3', 'category1':'0', 'category2': '100', 'category3':'100'},
    {'id':'4', 'category1':'100', 'category2': '100', 'category3':'100'},
    {'id':'5', 'category1':'100', 'category2': '0', 'category3':'100'}
           ]
df = pd.DataFrame(entries)

and computed the distance matrix with scipy:

from scipy.spatial.distance import pdist, squareform

# pdist returns the pairwise Jaccard distances in condensed (1-D) form
res = pdist(df[['category1', 'category2', 'category3']], 'jaccard')
# squareform expands the condensed vector into a symmetric n x n matrix
distance = pd.DataFrame(squareform(res), index=df.index, columns=df.index)

The problem is that my result looks like this, which seems to be wrong:

[image of the resulting distance matrix, not preserved]

What am I missing? For example, the similarity between rows 0 and 1 should be maximal, and the other values seem wrong too.

Accepted answer by root

Looking at the docs, the implementation of jaccard in scipy.spatial.distance is Jaccard dissimilarity, not similarity. This is the usual way in which distance is computed when using jaccard as a metric. The reason is that, in order to be a metric, the distance between identical points must be zero.
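
To make the distinction concrete, here is a minimal pure-Python sketch of Jaccard dissimilarity for two boolean vectors. The helper name `jaccard_dissimilarity` and the handling of two all-False vectors are assumptions made for illustration; this is not scipy's own code.

```python
def jaccard_dissimilarity(u, v):
    """Jaccard dissimilarity of two equal-length boolean sequences."""
    # Positions where at least one entry is truthy (the "union").
    union = sum(1 for a, b in zip(u, v) if a or b)
    # Positions where exactly one entry is truthy (the disagreements).
    differ = sum(1 for a, b in zip(u, v) if bool(a) != bool(b))
    # Assumed convention: two all-False vectors count as identical.
    return differ / union if union else 0.0


row0 = [True, False, True]   # category1 and category3 present
row2 = [False, True, True]   # category2 and category3 present

d_same = jaccard_dissimilarity(row0, row0)  # identical rows
d_diff = jaccard_dissimilarity(row0, row2)  # rows that disagree in two places
s_diff = 1 - d_diff                         # the corresponding similarity
```

Here `d_same` is 0.0, as a metric requires for identical points, while `d_diff` is 2/3, so the similarity `s_diff` is 1/3.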

In your code, the dissimilarity between rows 0 and 1 should be minimal, which it is. The other values look correct in the context of dissimilarity as well.
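
As a rough check of this reading, the sketch below rebuilds the question's rows and computes the dissimilarity matrix. It assumes the category values are stored as numbers rather than strings, and that a category counts as "present" whenever its value is nonzero; both are simplifications for illustration.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Same rows as in the question, with the category values as numbers.
df = pd.DataFrame(
    {
        'category1': [100, 100, 0, 100, 100],
        'category2': [0, 0, 100, 100, 0],
        'category3': [100, 100, 100, 100, 100],
    }
)

# Assumption: a category is "present" when its value is nonzero.
present = df.astype(bool).to_numpy()

# Condensed pairwise dissimilarities, expanded to a square matrix.
distance = pd.DataFrame(
    squareform(pdist(present, 'jaccard')),
    index=df.index,
    columns=df.index,
)
print(distance.round(3))
```

Rows 0 and 1 are identical, so their dissimilarity is 0; rows 0 and 2 disagree on two of three categories, giving 2/3.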

If you want similarity instead of dissimilarity, just subtract the dissimilarity from 1.

res = 1 - pdist(df[['category1','category2','category3']], 'jaccard')
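
One detail to watch when turning this into a full matrix: `squareform` fills the diagonal with zeros, so expanding `1 - pdist(...)` directly would show each row as having similarity 0 with itself. A possible workaround, sketched here with the question's data recast as booleans (an assumption), is to expand first and subtract afterwards.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Question data, assuming a nonzero value means the category is present.
df = pd.DataFrame(
    {
        'category1': [100, 100, 0, 100, 100],
        'category2': [0, 0, 100, 100, 0],
        'category3': [100, 100, 100, 100, 100],
    }
)
present = df.astype(bool).to_numpy()

# Expand the condensed dissimilarities to a square matrix first ...
dissimilarity = squareform(pdist(present, 'jaccard'))

# ... then subtract from 1, so the diagonal becomes 1 - 0 = 1.
similarity = pd.DataFrame(1 - dissimilarity, index=df.index, columns=df.index)
print(similarity.round(3))
```

With this ordering, every row has similarity 1 with itself and identical rows (such as 0 and 1) also score 1.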