Original question: http://stackoverflow.com/questions/35639571/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Python Pandas Distance matrix using jaccard similarity
Asked by J-H
I have implemented a function to construct a distance matrix using the Jaccard similarity:
import pandas as pd
entries = [
    {'id':'1', 'category1':'100', 'category2': '0', 'category3':'100'},
    {'id':'2', 'category1':'100', 'category2': '0', 'category3':'100'},
    {'id':'3', 'category1':'0', 'category2': '100', 'category3':'100'},
    {'id':'4', 'category1':'100', 'category2': '100', 'category3':'100'},
    {'id':'5', 'category1':'100', 'category2': '0', 'category3':'100'}
]
df = pd.DataFrame(entries)
and computed the distance matrix with scipy:
from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist, jaccard
res = pdist(df[['category1','category2','category3']], 'jaccard')
squareform(res)
distance = pd.DataFrame(squareform(res), index=df.index, columns= df.index)
The problem is that my result looks like this, which seems to be wrong:
What am I missing? For example, the similarity of 0 and 1 should be at its maximum, and the other values seem wrong too.
Accepted answer by root
Looking at the docs, the implementation of jaccard in scipy.spatial.distance is the Jaccard dissimilarity, not the similarity. This is the usual way to compute distance when using Jaccard as a metric: for a function to be a metric, the distance between identical points must be zero.
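To make the distinction concrete, here is a minimal sketch of the Jaccard dissimilarity for boolean vectors, checked against scipy. The helper name jaccard_dissimilarity and the two sample rows are illustrative assumptions (the rows mirror rows 0 and 2 of the question's data, with nonzero values treated as True); this is not scipy's actual implementation.

```python
import numpy as np
from scipy.spatial.distance import jaccard


def jaccard_dissimilarity(u, v):
    """Illustrative helper: share of positions that differ among
    the positions where at least one vector is True."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    union = np.logical_or(u, v).sum()
    if union == 0:
        # Two all-False vectors: treat them as identical.
        return 0.0
    differing = np.logical_xor(u, v).sum()
    return differing / union


# Rows 0 and 2 of the question's data as presence/absence flags.
row0 = np.array([True, False, True])
row2 = np.array([False, True, True])

d_same = jaccard_dissimilarity(row0, row0)  # identical rows
d_diff = jaccard_dissimilarity(row0, row2)  # partly overlapping rows
d_scipy = jaccard(row0, row2)               # scipy's value for the same pair
```

Identical rows give a dissimilarity of 0, which is why the pair (0, 1) in the question shows the smallest value rather than the largest.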
In your code, the dissimilarity between 0 and 1 should be minimal, which it is. The other values look correct in the context of dissimilarity as well.
If you want similarity instead of dissimilarity, just subtract the dissimilarity from 1:
res = 1 - pdist(df[['category1','category2','category3']], 'jaccard')
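One thing to watch when turning this into a square matrix: squareform fills the diagonal with zeros, so passing the already-subtracted condensed vector to it would give every row a self-similarity of 0. A possible way around that, sketched below, is to expand to a square matrix first and subtract afterwards. The boolean array flags is an assumption made for this example: it encodes the question's five rows as presence/absence values (nonzero becomes True).

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# The question's five rows as presence/absence flags (an assumption
# for this sketch: nonzero category values are treated as True).
flags = np.array([
    [True, False, True],
    [True, False, True],
    [False, True, True],
    [True, True, True],
    [True, False, True],
])

# Square dissimilarity matrix; the diagonal is 0 by construction.
dissimilarity = squareform(pdist(flags, metric='jaccard'))

# Subtract after expanding, so the diagonal becomes 1 (each row is
# fully similar to itself).
similarity = pd.DataFrame(1 - dissimilarity)
```

With this ordering, rows 0 and 1 (which are identical) get a similarity of 1, matching what the question expected.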