Python Pandas distance matrix using Jaccard similarity

Question

Asked by J-H

I have implemented a function to construct a distance matrix using the Jaccard similarity:

import pandas as pd
entries = [
    {'id':'1', 'category1':'100', 'category2': '0', 'category3':'100'},
    {'id':'2', 'category1':'100', 'category2': '0', 'category3':'100'},
    {'id':'3', 'category1':'0', 'category2': '100', 'category3':'100'},
    {'id':'4', 'category1':'100', 'category2': '100', 'category3':'100'},
    {'id':'5', 'category1':'100', 'category2': '0', 'category3':'100'}
]
df = pd.DataFrame(entries)

I then compute the distance matrix with scipy:

from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist, jaccard

res = pdist(df[['category1','category2','category3']], 'jaccard')
squareform(res)
distance = pd.DataFrame(squareform(res), index=df.index, columns=df.index)

The problem is that my result looks like this, which seems to be wrong:

What am I missing? For example, the similarity between rows 0 and 1 should be the maximum, and the other values seem wrong too.

Answer 1

Accepted answer by root

Looking at the docs, the implementation of jaccard in scipy.spatial.distance is the Jaccard dissimilarity, not the similarity. This is the usual way to compute distance when using Jaccard as a metric: to be a metric, the distance between identical points must be zero.

In your code, the dissimilarity between rows 0 and 1 should be the minimum, and it is. The other values also look correct when read as dissimilarities.
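To see this on individual pairs, here is a minimal sketch using scipy's jaccard function on rows 0, 1 and 2 of your data. It assumes the values '100' and '0' stand for "present" and "absent", so the rows are written as booleans; the variable names are only for illustration.

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Rows 0, 1 and 2 of the question's data as booleans
# (assumption: '100' means present, '0' means absent).
row0 = np.array([True, False, True])
row1 = np.array([True, False, True])
row2 = np.array([False, True, True])

d_same = jaccard(row0, row1)  # identical rows
d_diff = jaccard(row0, row2)  # rows that disagree in two of three positions
print(d_same, d_diff)
```

Identical rows give a dissimilarity of 0, while rows 0 and 2 disagree in two of the three positions where at least one of them is set, giving 2/3.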

If you want similarity instead of dissimilarity, just subtract the dissimilarity from 1.

res = 1 - pdist(df[['category1','category2','category3']], 'jaccard')
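If you need the full square similarity matrix, one detail is worth keeping in mind: squareform fills the diagonal with zeros, so it is probably safer to subtract from 1 after calling squareform than before. The following is a sketch under the same assumption as above, namely that '100' and '0' can be treated as True and False; the names dissimilarity and similarity are illustrative.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# The question's rows written as booleans
# (assumption: '100' means present, '0' means absent).
df = pd.DataFrame(
    {
        "category1": [True, True, False, True, True],
        "category2": [False, False, True, True, False],
        "category3": [True, True, True, True, True],
    },
    index=["1", "2", "3", "4", "5"],
)

# Square matrix of pairwise Jaccard dissimilarities (zeros on the diagonal).
dissimilarity = squareform(pdist(df.to_numpy(), metric="jaccard"))

# Subtracting after squareform puts 1.0 on the diagonal, as expected
# for the similarity of a row with itself.
similarity = pd.DataFrame(1 - dissimilarity, index=df.index, columns=df.index)
print(similarity)
```

With this ordering, each row has similarity 1 with itself and with any identical row, which matches what you expected for rows 0 and 1.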
