Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/28883303/

Time: 2020-09-13 23:01:03  Source: igfitidea

Calculating similarity between rows of pandas dataframe

python pandas dataframe cosine-similarity

Asked by add-semi-colons

The goal is to identify the top 10 most similar rows for each row in a dataframe.

I start with the following dictionary:

import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine

d = {'0001': [('skiing',0.789),('snow',0.65),('winter',0.56)],'0002': [('drama', 0.89),('comedy', 0.678),('action',-0.42), ('winter',-0.12),('kids',0.12)],'0003': [('action', 0.89),('funny', 0.58),('sports',0.12)],'0004': [('dark', 0.89),('Mystery', 0.678),('crime',0.12), ('adult',-0.423)],'0005': [('cartoon', -0.89),('comedy', 0.678),('action',0.12)],'0006': [('drama', -0.49),('funny', 0.378),('Suspense',0.12), ('Thriller',0.78)],'0007': [('dark', 0.79),('Mystery', 0.88),('crime',0.32), ('adult',-0.423)]}

To put it into a dataframe, I do the following:

col_headers = []
entities = []
for key, scores in d.items():  # Python 3; the original used d.iteritems() (Python 2)
    entities.append(key)
    d[key] = dict(scores)
    col_headers.extend(d[key].keys())
col_headers = list(set(col_headers))

Populate the dataframe:

df = pd.DataFrame(columns=col_headers, index=entities)
for k in d:
    df.loc[k] = pd.Series(d[k])
df.fillna(0.0, axis=1)
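
As an aside, the two construction steps above could probably be collapsed into a single call. The following is a minimal sketch, not the asker's code: it uses a three-entry subset of the dictionary for brevity and assumes `pd.DataFrame.from_dict` with `orient='index'`, which turns each key into a row. Missing tags are still NaN at this point.

```python
import pandas as pd

# Subset of the question's dictionary, shortened for illustration.
d = {
    '0001': [('skiing', 0.789), ('snow', 0.65), ('winter', 0.56)],
    '0003': [('action', 0.89), ('funny', 0.58), ('sports', 0.12)],
    '0005': [('cartoon', -0.89), ('comedy', 0.678), ('action', 0.12)],
}

# Convert each list of (tag, score) pairs into a dict, then let pandas
# align the tags into columns; each key of d becomes one row.
df = pd.DataFrame.from_dict({k: dict(v) for k, v in d.items()}, orient='index')

# Tags that a row does not have are NaN here, exactly as in the loop version.
print(df)
```

Built this way, the columns come out as floats rather than generic objects, which tends to be more convenient for the numeric steps that follow.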

One issue I have at this point in the code, in addition to my main goal, is that my dataframe still has NaN values. This is probably why my result matrix is filled with NaNs.

     Mystery drama  kids winter  funny  snow crime  dark sports Suspense  adult skiing action comedy cartoon Thriller
0004   0.678   NaN   NaN    NaN    NaN   NaN  0.12  0.89    NaN      NaN -0.423    NaN    NaN    NaN     NaN      NaN
0005     NaN   NaN   NaN    NaN    NaN   NaN   NaN   NaN    NaN      NaN    NaN    NaN   0.12  0.678   -0.89      NaN
0006     NaN -0.49   NaN    NaN  0.378   NaN   NaN   NaN    NaN     0.12    NaN    NaN    NaN    NaN     NaN     0.78
0007    0.88   NaN   NaN    NaN    NaN   NaN  0.32  0.79    NaN      NaN -0.423    NaN    NaN    NaN     NaN      NaN
0001     NaN   NaN   NaN   0.56    NaN  0.65   NaN   NaN    NaN      NaN    NaN  0.789    NaN    NaN     NaN      NaN
0002     NaN  0.89  0.12  -0.12    NaN   NaN   NaN   NaN    NaN      NaN    NaN    NaN  -0.42  0.678     NaN      NaN
0003     NaN   NaN   NaN    NaN   0.58   NaN   NaN   NaN   0.12      NaN    NaN    NaN   0.89    NaN     NaN      NaN

To calculate cosine similarity and generate the similarity matrix between rows, I do the following:

data = df.values
m, k = data.shape

mat = np.zeros((m, m))

for i in range(m):  # Python 3; the original used xrange (Python 2)
    for j in range(m):
        if i != j:
            mat[i][j] = cosine(data[i,:], data[j,:])
        else:
            mat[i][j] = 0.

Here is what mat looks like:

[[  0.  nan  nan  nan  nan  nan  nan]
 [ nan   0.  nan  nan  nan  nan  nan]
 [ nan  nan   0.  nan  nan  nan  nan]
 [ nan  nan  nan   0.  nan  nan  nan]
 [ nan  nan  nan  nan   0.  nan  nan]
 [ nan  nan  nan  nan  nan   0.  nan]
 [ nan  nan  nan  nan  nan  nan   0.]]

Assuming the NaN issue gets fixed and mat spits out a meaningful similarity matrix, how can I get output like the following?

{0001:[003,005,002],0002:[0001, 0004, 0007]....}
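
One possible way to produce a mapping of that shape is sketched below. It assumes a NaN-free square matrix in which larger values mean more similar rows; the helper name `top_similar`, the three labels, and the numbers in `sim` are all hypothetical, chosen only to illustrate the idea. For a distance matrix, the sort order would need to be ascending instead.

```python
import numpy as np

def top_similar(sim, labels, n=10):
    """Map each label to the labels of its n most similar other rows."""
    sim = np.array(sim, dtype=float)    # work on a copy of the input
    np.fill_diagonal(sim, -np.inf)      # never report a row as its own match
    n = min(n, len(labels) - 1)         # cannot return more rows than exist
    result = {}
    for i, label in enumerate(labels):
        order = np.argsort(-sim[i])[:n]  # indices sorted by descending similarity
        result[label] = [labels[j] for j in order]
    return result

# Hypothetical similarity values for three rows.
labels = ['0001', '0002', '0003']
sim = [[1.0, 0.2, 0.9],
       [0.2, 1.0, 0.5],
       [0.9, 0.5, 1.0]]

top = top_similar(sim, labels, n=2)
print(top)
```

With the question's data, `labels` would be the dataframe index and `n` would be 10, capped automatically at the number of other rows.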

Answered by Mika

One issue I have at this point in the code, in addition to my main goal, is that my dataframe still has NaN values.

That's because df.fillna does not modify the DataFrame, but returns a new one. Fix it and your result will be fine.
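
A minimal sketch of the difference, using a tiny made-up dataframe rather than the one from the question:

```python
import numpy as np
import pandas as pd

# Tiny illustrative dataframe with missing values.
df = pd.DataFrame({'snow': [0.65, np.nan], 'drama': [np.nan, 0.89]},
                  index=['0001', '0002'])

df.fillna(0.0)                          # result is discarded; df is unchanged
still_has_nan = df.isna().any().any()   # True: the NaNs are still there

df = df.fillna(0.0)                     # assign the returned DataFrame back
print(still_has_nan, df.isna().any().any())
```

Because the question's dataframe is created empty and filled row by row, its columns may end up with object dtype, so casting with `df.astype(float)` before taking `df.values` is a reasonable precaution.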