pandas: Most efficient way to construct a similarity matrix

Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/35758612/

Date: 2020-09-14 00:48:44  Source: igfitidea

Most efficient way to construct similarity matrix

python numpy pandas matrix scipy

Asked by O.rka

I'm using the following links to create a "Euclidean Similarity Matrix" (that I convert to a DataFrame): https://stats.stackexchange.com/questions/53068/euclidean-distance-score-and-similarity and http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.euclidean.html

The way I'm doing it is an iterative approach, which works but takes a while when the datasets are big. The pandas pd.DataFrame.corr() is really fast and useful for Pearson correlations.

How can I compute a Euclidean similarity measure without exhaustive iteration?

My naive code below:

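import numpy as np
import pandas as pd
import scipy as sp
import scipy.spatial.distance  # makes sp.spatial.distance available

#Euclidean Similarity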

#Create DataFrame
DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]}).T
DF_var.columns = ["g1","g2","g3"]
#      g1   g2    g3
# s1  1.2  3.4  10.2
# s2  1.4  3.1  10.7
# s3  2.1  3.7  11.3
# s4  1.5  3.2  10.9

#Create empty matrix to fill
M_euclid = np.zeros((DF_var.shape[1],DF_var.shape[1]))

#Iterate through DataFrame columns to measure euclidean distance
for i in range(DF_var.shape[1]):
    u = DF_var[DF_var.columns[i]]
    for j in range(DF_var.shape[1]):
        v = DF_var[DF_var.columns[j]]
        #Euclidean distance -> Euclidean similarity
        M_euclid[i,j] = (1/(1+sp.spatial.distance.euclidean(u,v)))
DF_euclid = pd.DataFrame(M_euclid,columns=DF_var.columns,index=DF_var.columns)

#           g1        g2        g3
# g1  1.000000  0.215963  0.051408
# g2  0.215963  1.000000  0.063021
# g3  0.051408  0.063021  1.000000

Answered by root

There are two useful functions within scipy.spatial.distance that you can use for this: pdist and squareform. Using pdist will give you the pairwise distance between observations as a one-dimensional array, and squareform will convert this to a distance matrix.

One catch is that pdist uses distance measures by default, not similarity, so you'll need to manually specify your similarity function. Judging by the commented output in your code, your DataFrame is also not in the orientation pdist expects, so I've undone the transpose you did in your code.

import pandas as pd
from scipy.spatial.distance import euclidean, pdist, squareform


def similarity_func(u, v):
    return 1/(1+euclidean(u,v))

DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]})
DF_var.index = ["g1","g2","g3"]

dists = pdist(DF_var, similarity_func)
DF_euclid = pd.DataFrame(squareform(dists), columns=DF_var.index, index=DF_var.index)
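
One thing to watch with this approach: pdist only evaluates each pair of distinct rows, so squareform leaves zeros on the diagonal, whereas the expected output in the question has 1.0 there. A minimal sketch of one way to handle that, assuming you want each row's similarity to itself to be 1 (the variable name sim is only illustrative):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import euclidean, pdist, squareform


def similarity_func(u, v):
    # Euclidean distance -> similarity in (0, 1]
    return 1 / (1 + euclidean(u, v))


DF_var = pd.DataFrame.from_dict({"s1": [1.2, 3.4, 10.2], "s2": [1.4, 3.1, 10.7],
                                 "s3": [2.1, 3.7, 11.3], "s4": [1.5, 3.2, 10.9]})
DF_var.index = ["g1", "g2", "g3"]

# pdist only visits pairs i < j, so squareform fills the diagonal with zeros
sim = squareform(pdist(DF_var.to_numpy(), similarity_func))

# Assumption: self-similarity should be 1 / (1 + 0) = 1
np.fill_diagonal(sim, 1.0)

DF_euclid = pd.DataFrame(sim, columns=DF_var.index, index=DF_var.index)
print(DF_euclid.round(6))
```

Passing a Python callable to pdist is also slower than using a built-in metric name, since the function is called once per pair.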

Answered by Kevin

I think you can just use pdist and squareform to broadcast directly on your DataFrame:

from scipy.spatial.distance import pdist,squareform

In [6]: squareform(pdist(DF_var, metric='euclidean'))

Out[6]:
array([[ 0.        ,  0.6164414 ,  1.4525839 ,  0.78740079],
       [ 0.6164414 ,  0.        ,  1.1       ,  0.24494897],
       [ 1.4525839 ,  1.1       ,  0.        ,  0.87749644],
       [ 0.78740079,  0.24494897,  0.87749644,  0.        ]])
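
Note that the matrix above holds distances between the four rows (s1 to s4) of the transposed DataFrame from the question, not similarities between the g columns. A sketch of how it could be turned into the labelled similarity matrix the question asks for, assuming the 1/(1+d) conversion used in the question. Because the conversion is applied to the whole array at once, the zero diagonal becomes 1 automatically:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Same data and orientation as in the question (rows s1..s4, columns g1..g3)
DF_var = pd.DataFrame.from_dict({"s1": [1.2, 3.4, 10.2], "s2": [1.4, 3.1, 10.7],
                                 "s3": [2.1, 3.7, 11.3], "s4": [1.5, 3.2, 10.9]}).T
DF_var.columns = ["g1", "g2", "g3"]

# pdist compares rows, so transpose to compare the g columns instead
dist = squareform(pdist(DF_var.T.to_numpy(), metric="euclidean"))

# Elementwise distance -> similarity; the 0 diagonal maps to 1
DF_euclid = pd.DataFrame(1 / (1 + dist), index=DF_var.columns, columns=DF_var.columns)
print(DF_euclid.round(6))
```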

Answered by maxymoo

You want scipy.spatial.distance.pdist or sklearn.metrics.pairwise.pairwise_distances.
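
For the scikit-learn option, a minimal sketch might look like the following. It assumes scikit-learn is installed and reuses the data and the 1/(1+d) conversion from the question; pairwise_distances returns the full square matrix directly, so no squareform step is needed:

```python
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

# Same data and orientation as in the question (rows s1..s4, columns g1..g3)
DF_var = pd.DataFrame.from_dict({"s1": [1.2, 3.4, 10.2], "s2": [1.4, 3.1, 10.7],
                                 "s3": [2.1, 3.7, 11.3], "s4": [1.5, 3.2, 10.9]}).T
DF_var.columns = ["g1", "g2", "g3"]

# Like pdist, pairwise_distances compares rows, so pass the transpose
dist = pairwise_distances(DF_var.T.to_numpy(), metric="euclidean")

# Elementwise distance -> similarity
DF_euclid = pd.DataFrame(1 / (1 + dist), index=DF_var.columns, columns=DF_var.columns)
print(DF_euclid.round(6))
```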

Answered by mightypile

The simplest way I can find to get the same result as the OP is to use distance_matrix, also from scipy.spatial. The whole thing can be done in one sort-of-long line.

import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

# Original code from OP, slightly reformatted
DF_var = pd.DataFrame.from_dict({
    "s1":[1.2,3.4,10.2],
    "s2":[1.4,3.1,10.7],
    "s3":[2.1,3.7,11.3],
    "s4":[1.5,3.2,10.9]
}).T
DF_var.columns = ["g1","g2","g3"]

# Whole similarity algorithm in one line
df_euclid = pd.DataFrame(
    1 / (1 + distance_matrix(DF_var.T, DF_var.T)),
    columns=DF_var.columns, index=DF_var.columns
)

#           g1        g2        g3
# g1  1.000000  0.215963  0.051408
# g2  0.215963  1.000000  0.063021
# g3  0.051408  0.063021  1.000000

The code above should copy-paste and run in any python IDE.
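
As a sanity check, here is a small sketch that compares the distance_matrix route with the pdist/squareform route on made-up random data. The array shape and seed are arbitrary assumptions chosen only for illustration:

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)   # arbitrary seed
X = rng.normal(size=(50, 8))     # hypothetical data: 50 observations (rows), 8 features

# Route 1: full distance matrix in one call (Euclidean by default, p=2)
sim_dm = 1 / (1 + distance_matrix(X, X))

# Route 2: condensed pairwise distances expanded to a square matrix
sim_pd = 1 / (1 + squareform(pdist(X, metric="euclidean")))

# Both routes should agree up to floating-point error
print(np.allclose(sim_dm, sim_pd))
```

Both functions compare rows, which is why the answer above passes DF_var.T to get similarities between the g columns.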

Answered by Ha Pham

This is what I did:

import pandas as pd
from scipy.spatial.distance import euclidean

DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]}).T
DF_var.columns = ["g1","g2","g3"]

def m_euclid(v1, v2):
    return (1/(1 + euclidean(v1,v2)))

dist_list = []
for j1 in DF_var.columns:
    dist_list.append([m_euclid(DF_var[j1], DF_var[j2]) for j2 in DF_var.columns])

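# Label rows and columns so the result matches the g1..g3 layout in the question
dist_matrix = pd.DataFrame(dist_list, index=DF_var.columns, columns=DF_var.columns)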
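
The version above still loops over every pair of columns in Python. If avoiding SciPy is the goal, a loop-free alternative is plain NumPy broadcasting. This is a different technique from the one in this answer, sketched here under the assumption that the intermediate array of pairwise differences fits in memory:

```python
import numpy as np
import pandas as pd

DF_var = pd.DataFrame.from_dict({"s1": [1.2, 3.4, 10.2], "s2": [1.4, 3.1, 10.7],
                                 "s3": [2.1, 3.7, 11.3], "s4": [1.5, 3.2, 10.9]}).T
DF_var.columns = ["g1", "g2", "g3"]

X = DF_var.to_numpy().T                    # shape (3, 4): one row per g column

# Broadcasting builds every pairwise difference at once: shape (3, 3, 4)
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))   # Euclidean distances, shape (3, 3)

dist_matrix = pd.DataFrame(1 / (1 + dist), index=DF_var.columns, columns=DF_var.columns)
print(dist_matrix.round(6))
```

The intermediate array grows with the square of the number of columns being compared, so for large inputs pdist or distance_matrix is likely the more memory-friendly choice.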