pandas: Most efficient way to construct similarity matrix
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and authors, and attribute the content to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/35758612/
Asked by O.rka
I'm using the following links to create a "Euclidean Similarity Matrix" (that I convert to a DataFrame): https://stats.stackexchange.com/questions/53068/euclidean-distance-score-and-similarity and http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.euclidean.html
My current approach is iterative. It works, but it takes a while when the datasets are big. The pandas pd.DataFrame.corr() method is really fast and useful for Pearson correlations.
How can I compute a Euclidean similarity measure without exhaustive iteration?
My naive code below:
import numpy as np
import pandas as pd
import scipy as sp
import scipy.spatial.distance  # makes sp.spatial.distance available

# Euclidean Similarity
# Create DataFrame
DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]}).T
DF_var.columns = ["g1","g2","g3"]
#      g1   g2    g3
# s1  1.2  3.4  10.2
# s2  1.4  3.1  10.7
# s3  2.1  3.7  11.3
# s4  1.5  3.2  10.9

# Create empty matrix to fill
M_euclid = np.zeros((DF_var.shape[1],DF_var.shape[1]))

# Iterate through DataFrame columns to measure euclidean distance
for i in range(DF_var.shape[1]):
    u = DF_var[DF_var.columns[i]]
    for j in range(DF_var.shape[1]):
        v = DF_var[DF_var.columns[j]]
        # Euclidean distance -> Euclidean similarity
        M_euclid[i,j] = (1/(1+sp.spatial.distance.euclidean(u,v)))

DF_euclid = pd.DataFrame(M_euclid,columns=DF_var.columns,index=DF_var.columns)
#           g1        g2        g3
# g1  1.000000  0.215963  0.051408
# g2  0.215963  1.000000  0.063021
# g3  0.051408  0.063021  1.000000
Answered by root
There are two useful functions within scipy.spatial.distance that you can use for this: pdist and squareform. Using pdist will give you the pairwise distances between observations as a one-dimensional array, and squareform will convert this to a distance matrix.
One catch is that pdist uses distance measures by default, not similarity, so you'll need to manually specify your similarity function. Judging by the commented output in your code, your DataFrame is also not in the orientation pdist expects, so I've undone the transpose you did in your code.
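import pandas as pd
from scipy.spatial.distance import euclidean, pdist, squareform

def similarity_func(u, v):
    # Map a Euclidean distance in [0, inf) to a similarity in (0, 1]
    return 1/(1+euclidean(u,v))

DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]})
DF_var.index = ["g1","g2","g3"]

# pdist compares rows pairwise and returns a condensed one-dimensional array
dists = pdist(DF_var, similarity_func)
# Note: squareform leaves zeros on the diagonal, so self-similarity shows as 0 here, not 1
DF_euclid = pd.DataFrame(squareform(dists), columns=DF_var.index, index=DF_var.index)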
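As a variation on the idea above, here is a minimal sketch that lets pdist use its built-in "euclidean" metric and applies the 1/(1+d) conversion afterwards on the whole matrix. It assumes the same data and orientation as the answer (rows g1..g3 are compared). The names condensed, D and DF_sim are illustrative only. Because the conversion happens after squareform, the zero diagonal turns into 1, which matches the output in the question.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Same data as above: rows g1..g3 are the observations being compared.
DF_var = pd.DataFrame(
    {"s1": [1.2, 3.4, 10.2], "s2": [1.4, 3.1, 10.7],
     "s3": [2.1, 3.7, 11.3], "s4": [1.5, 3.2, 10.9]},
    index=["g1", "g2", "g3"],
)

# Condensed distances: one entry per unordered pair (g1,g2), (g1,g3), (g2,g3).
condensed = pdist(DF_var.to_numpy(), metric="euclidean")

# Square, symmetric distance matrix with zeros on the diagonal.
D = squareform(condensed)

# Convert distance to similarity element-wise; the zero diagonal becomes 1.
DF_sim = pd.DataFrame(1.0 / (1.0 + D), index=DF_var.index, columns=DF_var.index)
print(DF_sim)
```

Using the built-in metric also avoids calling a Python function once per pair, which is likely to matter on large inputs.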
Answered by Kevin
I think you can just apply pdist and squareform directly to your DataFrame:
from scipy.spatial.distance import pdist,squareform

In [6]: squareform(pdist(DF_var, metric='euclidean'))
Out[6]:
array([[ 0.        ,  0.6164414 ,  1.4525839 ,  0.78740079],
       [ 0.6164414 ,  0.        ,  1.1       ,  0.24494897],
       [ 1.4525839 ,  1.1       ,  0.        ,  0.87749644],
       [ 0.78740079,  0.24494897,  0.87749644,  0.        ]])
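The 4x4 output above compares the rows s1..s4 of the question's transposed DataFrame, not the columns g1..g3. The sketch below, which assumes the question's DF_var, shows both orientations with labels attached. DF_rows and DF_cols are illustrative names.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Transposed frame from the question: rows are s1..s4, columns are g1..g3.
DF_var = pd.DataFrame(
    {"s1": [1.2, 3.4, 10.2], "s2": [1.4, 3.1, 10.7],
     "s3": [2.1, 3.7, 11.3], "s4": [1.5, 3.2, 10.9]}
).T
DF_var.columns = ["g1", "g2", "g3"]

# pdist compares rows, so this is a 4x4 matrix of distances between s1..s4.
DF_rows = pd.DataFrame(
    squareform(pdist(DF_var.to_numpy(), metric="euclidean")),
    index=DF_var.index, columns=DF_var.index,
)

# Transposing first compares g1..g3 instead, giving a 3x3 matrix.
DF_cols = pd.DataFrame(
    squareform(pdist(DF_var.T.to_numpy(), metric="euclidean")),
    index=DF_var.columns, columns=DF_var.columns,
)

print(DF_rows)
print(DF_cols)
```

These are distances, so a conversion such as 1 / (1 + DF_cols) would still be needed to get the similarity the question asks for.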
Answered by maxymoo
You want scipy.spatial.distance.pdist or sklearn.metrics.pairwise.pairwise_distances.
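For the scikit-learn option, a minimal sketch might look like the following. It assumes scikit-learn is installed and that the rows g1..g3 are the items being compared. The 1/(1+d) conversion is carried over from the question, not something pairwise_distances does itself.

```python
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

# Rows g1..g3 are the observations being compared.
DF_var = pd.DataFrame(
    {"s1": [1.2, 3.4, 10.2], "s2": [1.4, 3.1, 10.7],
     "s3": [2.1, 3.7, 11.3], "s4": [1.5, 3.2, 10.9]},
    index=["g1", "g2", "g3"],
)

# pairwise_distances returns the full square matrix directly (no squareform step).
D = pairwise_distances(DF_var.to_numpy(), metric="euclidean")

# Convert distances to similarities as in the question.
DF_sim = pd.DataFrame(1.0 / (1.0 + D), index=DF_var.index, columns=DF_var.index)
print(DF_sim)
```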
Answered by mightypile
The simplest way I can find to get the same result as the OP is to use distance_matrix, also from scipy.spatial. The whole thing can be done in one sort-of-long line.
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

# Original code from OP, slightly reformatted
DF_var = pd.DataFrame.from_dict({
    "s1":[1.2,3.4,10.2],
    "s2":[1.4,3.1,10.7],
    "s3":[2.1,3.7,11.3],
    "s4":[1.5,3.2,10.9]
}).T
DF_var.columns = ["g1","g2","g3"]

# Whole similarity algorithm in one line
df_euclid = pd.DataFrame(
    1 / (1 + distance_matrix(DF_var.T, DF_var.T)),
    columns=DF_var.columns, index=DF_var.columns
)
#           g1        g2        g3
# g1  1.000000  0.215963  0.051408
# g2  0.215963  1.000000  0.063021
# g3  0.051408  0.063021  1.000000
The code above should run as-is when copied and pasted into any Python IDE.
Answered by Ha Pham
This is what I did:
import pandas as pd
from scipy.spatial.distance import euclidean

DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]}).T
DF_var.columns = ["g1","g2","g3"]

def m_euclid(v1, v2):
    # Euclidean distance -> Euclidean similarity
    return (1/(1 + euclidean(v1,v2)))

# Build the matrix row by row, one list of similarities per column of DF_var
dist_list = []
for j1 in DF_var.columns:
    dist_list.append([m_euclid(DF_var[j1], DF_var[j2]) for j2 in DF_var.columns])
dist_matrix = pd.DataFrame(dist_list)