Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/16990996/

Time: 2020-09-13 20:53:35  Source: igfitidea

Multidimensional Scaling Fitting in Numpy, Pandas and Sklearn (ValueError)

Tags: python, numpy, pandas, scikit-learn

Asked by David Williams

I'm trying out multidimensional scaling with sklearn, pandas and numpy. The data file I'm using has 10 numerical columns and no missing values. I am trying to take this ten-dimensional data and visualize it in 2 dimensions with sklearn.manifold's multidimensional scaling, as follows:

import numpy as np
import pandas as pd
from sklearn import manifold
from sklearn.metrics import euclidean_distances

seed = np.random.RandomState(seed=3)
data = pd.read_csv('data/big-file.csv')

# Start small: don't take all the data,
# it's about 200k records
subset = data[:10000]
similarities = euclidean_distances(subset)

mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9, 
      random_state=seed, dissimilarity="precomputed", n_jobs=1)

pos = mds.fit(similarities).embedding_

But I get this ValueError:

Traceback (most recent call last):
  File "demo/mds-demo.py", line 18, in <module>
    pos = mds.fit(similarities).embedding_
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 360, in fit
    self.fit_transform(X, init=init)
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 395, in fit_transform
    eps=self.eps, random_state=self.random_state)
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 242, in smacof
    eps=eps, random_state=random_state)
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 73, in _smacof_single
    raise ValueError("similarities must be symmetric")
ValueError: similarities must be symmetric

I thought euclidean_distances returned a symmetric matrix. What am I doing wrong and how do I fix it?

Answered by Josh Rosen

I ran across the same problem; it turned out that my data was an array of np.float32, and the reduced float precision caused the distance matrix to be asymmetric. I fixed the issue by converting my data to np.float64 before running MDS on it.

Here's an example that uses random data to illustrate the issue:

import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances
from sklearn.datasets import make_classification

data, labels = make_classification()
mds = MDS(n_components=2)

similarities = euclidean_distances(data.astype(np.float64))
print(np.abs(similarities - similarities.T).max())
# Prints 1.7763568394e-15
mds.fit(data.astype(np.float64))
# Succeeds

similarities = euclidean_distances(data.astype(np.float32))
print(np.abs(similarities - similarities.T).max())
# Prints 9.53674e-07
mds.fit(data.astype(np.float32))
# Fails with "ValueError: similarities must be symmetric"
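
Applied to the pipeline in the question, the float64 fix might look roughly like the sketch below. This is an illustration under assumptions, not the answer author's code: random data stands in for the CSV file, the helper name `symmetric_distances` is hypothetical, and averaging the matrix with its transpose is an extra safeguard that the answer does not mention. The `dissimilarity="precomputed"` argument follows the question's code; parameter names may differ in other scikit-learn releases.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances


def symmetric_distances(X):
    """Hypothetical helper: float64 Euclidean distances, forced symmetric."""
    X = np.asarray(X, dtype=np.float64)  # upcast float32 input, as the answer suggests
    D = euclidean_distances(X)
    D = (D + D.T) / 2.0                  # assumption: average away any rounding asymmetry
    np.fill_diagonal(D, 0.0)             # each point is at distance 0 from itself
    return D


rng = np.random.RandomState(3)
data = rng.rand(200, 10).astype(np.float32)  # stand-in for the 10-column data file

similarities = symmetric_distances(data)
max_asymmetry = np.abs(similarities - similarities.T).max()

mds = MDS(n_components=2, max_iter=300, n_init=1,
          random_state=3, dissimilarity="precomputed")
pos = mds.fit(similarities).embedding_
```

Checking `max_asymmetry` before calling `fit` is a cheap way to confirm that the matrix will pass the symmetry test.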

Answered by memecs

I had the same problem a while ago. Another solution, which I believe is much more efficient, is to compute the distances only for the upper triangle of the matrix and then copy them to the lower triangle.

It can be done with scipy as follows:

from scipy.spatial.distance import pdist, squareform

# pdist computes each pairwise distance once; squareform mirrors the result into a square matrix
similarities = squareform(pdist(data, 'euclidean'))
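
To see why this route sidesteps the error, a small check along the following lines may help. It is a sketch with assumed names and random data; the `'euclidean'` metric is chosen here to match `euclidean_distances` from the question. Because each pair's distance is computed once and then written to both halves of the matrix, the result is symmetric by construction, even for float32 input.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(0)
data = rng.rand(500, 10).astype(np.float32)  # float32 on purpose

condensed = pdist(data, "euclidean")  # n*(n-1)/2 distances, one per pair
similarities = squareform(condensed)  # mirrored into an n x n matrix

n = data.shape[0]
is_symmetric = np.array_equal(similarities, similarities.T)
```

The resulting `similarities` array can then be passed to an MDS estimator configured for precomputed dissimilarities, as in the question.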