Multidimensional Scaling Fitting in Numpy, Pandas and Sklearn (ValueError)

Question

Asked by David Williams

I'm trying out multidimensional scaling with sklearn, pandas and numpy. The data file I'm using has 10 numerical columns and no missing values. I am trying to take this ten-dimensional data and visualize it in 2 dimensions with sklearn.manifold's multidimensional scaling as follows:

import numpy as np
import pandas as pd
from sklearn import manifold
from sklearn.metrics import euclidean_distances

seed = np.random.RandomState(seed=3)
data = pd.read_csv('data/big-file.csv')

# start small: don't take all the data,
# it's about 200k records
subset = data[:10000]
similarities = euclidean_distances(subset)

mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9, 
      random_state=seed, dissimilarity="precomputed", n_jobs=1)

pos = mds.fit(similarities).embedding_

But I get this value error:

Traceback (most recent call last):
  File "demo/mds-demo.py", line 18, in <module>
    pos = mds.fit(similarities).embedding_
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 360, in fit
    self.fit_transform(X, init=init)
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 395, in fit_transform
    eps=self.eps, random_state=self.random_state)
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 242, in smacof
    eps=eps, random_state=random_state)
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 73, in _smacof_single
    raise ValueError("similarities must be symmetric")
ValueError: similarities must be symmetric

I thought euclidean_distances returned a symmetric matrix. What am I doing wrong and how do I fix it?

Answer 1

Answered by Josh Rosen

I ran across the same problem; it turned out that my data was an array of np.float32, and the reduced float precision caused the distance matrix to be asymmetric. I fixed the issue by converting my data to np.float64 before running MDS on it.

Here's an example that uses random data to illustrate the issue:

import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances
from sklearn.datasets import make_classification

data, labels = make_classification()
mds = MDS(n_components=2)

similarities = euclidean_distances(data.astype(np.float64))
print(np.abs(similarities - similarities.T).max())
# Prints 1.7763568394e-15
mds.fit(data.astype(np.float64))
# Succeeds

similarities = euclidean_distances(data.astype(np.float32))
print(np.abs(similarities - similarities.T).max())
# Prints 9.53674e-07
mds.fit(data.astype(np.float32))
# Fails with "ValueError: similarities must be symmetric"
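
If converting the input is not enough, or the matrix comes from elsewhere, the result can also be made symmetric explicitly before it is handed to MDS as a precomputed dissimilarity. The following is only a minimal sketch under stated assumptions: the helper name symmetric_distances and the random stand-in data are hypothetical and not from the original posts, and averaging the matrix with its transpose is an extra safeguard, not something the answer above prescribes.

```python
import numpy as np
from sklearn.metrics import euclidean_distances


def symmetric_distances(X):
    """Hypothetical helper: Euclidean distances that are exactly symmetric."""
    # Upcast first, as suggested in the answer, to reduce rounding error.
    X64 = np.asarray(X, dtype=np.float64)
    d = euclidean_distances(X64)
    # Average with the transpose so d[i, j] == d[j, i] holds bit for bit.
    d = (d + d.T) / 2.0
    # A point's distance to itself is zero by definition.
    np.fill_diagonal(d, 0.0)
    return d


# Random float32 data stands in for the CSV subset from the question.
rng = np.random.RandomState(3)
data32 = rng.rand(200, 10).astype(np.float32)

similarities = symmetric_distances(data32)
print(np.abs(similarities - similarities.T).max())
```

The resulting matrix can then be passed to an MDS instance configured for precomputed dissimilarities, as in the question. For a pandas DataFrame, subset.values.astype(np.float64) gives the same kind of input.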

Answer 2

Answered by memecs

I had the same problem a while ago. Another solution, which I believe is much more efficient, is to compute the distances only for the upper triangle of the matrix and then copy them to the lower triangle.

It can be done with scipy as follows:

from scipy.spatial.distance import squareform, pdist
# 'euclidean' matches the euclidean_distances call in the question
similarities = squareform(pdist(data, 'euclidean'))
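
To see why this approach avoids the error, here is a small self-contained sketch. It assumes the plain 'euclidean' metric (to match the question) and uses random float32 data as a stand-in; neither comes from the original answer. pdist returns only the pairwise distances for i < j, and squareform mirrors them into both triangles, so the square matrix is symmetric by construction.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Random float32 data stands in for the real dataset (an assumption).
rng = np.random.RandomState(0)
data = rng.rand(100, 10).astype(np.float32)

# Condensed form: one distance per unordered pair, n * (n - 1) / 2 values.
condensed = pdist(data, 'euclidean')

# Expand to an n x n matrix; each value is copied to [i, j] and [j, i].
similarities = squareform(condensed)

print(condensed.shape)
print(np.array_equal(similarities, similarities.T))
```

Because each distance is computed once and then copied, no rounding difference between the two triangles can arise, whatever the input precision.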
