Python MemoryError when doing fitting with Scikit-learn
Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/16332083/
Asked by Nyxynyx
I am running Python 2.7 (64-bit) on a 64-bit Windows 8 system with 24 GB of memory. When fitting the usual sklearn.linear_model.Ridge, the code runs fine.
Problem: However, when using sklearn.linear_model.RidgeCV(alphas=alphas) for the fitting, I run into the MemoryError shown below on the line rr.fit(X_train, y_train) that executes the fitting procedure.
How can I prevent this error?
Code snippet
from sklearn.linear_model import RidgeCV

def fit(X_train, y_train):
    # Candidate regularization strengths for RidgeCV to cross-validate over
    alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]
    rr = RidgeCV(alphas=alphas)
    rr.fit(X_train, y_train)
    return rr

rr = fit(X_train, y_train)
Error
MemoryError Traceback (most recent call last)
<ipython-input-41-a433716e7179> in <module>()
1 # Fit Training set
----> 2 rr = fit(X_train, y_train)
<ipython-input-35-9650bd58e76c> in fit(X_train, y_train)
3
4 rr = RidgeCV(alphas=alphas)
----> 5 rr.fit(X_train, y_train)
6
7 return rr
C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
696 gcv_mode=self.gcv_mode,
697 store_cv_values=self.store_cv_values)
--> 698 estimator.fit(X, y, sample_weight=sample_weight)
699 self.alpha_ = estimator.alpha_
700 if self.store_cv_values:
C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
608 raise ValueError('bad gcv_mode "%s"' % gcv_mode)
609
--> 610 v, Q, QT_y = _pre_compute(X, y)
611 n_y = 1 if len(y.shape) == 1 else y.shape[1]
612 cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute_svd(self, X, y)
531 def _pre_compute_svd(self, X, y):
532 if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533 X = X.toarray()
534 U, s, _ = np.linalg.svd(X, full_matrices=0)
535 v = s ** 2
C:\Python27\lib\site-packages\scipy\sparse\compressed.pyc in toarray(self, order, out)
559 def toarray(self, order=None, out=None):
560 """See the docstring for `spmatrix.toarray`."""
--> 561 return self.tocoo(copy=False).toarray(order=order, out=out)
562
563 ##############################################################
C:\Python27\lib\site-packages\scipy\sparse\coo.pyc in toarray(self, order, out)
236 def toarray(self, order=None, out=None):
237 """See the docstring for `spmatrix.toarray`."""
--> 238 B = self._process_toarray_args(order, out)
239 fortran = int(B.flags.f_contiguous)
240 if not fortran and not B.flags.c_contiguous:
C:\Python27\lib\site-packages\scipy\sparse\base.pyc in _process_toarray_args(self, order, out)
633 return out
634 else:
--> 635 return np.zeros(self.shape, dtype=self.dtype, order=order)
636
637
MemoryError:
Code
print type(X_train)
print X_train.shape
Result
<class 'scipy.sparse.csr.csr_matrix'>
(183576, 101507)
Answered by kwatford
Take a look at this part of your stack trace:
531 def _pre_compute_svd(self, X, y):
532 if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533 X = X.toarray()
534 U, s, _ = np.linalg.svd(X, full_matrices=0)
535 v = s ** 2
The algorithm you're using relies on numpy's linear algebra routines to do the SVD. Those routines can't handle sparse matrices, so the author simply converts them to regular dense arrays. The first step in that conversion is to allocate an all-zero array and then fill in the appropriate spots with the values stored in the sparse matrix. Sounds easy enough, but let's do the math. A float64 element (the default dtype, which you're probably using if you don't know otherwise) takes 8 bytes. So, based on the array shape you've provided, the new zero-filled array will be:
183576 * 101507 * 8 = 149,073,992,256 ~= 150 gigabytes
Your system's memory manager probably took one look at that allocation request and committed suicide. But what can you do about it?
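As a quick sanity check on that figure, here is a minimal sketch; the shape and dtype simply mirror the output printed in the question:

import numpy as np

n_samples, n_features = 183576, 101507
itemsize = np.dtype(np.float64).itemsize      # 8 bytes per element

dense_bytes = n_samples * n_features * itemsize
print("dense copy would need about %.0f GB" % (dense_bytes / 1e9))   # ~149 GB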
First off, that looks like a fairly ridiculous number of features. I don't know anything about your problem domain or what your features are, but my gut reaction is that you need to do some dimensionality reduction here.
Second, you can try to fix the algorithm's mishandling of sparse matrices. It's choking on numpy.linalg.svd here, so you might be able to use scipy.sparse.linalg.svds instead. I don't know the algorithm in question, but it might not be amenable to sparse matrices. Even if you use the appropriate sparse linear algebra routines, it might produce (or internally use) some non-sparse matrices with sizes similar to your data. Using a sparse matrix representation to represent non-sparse data will only result in using more space than you would have originally, so this approach might not work. Proceed with caution.
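If you do go the dimensionality-reduction route, a minimal sketch using scipy.sparse.linalg.svds on the sparse matrix directly (so the dense array is never built) could look like the following; k=200 is an arbitrary number of components you would need to tune, and X_train / y_train are assumed to be the objects from the question:

from scipy.sparse.linalg import svds
from sklearn.linear_model import RidgeCV

# Truncated SVD computed on the sparse matrix itself; no dense copy is allocated.
U, s, Vt = svds(X_train.astype(float), k=200)

# Project the samples onto the k singular directions: a (n_samples, k) dense
# array, which is small enough to fit in memory.
X_reduced = U * s

rr = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1])
rr.fit(X_reduced, y_train)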
Answered by Mathieu
The relevant option here is gcv_mode. It can take 3 values: "auto", "svd" and "eigen". By default, it is set to "auto", which has the following behavior: use the svd mode if n_samples > n_features, otherwise use the eigen mode.
Since in your case n_samples > n_features, the svd mode is chosen. However, the svd mode currently doesn't handle sparse data properly. scikit-learn should be fixed to use proper sparse SVD instead of the dense SVD.
As a workaround, I would force the eigen mode with gcv_mode="eigen", since this mode should properly handle sparse data. However, n_samples is quite large in your case. Since the eigen mode builds a kernel matrix (and thus has n_samples ** 2 memory complexity), the kernel matrix may not fit in memory. In that case, I would just reduce the number of samples (the eigen mode can handle a very large number of features without problem, though).
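A rough sketch of that workaround, assuming y_train is a NumPy array; n_sub = 20000 is an arbitrary subsample size chosen so that the n_sub x n_sub float64 kernel matrix (about 3.2 GB) fits comfortably in 24 GB of RAM:

import numpy as np
from sklearn.linear_model import RidgeCV

# Subsample rows so the kernel matrix built by the eigen mode fits in memory.
n_sub = 20000
idx = np.random.choice(X_train.shape[0], size=n_sub, replace=False)

# gcv_mode="eigen" works on the sparse matrix directly instead of densifying it.
rr = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1], gcv_mode="eigen")
rr.fit(X_train[idx], y_train[idx])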
In any case, since both n_samples and n_features are quite large, you are pushing this implementation to its limits (even with a proper sparse SVD).
Also see https://github.com/scikit-learn/scikit-learn/issues/1921

