Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/33447808/

Time: 2020-08-19 13:22:51  Source: igfitidea

sklearn's PLSRegression: "ValueError: array must not contain infs or NaNs"

Tags: python, scikit-learn, linear-regression

Asked by Franck Dernoncourt

When using sklearn.cross_decomposition.PLSRegression:

import numpy as np
import sklearn.cross_decomposition

pls2 = sklearn.cross_decomposition.PLSRegression()
xx = np.random.random((5,5))
yy = np.zeros((5,5) ) 

yy[0,:] = [0,1,0,0,0]
yy[1,:] = [0,0,0,1,0]
yy[2,:] = [0,0,0,0,1]
#yy[3,:] = [1,0,0,0,0] # Uncommenting this line solves the issue

pls2.fit(xx, yy)

I get:

C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py:44: RuntimeWarning: invalid value encountered in divide
  x_weights = np.dot(X.T, y_score) / np.dot(y_score.T, y_score)
C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py:64: RuntimeWarning: invalid value encountered in less
  if np.dot(x_weights_diff.T, x_weights_diff) < tol or Y.shape[1] == 1:
C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py:67: UserWarning: Maximum number of iterations reached
  warnings.warn('Maximum number of iterations reached')
C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py:297: RuntimeWarning: invalid value encountered in less
  if np.dot(x_scores.T, x_scores) < np.finfo(np.double).eps:
C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py:275: RuntimeWarning: invalid value encountered in less
  if np.all(np.dot(Yk.T, Yk) < np.finfo(np.double).eps):
Traceback (most recent call last):
  File "C:\svn\hw4\code\test_plsr2.py", line 8, in <module>
    pls2.fit(xx, yy)
  File "C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py", line 335, in fit
    linalg.pinv(np.dot(self.x_loadings_.T, self.x_weights_)))
  File "C:\Anaconda\lib\site-packages\scipy\linalg\basic.py", line 889, in pinv
    a = _asarray_validated(a, check_finite=check_finite)
  File "C:\Anaconda\lib\site-packages\scipy\_lib\_util.py", line 135, in _asarray_validated
    a = np.asarray_chkfinite(a)
  File "C:\Anaconda\lib\site-packages\numpy\lib\function_base.py", line 613, in asarray_chkfinite
    "array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs

What could be the issue?

I am aware of scikit-learn GitHub issue #2089, but since I use scikit-learn 0.16.1 (with Python 2.7.10 x64) this problem should be solved (the code snippets mentioned in the GitHub issue work fine).

Accepted answer by Franck Dernoncourt

The issue is caused by a bug in scikit-learn. I reported it on GitHub: https://github.com/scikit-learn/scikit-learn/issues/2089#issuecomment-152753095

Answer by eickenberg

Please check if any of your values being passed in are NaN or inf:

np.isnan(xx).any()
np.isnan(yy).any()

np.isinf(xx).any()
np.isinf(yy).any()

If any of those yields True, remove the NaN or inf entries. For example, you can replace them with np.nan_to_num, which sets NaN to 0 and infinities to very large finite numbers:

xx = np.nan_to_num(xx)
yy = np.nan_to_num(yy)
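
The checks above can be folded into a single helper. What follows is a minimal sketch and not part of the original answer; the function name `report_nonfinite` is hypothetical. It relies on `np.isfinite`, which is False for NaN and for both infinities, and then shows what `np.nan_to_num` does with its default arguments: NaN becomes 0, while infinities become the largest (or most negative) finite float, not 0.

```python
import numpy as np

def report_nonfinite(name, arr):
    """Print and return how many entries of arr are NaN or +/-inf (hypothetical helper)."""
    arr = np.asarray(arr, dtype=float)
    bad = ~np.isfinite(arr)          # True wherever the entry is NaN or infinite
    count = int(bad.sum())
    print("%s: %d non-finite value(s)" % (name, count))
    return count

# Small made-up array, used only to illustrate the behaviour.
sample = np.array([1.0, np.nan, np.inf, -np.inf])
n_bad = report_nonfinite("sample", sample)

# Default behaviour: NaN -> 0.0, +inf -> largest finite float, -inf -> most negative finite float.
cleaned = np.nan_to_num(sample)
```

Overwriting entries changes the data, so it is probably worth finding out where the non-finite values come from before replacing them.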

It's also possible to feed numpy values so large (positive or negative), or so many zeros, that the equations deep down in the library produce zeros, NaNs or infs. One workaround, oddly enough, is to send in smaller numbers (say, representative numbers between -1 and 1). One way to do this is standardization, see: https://stackoverflow.com/a/36390482/445131
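
As a rough illustration of the standardization idea, here is a minimal NumPy sketch. It is an assumption about what such rescaling could look like, not the code from the linked answer; the helper name `standardize_columns` and the sample data are made up. Each column is shifted to zero mean and divided by its standard deviation, with constant columns guarded against division by zero.

```python
import numpy as np

def standardize_columns(arr):
    """Return arr with each column rescaled to zero mean and unit variance (hypothetical helper)."""
    arr = np.asarray(arr, dtype=float)
    mean = arr.mean(axis=0)
    std = arr.std(axis=0)
    std[std == 0.0] = 1.0            # constant column: leave it at zero instead of dividing by zero
    return (arr - mean) / std

# Made-up data: one column with very large values, one constant column.
big = np.array([[1e12, 5.0],
                [3e12, 5.0],
                [5e12, 5.0]])
scaled = standardize_columns(big)
```

After rescaling, the large column has values of order 1 and the constant column is all zeros.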

If none of that solves the problem, then you may be dealing with a low-level bug in the library you're using, or some sort of singularity in your data. Create an SSCCE and post it to Stack Overflow, or file a new bug report with the maintainers of the library.

Answer by Charles Chow

I can reproduce the same bug. I silenced it by replacing all 0s with a small positive threshold:

threshold_for_bug = 0.00000001 # could be any value, ex numpy.min
xx[xx < threshold_for_bug] = threshold_for_bug

This silences the bug (I never checked the precision difference).
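
The same flooring can be sketched with `np.maximum`, which returns a new array instead of modifying `xx` in place. This is only an illustrative variant with made-up data; like the snippet above, it also raises any negative entries to the threshold, which may or may not be what you want.

```python
import numpy as np

threshold_for_bug = 1e-8             # same value as 0.00000001 above

# Made-up input containing exact zeros.
xx = np.array([[0.0, 0.5],
               [0.25, 0.0]])

# Element-wise maximum: every entry below the threshold is replaced by the threshold.
floored = np.maximum(xx, threshold_for_bug)
```

Keeping the original array untouched makes it easier to compare results with and without the workaround.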

My system info:

numpy-1.11.2
python-3.5
macOS Sierra

Answer by Skippy le Grand Gourou

You may want to check your weights for negative values, since this error will also be triggered with negative weights.

Answer by Warlax56

I found a tricky little solution that worked for me.

I was doing time series featurization through cesium with this code:

timeInput = np.array(timeData)
valueInput = np.array(data)

#Featurizing Data
featurizedData = featurize.featurize_time_series(times=timeInput,
                                                     values=valueInput,
                                                     errors=None,
                                                     features_to_use=featuresToUse)

which was resulting in this error:

ValueError: array must not contain infs or NaNs

For laughs, I checked the lengths and types of the data:

data:
70
<class 'numpy.int32'>

timeData: 
70
<class 'numpy.float64'>

which made sense, because my times were calculated from delta data in ms.

I decided I'd try to convert data types with this one line of code:

valueInput = valueInput.astype(float)

and it worked, resulting in this code:

timeInput = np.array(timeData)
valueInput = np.array(data)
valueInput = valueInput.astype(float)  # match the float64 dtype of timeInput

#Featurizing Data
featurizedData = featurize.featurize_time_series(times=timeInput,
                                                 values=valueInput,
                                                 errors=None,
                                                 features_to_use=featuresToUse)

If you're getting an error like this, give matching datatypes a shot.
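
A small sketch of the dtype check and conversion described in this answer, using made-up arrays in place of the real time series. The helper name `as_float_array` is hypothetical; the point is only that an int32 array can be converted so that both inputs share the float64 dtype.

```python
import numpy as np

def as_float_array(values):
    """Return values as a float64 NumPy array (hypothetical helper)."""
    return np.asarray(values, dtype=np.float64)

# Made-up stand-ins for the time and value arrays.
times = np.array([0.0, 1.5, 3.0])                 # float64, like the computed times
values = np.array([3, 7, 9], dtype=np.int32)      # int32, like the raw data

print(times.dtype, values.dtype)                  # inspect the dtypes before converting
values_f = as_float_array(values)                 # now the same dtype as times
```

The conversion returns a new array, so the original integer data stays available if it is needed elsewhere.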
