Python sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/31323499/

Date: 2020-08-19 09:48:14  Source: igfitidea

sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Tags: python, python-2.7, scikit-learn, valueerror

Asked by Ethan Waldie

I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I have run

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

I tried using

mat[np.isfinite(mat) == True] = 0

to remove the infinite values but this did not work either. What can I do to get rid of the infinite values in my matrix, so that I can use the affinity propagation algorithm?

I am using anaconda and python 2.7.9.

Accepted answer by Marcus Müller

This might happen inside scikit, and it depends on what you're doing. I recommend reading the documentation for the functions you're using. You might be using one that requires, for example, your matrix to be positive definite, and your matrix might not fulfill that criterion.

EDIT: How could I miss that:

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

is obviously wrong. The correct checks are:

np.any(np.isnan(mat))

and

np.all(np.isfinite(mat))

You want to check whether any of the elements is NaN, not whether the return value of the any function is a number...

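To make the difference concrete, here is a minimal sketch using a small made-up matrix (the name mat and its values are assumptions for illustration only). It also suggests why the mask in the question did not help: np.isfinite(mat) == True selects the finite entries, so the non-finite ones have to be selected with ~np.isfinite(mat) instead.

```python
import numpy as np

# Hypothetical matrix containing one NaN and one infinity
mat = np.array([[1.0, np.nan], [np.inf, 4.0]])

# Buggy checks: mat.any() and mat.all() collapse the matrix to a single
# boolean first, so isnan/isfinite only ever inspect True or False.
buggy_nan_check = bool(np.isnan(mat.any()))
buggy_finite_check = bool(np.isfinite(mat.all()))

# Correct checks: test every element first, then reduce.
has_nan = bool(np.any(np.isnan(mat)))
all_finite = bool(np.all(np.isfinite(mat)))

# Replace only the non-finite entries (NaN, +Inf, -Inf) with 0.
cleaned = mat.copy()
cleaned[~np.isfinite(cleaned)] = 0

print(buggy_nan_check, buggy_finite_check)  # False True
print(has_nan, all_finite)                  # True False
print(cleaned)
```

Whether 0 is a sensible replacement value depends on the data; dropping the affected rows is another option.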

Answer by Ethan Waldie

The dimensions of my input array were skewed because my input CSV had empty spaces.

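As an illustration of how gaps in a CSV can end up as NaN, here is a small sketch. It assumes the file is loaded with pandas (the answer does not say how it was read), and the CSV text and column names are made up.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV text: the second data row has an empty field
csv_text = "a,b,c\n1,2,3\n4,,6\n7,8,9\n"

df = pd.read_csv(io.StringIO(csv_text))

# pandas reads the empty field as NaN
missing_per_column = df.isna().sum()
print(missing_per_column)

# One option: drop incomplete rows before handing the data to sklearn
X = df.dropna().to_numpy(dtype=np.float64)
print(X.shape)  # (2, 3)
```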

Answer by tuxdna

This is the check on which it fails, _assert_all_finite in sklearn/utils/validation.py:

def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)

So make sure that your input contains no NaN values, that all the values are actually floats, and that none of them is Inf.

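The comment in the quoted function describes a two-step strategy: try a cheap sum first, and fall back to an element-wise test only if the sum is not finite. The sketch below re-implements that logic in a standalone helper to show why the fallback matters. The name is_all_finite is hypothetical and is not part of scikit-learn's API.

```python
import numpy as np


def is_all_finite(X):
    """Hypothetical helper mirroring the two-step check quoted above."""
    X = np.asanyarray(X)
    if X.dtype.char not in np.typecodes['AllFloat']:
        return True  # non-float arrays are not checked, as in the original
    with np.errstate(over='ignore', invalid='ignore'):
        total = X.sum()
    # Fast path: a finite sum implies that every element is finite.
    if np.isfinite(total):
        return True
    # Slow path: the sum may merely have overflowed, so test each element.
    return bool(np.isfinite(X).all())


print(is_all_finite(np.array([1.0, 2.0])))      # True
print(is_all_finite(np.array([1.0, np.nan])))   # False
print(is_all_finite(np.array([1e308, 1e308])))  # True: only the sum overflows
```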

Answer by tekumara

I had the same error, and in my case X and y were dataframes so I had to convert them to matrices first:

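# as_matrix() was removed in pandas 1.0 and np.float in NumPy 1.24;
# to_numpy() with an explicit dtype is the current equivalent.
X = X.to_numpy(dtype=np.float64)
y = y.to_numpy(dtype=np.float64)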

Answer by Raphvanns

With this version of Python 3:

/opt/anaconda3/bin/python --version
Python 3.6.0 :: Anaconda 4.3.0 (64-bit)

Looking at the details of the error, I found the lines of code causing the failure:

/opt/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     56             and not np.isfinite(X).all()):
     57         raise ValueError("Input contains NaN, infinity"
---> 58                          " or a value too large for %r." % X.dtype)
     59 
     60 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

From this, I was able to extract the correct way to test my data, using the same test that fails in the traceback: np.isfinite(X)

Then, with a quick and dirty loop, I found that my data does indeed contain NaNs:

print(p[:, 0].shape)
# Print the position and value of every non-finite entry in the first column
for index, value in enumerate(p[:, 0]):
    if not np.isfinite(value):
        print(index, value)

(367340,)
4454 nan
6940 nan
10868 nan
12753 nan
14855 nan
15678 nan
24954 nan
30251 nan
31108 nan
51455 nan
59055 nan
...

Now all I have to do is remove the values at these indexes.

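For larger arrays, the same search and the removal step can be done without a Python loop. This is a sketch on a small made-up array standing in for p; the values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical stand-in for the array p used above
p = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0], [np.inf, 40.0]])

# Indices of the non-finite entries in the first column, without a loop
bad_rows = np.flatnonzero(~np.isfinite(p[:, 0]))
print(bad_rows)  # [1 3]

# Keep only the rows whose first-column value is finite
p_clean = p[np.isfinite(p[:, 0])]
print(p_clean.shape)  # (2, 2)
```

If other columns can also hold NaN or Inf, np.isfinite(p).all(axis=1) gives a row mask that covers every column.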

Answer by Boern

This is my function (based on this) to clean the dataset of nan, Inf, and missing cells (for skewed datasets):

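import numpy as np
import pandas as pd

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)  # drops rows with NaN or missing cells, in place
    # Keep only the rows that contain no remaining NaN or +/-Inf values
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)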

Answer by Jun Wang

I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:

df = df.reset_index()

I encountered this issue many times when I removed some entries in my df, such as

df = df[df.label=='desired_one']

Answer by Elias Strehle

I had the error after trying to select a subset of rows:

df = df.reindex(index=my_index)

It turned out that my_index contained values that were not in df.index, so the reindex function inserted some new rows and filled them with NaN.

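The behaviour is easy to reproduce on a toy frame. In this sketch the data and labels are made up, and using Index.intersection to keep only existing labels is one possible workaround rather than the fix the answer used.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
my_index = ["a", "c", "d"]  # "d" is not in df.index

# reindex adds a row for the unknown label "d" and fills it with NaN
reindexed = df.reindex(index=my_index)
print(reindexed)

# One way to select only the labels that actually exist
existing = df.index.intersection(my_index)
subset = df.loc[existing]
print(subset)
```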

Answer by Cohen

I got the same error. It worked after calling df.fillna(-99999, inplace=True) before doing any replacement, substitution, etc.

Answer by luca

In my case the problem was that many scikit functions return numpy arrays, which carry no pandas index. So there was an index mismatch when I used those numpy arrays to build new DataFrames and then tried to mix them with the original data.

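A small sketch of how such a mismatch produces NaN. The frame, the filter, and the scaled array (a stand-in for the output of a scikit-learn transformer) are all assumptions for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "label": ["a", "b", "a", "b"]})
df = df[df.label == "a"]  # the index is now [0, 2]

# Stand-in for a plain array returned by a scikit-learn transformer
scaled = df[["x"]].to_numpy() * 10.0

# Wrapping the array without an index gives it a fresh index [0, 1],
# so pandas aligns on labels and introduces NaN where they do not match.
misaligned = pd.DataFrame(scaled, columns=["x_scaled"])
mixed_bad = pd.concat([df, misaligned], axis=1)

# Reusing the original index keeps the rows aligned.
aligned = pd.DataFrame(scaled, columns=["x_scaled"], index=df.index)
mixed_ok = pd.concat([df, aligned], axis=1)

print(mixed_bad)
print(mixed_ok)
```

Resetting the index of df before building the new frame, as another answer on this page does, is an alternative way to keep the two frames aligned.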