mean() of a Pandas DataFrame column returns inf: how can I solve this?

Question

Asked by Augusto Ribas

I'm trying to implement some machine learning algorithms, but I'm having some difficulties putting the data together.

In the example below, I load an example dataset from UCI, remove lines with missing data (thanks to the help from a previous question), and now I would like to normalize the data.

For many datasets, I just used:

valores = (valores - valores.mean()) / (valores.std())

But for this particular dataset the approach above doesn't work. The problem is that the mean function is returning inf, perhaps due to a precision issue. See the example below:

import pandas as pd

bcw = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)

# Drop rows that contain '?' (the missing-value marker) in any non-integer column
for col in bcw.columns:
    if bcw[col].dtype != 'int64':
        print("Removing possible '?' in column %s..." % col)
        bcw = bcw[bcw[col] != '?']

valores = bcw.iloc[:, 1:10]
# mean() returns inf here
print(valores.iloc[:, 5].mean())

My question is how to deal with this. It seems that I need to change the type of this column, but I don't know how to do it.

Answer 1

Accepted answer by Dave

I'm not so familiar with pandas, but it works if you convert the column to a NumPy array. Try:

np.asarray(valores.iloc[:, 5], dtype=float).mean()
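
As a small illustration of this conversion, here is a minimal sketch on toy data. The Series `col` is a hypothetical stand-in for the problematic column, assuming its values are numbers stored as strings (dtype object):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the problematic column: numbers stored as strings.
col = pd.Series(["1", "10", "2", "4"], dtype=object)

# Asking NumPy for a float array parses each string into a number,
# so the mean is computed arithmetically.
as_float = np.asarray(col, dtype=float)
print(as_float.dtype)
print(as_float.mean())
```

Note that the builtin `float` is used as the dtype here; the old `np.float` alias is no longer available in recent NumPy releases.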

Answer 2

Answer by ali_m

NaN values should not matter when computing the mean of a pandas.Series. Precision is also irrelevant. The only explanation I can think of is that one of the values in valores is equal to infinity.

You could exclude any values that are infinite when computing the mean like this:

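import numpy as np

# Boolean mask marking the infinite entries of the sixth column (by position)
is_inf = valores.iloc[:, 5] == np.inf
# .ix has been removed from pandas; select by position, then apply the mask
valores.iloc[:, 5][~is_inf].mean()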
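
A variation on the same idea, sketched on toy data: `np.isfinite` masks out both positive and negative infinity (and NaN) in one step. The Series `col` below is a made-up example and assumes the column is already numeric:

```python
import numpy as np
import pandas as pd

# Made-up numeric column containing both +inf and -inf.
col = pd.Series([1.0, 2.0, np.inf, 3.0, -np.inf])

# Keep only the finite entries before averaging.
finite = col[np.isfinite(col)]
print(finite.mean())
```

This only helps when the column really holds floating-point infinities; it does not apply if the values are strings.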

Answer 3

Answer by gil.fernandes

If the elements of the pandas Series are strings, you get inf as the mean result. In this specific case you can simply convert the Series elements to float and then calculate the mean. No need to use numpy.

Example:

valores.iloc[:,5].astype(float).mean()
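
Putting this together with the normalization from the question, one possible approach is to keep the column from being read as strings in the first place. The sketch below is only an illustration: the inline CSV is made-up toy data standing in for the UCI file, and it assumes '?' is the only missing-value marker. The `na_values` parameter of `pd.read_csv` turns '?' into NaN at load time, so the column is parsed as float:

```python
import io
import pandas as pd

# Made-up toy data in the same shape as the question's file: an id column
# followed by feature columns, with '?' marking a missing value.
raw = io.StringIO("1000,5,1,1\n1001,3,?,2\n1002,8,10,4\n1003,4,2,2\n")

# Treat '?' as NaN while reading, then drop the incomplete rows.
df = pd.read_csv(raw, header=None, na_values="?")
df = df.dropna()

# Select the feature columns by position and standardize them.
valores = df.iloc[:, 1:3].astype(float)
normalized = (valores - valores.mean()) / valores.std()
print(normalized)
```

With the question's real data, the same `na_values="?"` argument could be passed to the original `read_csv` call instead of filtering the '?' rows afterwards.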

Answer 4

Answer by BrotherHyman

I had the same problem with a column that was of dtype 'O' (object), and whose max value was 9999. Have you tried using the convert_objects method with the convert_numeric=True parameter? This fixed the problem for me.
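
In recent pandas releases convert_objects is no longer available; `pd.to_numeric` is the closest current equivalent. A minimal sketch, assuming an object column of numeric strings with a stray '?' (the Series `col` is made-up toy data):

```python
import pandas as pd

# Made-up object column: numeric strings plus one unparseable marker.
col = pd.Series(["3", "9999", "?", "12"], dtype=object)

# errors="coerce" turns anything that cannot be parsed into NaN,
# which mean() then skips by default.
numeric = pd.to_numeric(col, errors="coerce")
print(numeric.dtype)
print(numeric.mean())
```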
