Python: Replace invalid values with None in a Pandas DataFrame

Disclaimer: this page is an English rendering of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/17097236/


Replace invalid values with None in Pandas DataFrame

Tags: python, pandas, dataframe, replace, nan

Asked by Blaszard

Is there any method to replace values with None in Pandas in Python?

You can use df.replace('pre', 'post') to replace one value with another, but this can't be done if you want to replace a value with None; if you try, you get a strange result.

So here's an example:


df = DataFrame(['-',3,2,5,1,-5,-1,'-',9])
df.replace('-', 0)

which returns a successful result.


But,


df.replace('-', None)

which returns the following result:

0
0   - // this isn't replaced
1   3
2   2
3   5
4   1
5  -5
6  -1
7  -1 // this is changed to `-1`...
8   9

Why is such a strange result returned?

Since I want to load this data frame into a MySQL database, I can't put NaN values into any element of my data frame and instead want to put None. Surely, you can first change '-' to NaN and then convert NaN to None, but I want to know why the dataframe acts in such a terrible way.
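
As a sketch of that two-step route (my own illustration, not part of the original question; the exact behaviour of where with None has shifted between pandas releases):

import numpy as np
import pandas as pd

df = pd.DataFrame(['-', 3, 2, 5, 1, -5, -1, '-', 9])
df = df.replace('-', np.nan)                       # step 1: mark invalid values as NaN
df = df.astype(object).where(df.notnull(), None)   # step 2: swap NaN for None before the DB insert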

Tested on pandas 0.12.0 dev on Python 2.7 and OS X 10.8. Python is the pre-installed version on OS X, and I installed pandas using the SciPy Superpack script, for your information.

Accepted answer by Andy Hayden

Actually in later versions of pandas this will give a TypeError:


df.replace('-', None)
TypeError: If "to_replace" and "value" are both None then regex must be a mapping
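
For what it's worth, the strange padded result in the question comes from older pandas treating value=None as "use the fill method" (pad by default), so each '-' was filled with the value of the row above it; that is why index 7 became -1 while index 0, with nothing above it, stayed '-'. A rough equivalent, assuming a pandas version that still accepts the method argument:

df.replace('-', method='pad')   # fills each '-' with the value just above it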

You can do it by passing either a list or a dictionary:


In [11]: df.replace(['-'], [None])  # or .replace('-', {0: None})
Out[11]:
      0
0  None
1     3
2     2
3     5
4     1
5    -5
6    -1
7  None
8     9

But I recommend using NaNs rather than None:


In [12]: df.replace('-', np.nan)
Out[12]:
     0
0  NaN
1    3
2    2
3    5
4    1
5   -5
6   -1
7  NaN
8    9

Answered by user2966041

where is probably what you're looking for. So

data = data.where(data != '-', None)

From the pandas docs:

where [returns] an object of the same shape as self whose corresponding entries are from self where cond is True and otherwise are from other.
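
To make that cond/other wording concrete, a tiny sketch of my own (not from the answer):

import pandas as pd

s = pd.Series(['-', 3, 2])
print(s.where(s != '-', None))   # index 0 becomes None; 3 and 2 satisfy the condition and are kept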

Answered by Michael Dorner

I prefer the solution using replace with a dict because of its simplicity and elegance:

df.replace({'-': None})

You can also have more replacements:


df.replace({'-': None, 'None': None})

And even for larger replacements, it is always obvious and clear what is replaced by what, which in my opinion is much harder to see with long lists.

Answered by Shravan kp

import pandas as pd

df = pd.DataFrame(['-',3,2,5,1,-5,-1,'-',9])
df = df.where(df!='-', None)

Answered by Freek Wiekmeijer

Setting null values can be done with np.nan:


import numpy as np
df.replace('-', np.nan)

The advantage is that df.last_valid_index() recognizes these as invalid.
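
A small illustration of my own (not from the answer):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, '-']).replace('-', np.nan)
print(s.last_valid_index())   # 1 -- the NaN in the last row is not counted as valid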

Answered by cs95

Before proceeding with this post, it is important to understand the difference between NaN and None. One is a float type, the other is an object type. Pandas is better suited to working with scalar types as many methods on these types can be vectorised. Pandas does try to handle None and NaN consistently, but NumPy cannot.

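A quick sketch of that type difference (my own illustration, not from the answer):

import numpy as np
import pandas as pd

print(type(np.nan))                    # <class 'float'>
print(pd.Series([1.0, np.nan]).dtype)  # float64 -- the column stays a fast numeric block
print(pd.Series(['a', None]).dtype)    # object  -- None forces a slower object column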

My suggestion (and Andy's) is to stick with NaN.


But to answer your question...


pandas >= 0.18: Use the na_values=['-'] argument with read_csv

If you loaded this data from CSV/Excel, I have good news for you. You can quash this at the root during data loading instead of having to write a fix with code as a subsequent step.


Most of the pd.read_* functions (such as read_csv and read_excel) accept a na_values argument.

file.csv


A,B
-,1
3,-
2,-
5,3
1,-2
-5,4
-1,-1
-,0
9,0

Now, to convert the - characters into NaNs, do:

import pandas as pd
df = pd.read_csv('file.csv', na_values=['-'])
df

     A    B
0  NaN  1.0
1  3.0  NaN
2  2.0  NaN
3  5.0  3.0
4  1.0 -2.0
5 -5.0  4.0
6 -1.0 -1.0
7  NaN  0.0
8  9.0  0.0

And similar for other functions/file formats.

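For instance, the same idea with read_excel (the file name here is hypothetical):

# assumption: file.xlsx contains the same data as file.csv above
df = pd.read_excel('file.xlsx', na_values=['-'])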

P.S.: On v0.24+, you can preserve integer type even if your column has NaNs (yes, talk about having your cake and eating it too). You can specify dtype='Int32':

df = pd.read_csv('file.csv', na_values=['-'], dtype='Int32')
df

     A    B
0  NaN    1
1    3  NaN
2    2  NaN
3    5    3
4    1   -2
5   -5    4
6   -1   -1
7  NaN    0
8    9    0

df.dtypes

A    Int32
B    Int32
dtype: object

The dtype is not a conventional int type... but rather, a Nullable Integer Type. There are other options.



Handling Numeric Data: pd.to_numeric with errors='coerce'

If you're dealing with numeric data, a faster solution is to use pd.to_numeric with the errors='coerce' argument, which coerces invalid values (values that cannot be cast to numeric) to NaN.

pd.to_numeric(df['A'], errors='coerce')

0    NaN
1    3.0
2    2.0
3    5.0
4    1.0
5   -5.0
6   -1.0
7    NaN
8    9.0
Name: A, dtype: float64

To retain (nullable) integer dtype, use


pd.to_numeric(df['A'], errors='coerce').astype('Int32')

0    NaN
1      3
2      2
3      5
4      1
5     -5
6     -1
7    NaN
8      9
Name: A, dtype: Int32 

To coerce multiple columns, use apply:


df[['A', 'B']].apply(pd.to_numeric, errors='coerce').astype('Int32')

     A    B
0  NaN    1
1    3  NaN
2    2  NaN
3    5    3
4    1   -2
5   -5    4
6   -1   -1
7  NaN    0
8    9    0

...and assign the result back after.

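That assignment back might look like this (a minimal sketch, reusing the df from the read_csv example above):

df[['A', 'B']] = df[['A', 'B']].apply(pd.to_numeric, errors='coerce').astype('Int32')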

More information can be found in this answer.


Answered by daniel rocha

Using replace and assigning a new df:


import pandas as pd
df = pd.DataFrame(['-',3,2,5,1,-5,-1,'-',9])
dfnew = df.replace('-', 0)
print(dfnew)


(venv) D:\assets>py teste2.py
   0
0  0
1  3
2  2
3  5
4  1
5 -5
6 -1
7  0
8  9

Answered by Keng Chan

df.replace('-', np.nan).astype("object")

This will ensure that you can use isnull() later on your dataframe.
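
A short usage sketch of my own (not from the answer):

import numpy as np
import pandas as pd

df = pd.DataFrame(['-', 3, 2, 5, 1, -5, -1, '-', 9])
out = df.replace('-', np.nan).astype("object")
print(out.isnull().sum())   # counts the two cells that held '-'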