Convert pandas.Series from dtype object to float, and errors to NaNs
Source: http://stackoverflow.com/questions/25952790/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by Korem
Consider the following situation:
In [2]: a = pd.Series([1,2,3,4,'.'])
In [3]: a
Out[3]:
0 1
1 2
2 3
3 4
4 .
dtype: object
In [8]: a.astype('float64', raise_on_error = False)
Out[8]:
0 1
1 2
2 3
3 4
4 .
dtype: object
I would have expected an option that allows conversion while turning erroneous values (such as that .) to NaNs. Is there a way to achieve this?
Accepted answer by cs95
Use pd.to_numeric with errors='coerce'.
# Setup
import pandas as pd
s = pd.Series(['1', '2', '3', '4', '.'])
s
0 1
1 2
2 3
3 4
4 .
dtype: object
pd.to_numeric(s, errors='coerce')
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
dtype: float64
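As a complement to the call above, here is a minimal sketch of how one might also find out which entries were coerced. The sample values and the names `converted` and `failed` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data: parseable strings, junk, and a genuine missing value.
s = pd.Series(['1', '2.5', 'abc', None, '.'], dtype=object)

# Unparseable entries become NaN instead of raising an error.
converted = pd.to_numeric(s, errors='coerce')

# Entries that were present in the input but could not be parsed:
# NaN after conversion, yet not missing before it.
failed = s[converted.isna() & s.notna()]

print(converted)
print(failed.tolist())
```

Comparing the NaN mask after conversion with the missing-value mask before it separates values that were already missing from values that failed to parse.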
If you need the NaNs filled in, use Series.fillna.
pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')
0 1
1 2
2 3
3 4
4 0
dtype: int64
Note that downcast='infer' will attempt to downcast floats to integers where possible. Remove the argument if you don't want that.
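If you prefer not to rely on the downcast argument (to my knowledge it is deprecated in recent pandas 2.x releases, so treat that as an assumption to check against your version), a rough equivalent is to fill the NaNs and then cast explicitly. The name `filled` is only illustrative.

```python
import pandas as pd

s = pd.Series(['1', '2', '3', '4', '.'])

# Coerce to float, replace NaN with 0, then cast explicitly to a 64-bit integer.
filled = pd.to_numeric(s, errors='coerce').fillna(0).astype('int64')

print(filled)
```

The explicit cast only makes sense when every remaining value is a whole number; with fractional values the cast would truncate them.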
From v0.24 onward, pandas provides a Nullable Integer type, which allows integers to coexist with NaNs. If you have integers in your column, you can use
pd.__version__
# '0.24.1'
pd.to_numeric(s, errors='coerce').astype('Int32')
0 1
1 2
2 3
3 4
4 NaN
dtype: Int32
There are other options to choose from as well; read the docs for more.
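A small self-contained sketch of the nullable integer route, here using 'Int64' rather than 'Int32' (a choice made for this example, not something the answer prescribes). The variable name `nullable` is illustrative.

```python
import pandas as pd

s = pd.Series(['1', '2', '3', '4', '.'])

# Coerce to float first, then cast to the nullable integer extension dtype.
# The unparseable entry stays missing instead of forcing a float column.
nullable = pd.to_numeric(s, errors='coerce').astype('Int64')

print(nullable)
print(nullable.sum())  # missing values are skipped by default
```

This cast is expected to work only when the non-missing values are whole numbers; a column containing something like 2.5 should be left as float.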
Extension for DataFrames
If you need to extend this to DataFrames, you will need to apply it to each column. You can do this using DataFrame.apply.
# Setup.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({
'A' : np.random.choice(10, 5),
'C' : np.random.choice(10, 5),
'B' : ['1', '###', '...', 50, '234'],
'D' : ['23', '1', '...', '268', '$$']}
)[list('ABCD')]
df
A B C D
0 5 1 9 23
1 0 ### 3 1
2 3 ... 5 ...
3 3 50 2 268
4 7 234 4 $$
df.dtypes
A int64
B object
C int64
D object
dtype: object
df2 = df.apply(pd.to_numeric, errors='coerce')
df2
A B C D
0 5 1.0 9 23.0
1 0 NaN 3 1.0
2 3 NaN 5 NaN
3 3 50.0 2 268.0
4 7 234.0 4 NaN
df2.dtypes
A int64
B float64
C int64
D float64
dtype: object
You can also do this with DataFrame.transform, although my tests indicate this is marginally slower:
df.transform(pd.to_numeric, errors='coerce')
A B C D
0 5 1.0 9 23.0
1 0 NaN 3 1.0
2 3 NaN 5 NaN
3 3 50.0 2 268.0
4 7 234.0 4 NaN
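To check that the two approaches agree, one could compare their results directly. This is a minimal sketch on a small made-up frame; the names `via_apply` and `via_transform` are illustrative.

```python
import pandas as pd

# Small hypothetical frame with mixed, partly unparseable values.
df = pd.DataFrame({
    'B': ['1', '###', 50],
    'D': ['23', '...', '$$'],
})

via_apply = df.apply(pd.to_numeric, errors='coerce')
via_transform = df.transform(pd.to_numeric, errors='coerce')

# DataFrame.equals treats NaNs in the same positions as equal.
print(via_apply.equals(via_transform))
```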
If you have many columns, some numeric and some not, you can make this a little more performant by applying pd.to_numeric to the non-numeric columns only.
df.dtypes.eq(object)
A False
B True
C False
D True
dtype: bool
cols = df.columns[df.dtypes.eq(object)]
# Actually, `cols` can be any list of columns you need to convert.
cols
# Index(['B', 'D'], dtype='object')
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
# Alternatively,
# for c in cols:
# df[c] = pd.to_numeric(df[c], errors='coerce')
df
A B C D
0 5 1.0 9 23.0
1 0 NaN 3 1.0
2 3 NaN 5 NaN
3 3 50.0 2 268.0
4 7 234.0 4 NaN
Applying pd.to_numeric along the columns (i.e., axis=0, the default) should be slightly faster for long DataFrames.
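The column-selection idea above could be wrapped in a small helper. This is only a sketch: the function name `coerce_non_numeric` is hypothetical, and it selects columns with `select_dtypes(exclude='number')` instead of comparing dtypes to object, which is a substitution made here so that string-typed columns are also picked up.

```python
import pandas as pd

def coerce_non_numeric(df):
    """Return a copy of df with every non-numeric column passed through pd.to_numeric."""
    out = df.copy()
    # Columns whose dtype is not already numeric.
    cols = out.select_dtypes(exclude='number').columns
    if len(cols):
        out[cols] = out[cols].apply(pd.to_numeric, errors='coerce')
    return out

# Same values as the example frame above, written out literally.
df = pd.DataFrame({
    'A': [5, 0, 3, 3, 7],
    'B': ['1', '###', '...', 50, '234'],
    'C': [9, 3, 5, 2, 4],
    'D': ['23', '1', '...', '268', '$$'],
})

result = coerce_non_numeric(df)
print(result)
print(result.dtypes)
```

Working on a copy leaves the original frame untouched, which makes it easy to compare the raw and converted values afterwards.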
Answer by Jeff
In [30]: pd.Series([1,2,3,4,'.']).convert_objects(convert_numeric=True)
Out[30]:
0 1
1 2
2 3
3 4
4 NaN
dtype: float64

