Pandas convert string to int
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/42719749/
Asked by gmarais
I have a large dataframe with ID numbers:
ID.head()
Out[64]:
0 4806105017087
1 4806105017087
2 4806105017087
3 4901295030089
4 4901295030089
These are all strings at the moment.
I want to convert them to int without using loops. For this I use ID.astype(int).
The problem is that some of my rows contain dirty data that cannot be converted to int, e.g.
ID[154382]
Out[58]: 'CN414149'
How can I (without using loops) remove these kinds of occurrences so that I can use astype with peace of mind?
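For illustration, here is a minimal sketch of the failure described above. The three-row Series is made up for the example, and "int64" is used instead of int only to keep the integer width the same on every platform.

```python
import pandas as pd

# Hypothetical miniature of the ID column: two clean values and one dirty one.
ID = pd.Series(["4806105017087", "4901295030089", "CN414149"])

try:
    ID.astype("int64")
    failed = False
except ValueError as exc:
    # A single non-numeric string is enough to make the whole cast fail.
    failed = True
    print(f"astype failed: {exc}")
```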
Answered by jezrael
You need to add the parameter errors='coerce' to the function to_numeric:
ID = pd.to_numeric(ID, errors='coerce')
If ID is a column:
df.ID = pd.to_numeric(df.ID, errors='coerce')
However, non-numeric values are converted to NaN, so all values become float.
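Because coerced values become NaN, the same call can also be used to see which original strings were dirty. A small sketch, assuming a DataFrame shaped like the one in the question (the data and the names is_bad and bad_values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data: one value cannot be parsed as a number.
df = pd.DataFrame({"ID": ["4806105017087", "4806105017087", "CN414149"]})

# Values that fail to parse come back as NaN, so isna() flags the dirty rows.
is_bad = pd.to_numeric(df["ID"], errors="coerce").isna()

# Look up the original strings for the flagged rows.
bad_values = df.loc[is_bad, "ID"].tolist()
print(bad_values)
```

Inspecting the flagged rows first can help decide whether to fill, drop, or repair them.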
To get int, you need to convert NaN to some value, e.g. 0, and then cast to int:
df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['4806105017087','4806105017087','CN414149']})
print (df)
ID
0 4806105017087
1 4806105017087
2 CN414149
print (pd.to_numeric(df.ID, errors='coerce'))
0 4.806105e+12
1 4.806105e+12
2 NaN
Name: ID, dtype: float64
df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)
print (df)
ID
0 4806105017087
1 4806105017087
2 0
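The question asks about removing the dirty rows rather than replacing them with 0. One possible variation, not part of the original answer, is to drop the NaN rows before casting. A sketch on the same sample data (the name clean is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"ID": ["4806105017087", "4806105017087", "CN414149"]})

# Coerce unparseable strings to NaN, as in the answer above.
df["ID"] = pd.to_numeric(df["ID"], errors="coerce")

# Drop the rows that failed to parse, then cast the remainder to a plain int64.
clean = df.dropna(subset=["ID"]).astype({"ID": "int64"})
print(clean)
```

This keeps no placeholder value in the column, at the cost of losing the dirty rows entirely.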
EDIT: If you use pandas 0.25+, it is possible to use the nullable integer data type (integer_na):
df.ID = pd.to_numeric(df.ID, errors='coerce').astype('Int64')
print (df)
ID
0 4806105017087
1 4806105017087
2 NaN
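As a quick check of what the nullable type gives you, the sketch below (same made-up sample data, hypothetical name ids) confirms that the column is an integer dtype while the dirty row stays missing. Newer pandas versions may display the missing entry as <NA> rather than NaN.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["4806105017087", "4806105017087", "CN414149"]})

# 'Int64' (capital I) is the nullable integer dtype; it can hold missing values.
ids = pd.to_numeric(df["ID"], errors="coerce").astype("Int64")

print(ids.dtype)            # Int64
print(ids.isna().tolist())  # [False, False, True]
```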