Pandas reading NULL as a NaN float instead of str

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/44128033/

Time: 2020-09-14 03:40:23  Source: igfitidea

Tags: python, pandas, dataframe, types, nan

Asked by alvas

Given the file:

$ cat test.csv 
a,b,c,NULL,d
e,f,g,h,i
j,k,l,m,n

where the column labeled 3 (the fourth column) is to be treated as str.

When I applied a string function to the column, it turned out that pandas had read the NULL str as a NaN float:

>>> import pandas as pd
>>> df = pd.read_csv('test.csv', names=[0,1,2,3,4], dtype={0:str, 1:str, 2:str, 3:str, 4:str})

>>> df[3].apply(str.strip)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 2355, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/src/inference.pyx", line 1569, in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)
TypeError: descriptor 'strip' requires a 'str' object but received a 'float'

To verify:

>>> for i in df[3]:
...    print (type(i), i)
... 
<class 'float'> nan
<class 'str'> h
<class 'str'> m

I've specified the dtype at initialization, but somehow it got overridden.

How do I force the type of a specific column to be fixed?

Is there a way to automatically find these abnormal NaN floats and change them back to the 'NULL' string?

Answered by jezrael

For me, astype works (note that the missing value becomes the string 'nan', not 'NULL'):

df[3] = df[3].astype(str)

for i in df[3]:
    print (type(i), i)

<class 'str'> nan
<class 'str'> h
<class 'str'> m
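
Since astype(str) leaves the literal string 'nan' behind, one possible follow-up for the second part of the question (finding the NaN values and turning them back into 'NULL') is isna plus fillna. This is only a minimal sketch on the question's sample data, and it assumes that every missing value in the column was originally the text NULL; an empty field or an NA marker would be filled the same way.

```python
from io import StringIO

import pandas as pd

# Same sample data as in the question, inlined so the sketch is self-contained.
temp = "a,b,c,NULL,d\ne,f,g,h,i\nj,k,l,m,n"
df = pd.read_csv(StringIO(temp), names=[0, 1, 2, 3, 4], dtype=str)

# Boolean mask of the rows where pandas replaced the text with a missing value.
mask = df[3].isna()
print(mask.tolist())

# Assumption: every missing value in this column was originally the text 'NULL'.
df[3] = df[3].fillna("NULL")

# String functions now work because every element is a str again.
print(df[3].apply(str.strip).tolist())
```

The mask is kept separate so the affected rows can be inspected before anything is overwritten.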

Another solution is to use keep_default_na=False in read_csv:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in later pandas versions

temp=u"""a,b,c,NULL,d
e,f,g,h,i
j,k,l,m,n"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), names=[0,1,2,3,4], keep_default_na=False)
print (df)
   0  1  2     3  4
0  a  b  c  NULL  d
1  e  f  g     h  i
2  j  k  l     m  n

for i in df[3]:
    print (type(i), i)
<class 'str'> NULL
<class 'str'> h
<class 'str'> m
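
If no column in the file needs missing-value detection at all, read_csv also has an na_filter parameter that switches the detection off entirely. The sketch below is an alternative to keep_default_na=False, not something the original answer used, and it again assumes the question's sample data.

```python
from io import StringIO

import pandas as pd

# Sample data from the question, inlined for a self-contained run.
temp = "a,b,c,NULL,d\ne,f,g,h,i\nj,k,l,m,n"

# na_filter=False disables detection of missing-value markers,
# so the text 'NULL' is kept as an ordinary string.
df = pd.read_csv(
    StringIO(temp),
    names=[0, 1, 2, 3, 4],
    dtype=str,
    na_filter=False,
)

# The string function from the question no longer raises TypeError.
stripped = df[3].apply(str.strip)
print(stripped.tolist())
```

The trade-off is that empty fields and markers such as NA are also kept as plain text, so this only fits files where nothing should be parsed as missing.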

Then it is possible to use the na_values parameter if you need to parse NaN in numeric columns, but the missing-value marker has to be a different string, e.g. NA:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in later pandas versions

temp=u"""a,b,c,NULL,1
e,f,g,h,2
j,k,l,m,NA"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), names=[0,1,2,3,4], keep_default_na=False, na_values=['NA'])
print (df)
   0  1  2     3    4
0  a  b  c  NULL  1.0
1  e  f  g     h  2.0
2  j  k  l     m  NaN

for i in df[3]:
    print (type(i), i)
<class 'str'> NULL
<class 'str'> h
<class 'str'> m

for i in df[4]:
    print (type(i), i)
<class 'numpy.float64'> 1.0
<class 'numpy.float64'> 2.0
<class 'numpy.float64'> nan
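
A list passed to na_values applies to every column, so an NA in a text column would also become NaN. read_csv accepts a dict for na_values to limit the markers to specific columns. The following is a sketch under assumptions: the sample data is made up for illustration (it adds an NA to column 3), and the dict keys are the column labels given in names.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: 'NA' appears both in text column 3 and numeric column 4.
temp = "a,b,c,NULL,1\ne,f,g,NA,2\nj,k,l,m,NA"

# Only column 4 treats 'NA' as missing; with keep_default_na=False,
# columns not listed in the dict have no missing-value markers at all.
df = pd.read_csv(
    StringIO(temp),
    names=[0, 1, 2, 3, 4],
    keep_default_na=False,
    na_values={4: ["NA"]},
)

print(df[3].tolist())         # text column keeps 'NULL' and 'NA' as strings
print(df[4].isna().tolist())  # numeric column marks only the last row as missing
```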