pandas reads NULL as a NaN float instead of str

Question

Asked by alvas

Given the file:

$ cat test.csv 
a,b,c,NULL,d
e,f,g,h,i
j,k,l,m,n

Here column 3 (the fourth field, holding NULL, h and m) is to be treated as str.

When I apply a string function to that column, it turns out that pandas has read the string NULL as a float NaN:

>>> import pandas as pd
>>> df = pd.read_csv('test.csv', names=[0,1,2,3,4], dtype={0:str, 1:str, 2:str, 3:str, 4:str})

>>> df[3].apply(str.strip)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 2355, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/src/inference.pyx", line 1569, in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)
TypeError: descriptor 'strip' requires a 'str' object but received a 'float'

To verify:

>>> for i in df[3]:
...    print (type(i), i)
... 
<class 'float'> nan
<class 'str'> h
<class 'str'> m

I specified the dtype when reading the file, but somehow it got overridden.

How do I force the type of a specific column to be fixed?

Is there a way to automatically find these unexpected NaN floats and change them back to the string 'NULL'?

Answer 1

Answered by jezrael

For me, astype works, although the missing value then becomes the string 'nan' rather than 'NULL':

df[3] = df[3].astype(str)

for i in df[3]:
    print (type(i), i)

<class 'str'> nan
<class 'str'> h
<class 'str'> m
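
The astype output above leaves the string 'nan' in place of the original text. As a minimal sketch of the second part of the question (finding the NaN values and turning them back into 'NULL'), isna and fillna could be used. This assumes every missing value in the column really came from the literal NULL; the default parser also treats strings such as NA or an empty field as missing, and those could not be told apart afterwards. The variable names csv_text and missing are only illustrative.

```python
from io import StringIO

import pandas as pd

# Same three rows as test.csv, inlined so the sketch is self-contained
csv_text = "a,b,c,NULL,d\ne,f,g,h,i\nj,k,l,m,n\n"
df = pd.read_csv(StringIO(csv_text), names=[0, 1, 2, 3, 4], dtype=str)

# Boolean mask of the cells the parser converted to NaN
missing = df[3].isna()

# Put the literal string back (assumption: every NaN here was 'NULL')
df[3] = df[3].fillna("NULL")

print(missing.tolist())
print(df[3].tolist())
```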

Another solution is to pass keep_default_na=False to read_csv:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO no longer exists in recent pandas

temp = u"""a,b,c,NULL,d
e,f,g,h,i
j,k,l,m,n"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), names=[0,1,2,3,4], keep_default_na=False)
print (df)
   0  1  2     3  4
0  a  b  c  NULL  d
1  e  f  g     h  i
2  j  k  l     m  n

for i in df[3]:
    print (type(i), i)
<class 'str'> NULL
<class 'str'> h
<class 'str'> m
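
With keep_default_na=False every value in the column stays a str, so the string operation that failed in the question should go through. A short sketch of that check, using the .str accessor rather than apply(str.strip); the name csv_text is only illustrative:

```python
from io import StringIO

import pandas as pd

csv_text = "a,b,c,NULL,d\ne,f,g,h,i\nj,k,l,m,n\n"

# keep_default_na=False: 'NULL' is no longer in the list of missing-value markers
df = pd.read_csv(
    StringIO(csv_text),
    names=[0, 1, 2, 3, 4],
    dtype=str,
    keep_default_na=False,
)

# No float NaN is present, so stripping works on every row
stripped = df[3].str.strip()
print(stripped.tolist())
```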

If you still need NaN parsing in numeric columns, you can then add the na_values parameter, but the marker has to be a different string than NULL, e.g. NA:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO no longer exists in recent pandas

temp = u"""a,b,c,NULL,1
e,f,g,h,2
j,k,l,m,NA"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), names=[0,1,2,3,4], keep_default_na=False, na_values=['NA'])
print (df)
   0  1  2     3    4
0  a  b  c  NULL  1.0
1  e  f  g     h  2.0
2  j  k  l     m  NaN

for i in df[3]:
    print (type(i), i)
<class 'str'> NULL
<class 'str'> h
<class 'str'> m

for i in df[4]:
    print (type(i), i)
<class 'numpy.float64'> 1.0
<class 'numpy.float64'> 2.0
<class 'numpy.float64'> nan
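
A possible variation on the na_values example above: read_csv also accepts a dict for na_values, which limits a marker to specific columns. The sketch below assumes that only column 4 should treat NA as missing, so an NA appearing in any text column would be kept as a plain string. The name csv_text is only illustrative.

```python
from io import StringIO

import pandas as pd

csv_text = "a,b,c,NULL,1\ne,f,g,h,2\nj,k,l,m,NA\n"

# The dict maps a column label to the markers treated as missing in that column;
# with keep_default_na=False, columns not listed get no missing-value markers
df = pd.read_csv(
    StringIO(csv_text),
    names=[0, 1, 2, 3, 4],
    keep_default_na=False,
    na_values={4: ["NA"]},
)

print(df[3].tolist())
print(df[4].isna().tolist())
```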
