Python: Prevent pandas from interpreting 'NA' as NaN in a string

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/33952142/

Time: 2020-08-19 14:14:38  Source: igfitidea

Prevent pandas from interpreting 'NA' as NaN in a string

Tags: python, pandas

Asked by binarysubstrate

The pandas read_csv() method interprets 'NA' as nan (not a number) instead of as a valid string.

In the simple case below note that the output in row 1, column 2 (zero based count) is 'nan' instead of 'NA'.

sample.tsv (tab delimited)

PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 1 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118

read_sample.py

import pandas as pd

df = pd.read_csv(
    'sample.tsv',
    sep='\t',
    encoding='utf-8',
)

for df_tuples in df.itertuples(index=True):
    print(df_tuples)

output

(0, u'5d8b', u'N', u'P60490', 1, 146, 1, 146, 1, 146)
(1, u'5d8b', nan, u'P80377', 1, 126, 1, 126, 1, 126)
(2, u'5d8b', u'O', u'P60491', 1, 118, 1, 118, 1, 118)

Additional Information

Rewriting the file with quotes around the data in the 'CHAIN' column and then passing quotechar='\'' gives the same result. Passing a dictionary of types via the dtype parameter, dtype=dict(valid_cols), does not change the result either.
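
A minimal sketch reproducing the dtype observation, assuming a recent pandas and an inline copy of a few columns of sample.tsv. The explicit dtype mapping below is only a stand-in for dict(valid_cols), whose contents are not shown in the question.

```python
import pandas as pd
from io import StringIO

# Inline, tab-delimited stand-in for the first columns of sample.tsv.
data = (
    "PDB\tCHAIN\tSP_PRIMARY\n"
    "5d8b\tN\tP60490\n"
    "5d8b\tNA\tP80377\n"
    "5d8b\tO\tP60491\n"
)

# Hypothetical dtype mapping; the question's valid_cols is not shown.
df = pd.read_csv(StringIO(data), sep="\t", dtype={"CHAIN": str})

# Missing-value detection still applies to the column,
# so 'NA' is read as NaN even though str was requested.
print(df["CHAIN"].isna().tolist())
```

Requesting a string dtype alone does not switch off the default NA markers, which is why the answers below reach for keep_default_na, na_values, or na_filter.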

An old answer to "Prevent pandas from automatically inferring type in read_csv" suggests first using a numpy record array to parse the file, but given that column dtypes can now be specified, this shouldn't be necessary.

Note that itertuples() is used to preserve dtypes as described in the iterrows documentation: "To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns tuples of the values and which is generally faster as iterrows."

Example was tested on Python 2 and 3 with pandas version 0.16.2, 0.17.0, and 0.17.1.

Is there a way to capture a valid string 'NA' instead of it being converted to nan?

Accepted answer by Anton Protopopov

You could use the keep_default_na and na_values parameters to set all NA values by hand (see the docs):

import pandas as pd
from io import StringIO

data = """
PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 _ 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118
"""

df = pd.read_csv(StringIO(data), sep=' ', keep_default_na=False, na_values=['_'])

In [130]: df
Out[130]:
    PDB CHAIN SP_PRIMARY  RES_BEG  RES_END  PDB_BEG  PDB_END  SP_BEG  SP_END
0  5d8b     N     P60490        1      146        1      146       1     146
1  5d8b    NA     P80377      NaN      126        1      126       1     126
2  5d8b     O     P60491        1      118        1      118       1     118

In [144]: df.CHAIN.apply(type)
Out[144]:
0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
Name: CHAIN, dtype: object
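
If only some columns should treat 'NA' as missing, na_values also accepts a per-column dict. The following is a minimal sketch, assuming a recent pandas; the comma-separated sample and the choice of RES_BEG as the column with missing data are made up for illustration.

```python
import pandas as pd
from io import StringIO

# Illustrative data: 'NA' is a real chain ID in CHAIN,
# but marks a missing number in RES_BEG.
data = (
    "PDB,CHAIN,RES_BEG\n"
    "5d8b,N,1\n"
    "5d8b,NA,NA\n"
    "5d8b,O,118\n"
)

df = pd.read_csv(
    StringIO(data),
    keep_default_na=False,          # drop the built-in NA markers
    na_values={"RES_BEG": ["NA"]},  # re-enable 'NA' for this column only
)

print(df)
print(df.dtypes)
```

With this setup CHAIN keeps the literal string 'NA', while RES_BEG is parsed as numeric with a missing value in the second row.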

EDIT

All default NA values, from the na_values documentation (as of pandas 1.0.0):

The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '<NA>', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].
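
Because keep_default_na=False discards the whole default list, one option is to pass back a hand-picked subset that leaves out 'NA'. This is a minimal sketch, assuming a recent pandas; the na_markers list and the sample rows are illustrative, not an official pandas constant.

```python
import pandas as pd
from io import StringIO

# Hand-picked subset of the default markers, deliberately without 'NA'.
na_markers = ["", "#N/A", "N/A", "n/a", "NULL", "null", "NaN", "nan"]

# Illustrative data: one literal 'NA', one 'NULL', one empty field.
data = (
    "country_name,country_code\n"
    "Mexico,MX\n"
    "Namibia,NA\n"
    "Unknown,NULL\n"
    "Missing,\n"
)

df = pd.read_csv(StringIO(data), keep_default_na=False, na_values=na_markers)

print(df)
print(df["country_code"].isna().tolist())
```

Here 'NA' survives as a string, while 'NULL' and the empty field are still treated as missing.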

Answer by Matthew Coelho

For me, the solution was to use the parameter na_filter=False:

df = pd.read_csv(file_, header=0, dtype=object, na_filter=False)
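
A minimal sketch of what na_filter=False implies, assuming a recent pandas and made-up inline data: missing-value detection is switched off entirely, so empty fields come back as empty strings rather than NaN.

```python
import pandas as pd
from io import StringIO

# Illustrative data with a literal 'NA' and an empty population field.
data = (
    "country_name,country_code,population\n"
    "Mexico,MX,126\n"
    "Namibia,NA,\n"
)

df = pd.read_csv(StringIO(data), dtype=object, na_filter=False)

print(df)
# No cell is flagged as missing, not even the empty one.
print(df.isna().any().any())
```

This is the bluntest of the options: it fits files that are known to contain no missing data, but it gives up NaN detection for every column.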

Answer by arsho

Setting the keep_default_na parameter does the trick.

Here is an example of keeping NA as a string value while reading a CSV file using pandas.

data.csv:

country_name,country_code
Mexico,MX
Namibia,NA

read_data.py:

import pandas as pd
data = pd.read_csv("data.csv", keep_default_na=False)
print(data.describe())
print(data)

Output:

       country_name country_code
count             2            2
unique            2            2
top         Namibia           MX
freq              1            1

  country_name country_code
0       Mexico           MX
1      Namibia           NA
