Python: Prevent pandas from interpreting 'NA' as NaN in a string
Original question: http://stackoverflow.com/questions/33952142/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by binarysubstrate
The pandas read_csv() method interprets 'NA' as nan (not a number) instead of a valid string.
In the simple case below, note that the output in row 1, column 2 (zero-based count) is 'nan' instead of 'NA'.
sample.tsv (tab delimited)
PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 1 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118
read_sample.py
import pandas as pd

df = pd.read_csv(
    'sample.tsv',
    sep='\t',
    encoding='utf-8',
)

# Print each row as a tuple: (index, PDB, CHAIN, ...)
for df_tuples in df.itertuples(index=True):
    print(df_tuples)
Output:
(0, u'5d8b', u'N', u'P60490', 1, 146, 1, 146, 1, 146)
(1, u'5d8b', nan, u'P80377', 1, 126, 1, 126, 1, 126)
(2, u'5d8b', u'O', u'P60491', 1, 118, 1, 118, 1, 118)
Additional Information
Re-writing the file with quotes for data in the 'CHAIN' column and then using the quotechar parameter quotechar='\'' has the same result. Passing a dictionary of types via the dtype parameter dtype=dict(valid_cols) does not change the result either.
An old answer to "Prevent pandas from automatically inferring type in read_csv" suggests first using a numpy record array to parse the file, but given the ability to now specify column dtypes, this shouldn't be necessary.
Note that itertuples() is used to preserve dtypes as described in the iterrows documentation: "To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns tuples of the values and which is generally faster as iterrows."
Example was tested on Python 2 and 3 with pandas version 0.16.2, 0.17.0, and 0.17.1.
Is there a way to capture a valid string 'NA' instead of it being converted to nan?
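For readers who want to reproduce the behaviour without creating a file, here is a minimal sketch that inlines a shortened version of sample.tsv through StringIO. The variable names (tsv, df_default, df_typed) are illustrative and not part of the original question. It also illustrates the point above that passing a dtype for the CHAIN column does not prevent the conversion.

```python
from io import StringIO

import pandas as pd

# A shortened copy of sample.tsv, with explicit tab separators.
tsv = (
    "PDB\tCHAIN\tSP_PRIMARY\tRES_BEG\tRES_END\n"
    "5d8b\tN\tP60490\t1\t146\n"
    "5d8b\tNA\tP80377\t1\t126\n"
    "5d8b\tO\tP60491\t1\t118\n"
)

# Default settings: the literal string 'NA' is parsed as a missing value.
df_default = pd.read_csv(StringIO(tsv), sep="\t")
print(df_default["CHAIN"].isna().tolist())

# Declaring the column as a string type does not change that.
df_typed = pd.read_csv(StringIO(tsv), sep="\t", dtype={"CHAIN": str})
print(df_typed["CHAIN"].isna().tolist())
```

Both print statements should report the second row as missing, which is the problem the question describes.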
Accepted answer by Anton Protopopov
You could use the parameters keep_default_na and na_values to set all NA values by hand (see the docs):
import pandas as pd
from io import StringIO
data = """
PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 _ 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118
"""
df = pd.read_csv(StringIO(data), sep=' ', keep_default_na=False, na_values=['_'])
In [130]: df
Out[130]:
PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
0 5d8b N P60490 1 146 1 146 1 146
1 5d8b NA P80377 NaN 126 1 126 1 126
2 5d8b O P60491 1 118 1 118 1 118
In [144]: df.CHAIN.apply(type)
Out[144]:
0 <class 'str'>
1 <class 'str'>
2 <class 'str'>
Name: CHAIN, dtype: object
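If the placeholder for missing data should only apply to some columns, na_values also accepts a dictionary keyed by column name. The following is a sketch under that assumption, using a shortened version of the data above. Restricting '_' to the RES_BEG column is an illustrative choice, not something the answer prescribes.

```python
from io import StringIO

import pandas as pd

# Shortened version of the sample data, separated by single spaces.
data = (
    "PDB CHAIN SP_PRIMARY RES_BEG RES_END\n"
    "5d8b N P60490 1 146\n"
    "5d8b NA P80377 _ 126\n"
    "5d8b O P60491 1 118\n"
)

df = pd.read_csv(
    StringIO(data),
    sep=" ",
    keep_default_na=False,         # do not treat 'NA', 'NULL', etc. as missing
    na_values={"RES_BEG": ["_"]},  # '_' marks missing data in RES_BEG only
)

print(df["CHAIN"].tolist())
print(df["RES_BEG"].isna().tolist())
```

With this setup 'NA' in CHAIN stays a string, while the '_' in RES_BEG is still read as a missing value.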
EDIT
All default NA values from na-values (as of pandas 1.0.0):
The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '<NA>', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].
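One way to keep most of the default behaviour while exempting 'NA' is to turn the defaults off and pass back a hand-written list without it. The sketch below assumes the pandas 1.0.0 defaults. The list name, the sample rows and the "Unknown" row are made up for illustration, and the defaults have changed between releases, so check the read_csv documentation for your version before relying on it.

```python
from io import StringIO

import pandas as pd

# Hand-copied NA markers (assumed from the pandas 1.0.0 docs), with 'NA' left out.
NA_MARKERS_WITHOUT_NA = [
    "-1.#IND", "1.#QNAN", "1.#IND", "-1.#QNAN", "#N/A N/A", "#N/A",
    "N/A", "n/a", "<NA>", "#NA", "NULL", "null", "NaN", "-NaN",
    "nan", "-nan", "",
]

# Illustrative data: 'NA' is a real country code, the last code is truly empty.
csv_text = "country_name,country_code\nMexico,MX\nNamibia,NA\nUnknown,\n"

df = pd.read_csv(
    StringIO(csv_text),
    keep_default_na=False,
    na_values=NA_MARKERS_WITHOUT_NA,
)

codes = df["country_code"].tolist()
print(codes)
```

Here 'NA' should survive as a string, while the empty field in the last row is still treated as missing.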
Answer by Matthew Coelho
For me, the solution came from using the parameter na_filter=False:
df = pd.read_csv(file_, header=0, dtype=object, na_filter = False)
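Note that na_filter=False turns off missing-value detection entirely, so empty fields come back as empty strings rather than NaN. A small sketch of that trade-off follows. The inline data and the population column are hypothetical, and dtype=str is used here in place of dtype=object to keep every column as text.

```python
from io import StringIO

import pandas as pd

# Illustrative data: the second row has an empty population field.
csv_text = "country_name,country_code,population\nMexico,MX,100\nNamibia,NA,\n"

# na_filter=False: no value is ever converted to NaN.
df = pd.read_csv(StringIO(csv_text), dtype=str, na_filter=False)

print(df["country_code"].tolist())
print(df["population"].tolist())
print(df.isna().any().any())
```

This is convenient when no column has genuinely missing data; otherwise the empty strings need to be handled after loading.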
Answer by arsho
Setting the keep_default_na parameter does the trick.
Here is an example of keeping NA as a string value while reading a CSV file using pandas.
data.csv:
country_name,country_code
Mexico,MX
Namibia,NA
read_data.py:
import pandas as pd
data = pd.read_csv("data.csv", keep_default_na=False)
print(data.describe())
print(data)
Output:
country_name country_code
count 2 2
unique 2 2
top Namibia MX
freq 1 1
country_name country_code
0 Mexico MX
1 Namibia NA