Original question: http://stackoverflow.com/questions/18471859/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Inconsistent pandas read_csv dtype inference on mostly-integer string column in huge TSV file
Asked by andrew
I have a tab-separated file with a column that should be interpreted as a string, but many of the entries are integers. With small files, read_csv correctly interprets the column as a string after seeing some non-integer values, but with larger files this doesn't work:
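import pandas as pd
# Column 'a' holds only strings: 100000 '1's, then 100000 'X's, then 100000 '1's.
df = pd.DataFrame({'a': ['1']*100000 + ['X']*100000 + ['1']*100000, 'b': ['b']*300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')
print(df2['a'].unique())
# Inspect the values on either side of row 262144.
for a in df2['a'][262140:262150]:
    print(repr(a))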
output:
['1' 'X' 1]
'1'
'1'
'1'
'1'
1
1
1
1
1
1
Interestingly, 262144 is a power of 2 (2^18), so I think inference and conversion are happening in chunks, but some chunks are being skipped.
I am fairly certain this is a bug, but I would like a workaround, perhaps one that uses quoting, though adding quoting=csv.QUOTE_NONNUMERIC for reading and writing does not fix the problem. Ideally I could work around this by quoting my string data and somehow forcing pandas not to do any inference on quoted data.
Using pandas 0.12.0
Accepted answer by Andy Hayden
You've tricked the read_csv parser here (and to be fair, I don't think it can always be expected to output correctly no matter what you throw at it)... but yes, it could be a bug!
As @Steven points out, you can use the converters argument of read_csv:
df2 = pd.read_csv('test', sep='\t', converters={'a': str})
A lazy solution is just to patch this up after you've read in the file:
In [11]: df2['a'] = df2['a'].astype('str')
# now they are equal
# (in recent pandas releases this helper is exposed as pd.testing.assert_frame_equal)
In [12]: pd.util.testing.assert_frame_equal(df, df2)
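To see the patch-up step in isolation, here is a minimal sketch. It assumes a small hand-built Series (the name `mixed` is made up for illustration) that imitates the mixed column from the question, rather than re-reading the 300000-row file:

```python
import pandas as pd

# Hypothetical stand-in for df2['a']: strings and a real integer side by side,
# as in the ['1' 'X' 1] output shown in the question.
mixed = pd.Series(["1", "X", 1], dtype=object)

# astype(str) converts every element to a string, so 1 and '1' become the same value.
fixed = mixed.astype(str)

print([repr(v) for v in fixed])
print(list(fixed.unique()))
```

After the conversion, the integer 1 and the string '1' collapse into a single unique value, which is why the comparison against the original frame can succeed.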
Note: If you are looking for a solution to store DataFrames, e.g. between sessions, both pickle and HDFStore (HDF5) are excellent solutions which won't be affected by these types of parsing bugs (and will be considerably faster). See: How to store data frame using PANDAS, Python
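As a rough illustration of the pickle route mentioned in the note, the sketch below round-trips a small frame through `DataFrame.to_pickle` and `pd.read_pickle`. The file name `frame.pkl` and the three-row frame are assumptions made for the example; the HDF5 route is not shown because it needs the optional PyTables dependency.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the frame in the question: column 'a' is all strings.
df = pd.DataFrame({"a": ["1", "X", "1"], "b": ["b", "b", "b"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")  # hypothetical file name
    df.to_pickle(path)                     # serializes the frame as-is, no text parsing
    restored = pd.read_pickle(path)

# The values come back as the same Python strings; nothing is re-inferred.
restored_types = set(type(v) for v in restored["a"])
frames_match = restored.equals(df)
print(restored_types, frames_match)
```

Because the frame never passes through a text format, there is no type inference step that could turn '1' into 1.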
Answer by Steven Rumbalski
To avoid having Pandas infer your data type, provide a converters argument to read_csv:
converters : dict, optional. Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
For your file this would look like:
df2 = pd.read_csv('test', sep='\t', converters={'a':str})
My reading of the docs is that you do not need to specify converters for every column. Pandas should continue to infer the datatype of unspecified columns.
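A quick way to check that reading of the docs is the sketch below. It assumes a tiny in-memory TSV with a made-up third column `c`, not the 'test' file from the question, and passes a converter for column 'a' only:

```python
import io

import pandas as pd

# Hypothetical in-memory TSV standing in for the file from the question;
# column 'c' is added here only to show inference on an unspecified column.
tsv = "a\tb\tc\n1\tx\t10\nX\ty\t20\n1\tz\t30\n"

df = pd.read_csv(io.StringIO(tsv), sep="\t", converters={"a": str})

# Every value in 'a' passed through str; 'c' had no converter and is still inferred.
a_types = set(type(v) for v in df["a"])
c_kind = df["c"].dtype.kind  # 'i' means a signed integer dtype
print(a_types, c_kind)
```

In this small case only column 'a' is forced to strings, while 'c' still ends up with an integer dtype, which matches the behavior described above.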

