Disclaimer: this page is a bilingual translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/24761122/

Date: 2020-08-19 05:08:21  Source: igfitidea

Pandas read_csv ignoring column dtypes when I pass skip_footer arg

Tags: python, python-2.7, csv, pandas

Asked by Ripster

When I try to import a CSV file into a dataframe, pandas (0.13.1) ignores the dtype parameter. Is there a way to stop pandas from inferring the data type on its own?

I am merging several CSV files, and sometimes the CUSTOMER column contains letters, in which case pandas imports it as a string. When I then try to merge two such dataframes I get an error, because I am trying to merge columns of two different types. I need everything stored as strings.

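The merge failure described above can be avoided by normalising the key column on both sides first. A minimal sketch, using made-up frames that stand in for two of the imported CSV files:

```python
import pandas as pd

# Hypothetical frames standing in for two imported CSV files: pandas inferred
# CUSTOMER as int64 in one file and as object (string) in the other.
left = pd.DataFrame({"CUSTOMER": [3106, 3175], "ORDER NO": [253734, 262207]})
right = pd.DataFrame({"CUSTOMER": ["03106", "03175"], "ERROR": ["", ""]})

# Merging on mismatched key dtypes either raises or matches nothing, so
# normalise the key to zero-padded strings on both sides before merging.
left["CUSTOMER"] = left["CUSTOMER"].astype(str).str.zfill(5)
merged = left.merge(right, on="CUSTOMER")
```

The zfill(5) restores the leading zeros that the int64 inference dropped, so the keys line up again.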
Data snippet:

|WAREHOUSE|ERROR|CUSTOMER|ORDER NO|
|---------|-----|--------|--------|
|3615     |     |03106   |253734  |
|3615     |     |03156   |290550  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |

Import line:

df = pd.read_csv("SomeFile.csv", 
                 header=1,
                 skip_footer=1, 
                 usecols=[2, 3], 
                 dtype={'ORDER NO': str, 'CUSTOMER': str})

df.dtypes outputs this:

ORDER NO    int64
CUSTOMER    int64
dtype: object

Accepted answer by Ripster

Pandas 0.13.1 silently ignored the dtype argument because the C engine does not support skip_footer. This caused pandas to fall back to the Python engine, which does not support dtype.

Solution? Use converters

df = pd.read_csv('SomeFile.csv', 
                 header=1,
                 skip_footer=1, 
                 usecols=[2, 3], 
                 converters={'CUSTOMER': str, 'ORDER NO': str},
                 engine='python')

Output:

In [1]: df.dtypes
Out[2]:
CUSTOMER    object
ORDER NO    object
dtype: object

In [3]: type(df['CUSTOMER'][0])
Out[4]: str

In [5]: df.head()
Out[6]:
  CUSTOMER ORDER NO
0    03106   253734
1    03156   290550
2    03175   262207
3    03175   262207
4    03175   262207

Leading 0's from the original file are preserved and all data is stored as strings.

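In later pandas releases the Python engine gained dtype support (and skip_footer was renamed skipfooter), so the converters workaround is no longer required there. A sketch of the question's call on a reasonably modern pandas, with an inline stand-in for the file:

```python
import io
import pandas as pd

# Same layout as the question's file: a junk first line, the real header,
# data rows, and a trailing footer line to skip.
raw = ("report\n"
       "WAREHOUSE,ERROR,CUSTOMER,ORDER NO\n"
       "3615,,03106,253734\n"
       "3615,,03156,290550\n"
       "footer\n")

# skipfooter forces the python engine anyway; newer pandas honours dtype there.
df = pd.read_csv(io.StringIO(raw),
                 header=1,
                 skipfooter=1,
                 usecols=[2, 3],
                 dtype={"CUSTOMER": str, "ORDER NO": str},
                 engine="python")
```

The leading zeros survive because both columns are read as strings, exactly as the original dtype argument intended.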
Answered by Rune Lyngsoe

Unfortunately, neither converters nor newer pandas versions solve the more general problem of always ensuring that read_csv does not infer a float64 dtype. With pandas 0.15.2, the following example, using a CSV that contains integers in hexadecimal notation along with NULL entries, shows that using converters for what their name implies they should be used for interferes with the dtype specification.

In [1]: df = pd.DataFrame(dict(a = ["0xff", "0xfe"], b = ["0xfd", None], c = [None, "0xfc"], d = [None, None]))
In [2]: df.to_csv("H:/tmp.csv", index = False)
In [3]: ef = pd.read_csv("H:/tmp.csv", dtype = {c: object for c in "abcd"}, converters = {c: lambda x: None if x == "" else int(x, 16) for c in "abcd"})
In [4]: ef.dtypes.map(lambda x: x)
Out[4]:
a      int64
b    float64
c    float64
d     object
dtype: object

The specified dtype of object is only respected for the all-NULL column. In this case the float64 values can simply be converted back to integers, but by the pigeonhole principle, not all 64-bit integers can be represented exactly as a float64.

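The pigeonhole observation can be checked directly: a float64 carries a 53-bit significand, so integers above 2**53 cannot all be represented exactly, and a round-trip through float silently changes them.

```python
# 2**53 is the last integer where every smaller integer survives a float
# round-trip; 2**53 + 1 is the first one that does not.
exact = 2 ** 53
lossy = 2 ** 53 + 1

assert int(float(exact)) == exact      # still exactly representable
assert int(float(lossy)) != lossy      # rounds back down to 2**53
```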
The best solution I have found for this more general case is to have pandas read the potentially problematic columns as strings, as already covered, and then convert only the slice of values that actually need conversion (rather than mapping the conversion over the whole column, since that would again trigger an automatic float64 inference).

In [5]: ff = pd.read_csv("H:/tmp.csv", dtype = {c: object for c in "bc"}, converters = {c: lambda x: None if x == "" else int(x, 16) for c in "ad"})
In [6]: ff.dtypes
Out[6]:
a     int64
b    object
c    object
d    object
dtype: object
In [7]: for c in "bc":
   .....:     ff.loc[~pd.isnull(ff[c]), c] = ff[c][~pd.isnull(ff[c])].map(lambda x: int(x, 16))
   .....:
In [8]: ff.dtypes
Out[8]:
a     int64
b    object
c    object
d    object
dtype: object
In [9]: [(ff[c][i], type(ff[c][i])) for c in ff.columns for i in ff.index]
Out[9]:
[(255, numpy.int64),
 (254, numpy.int64),
 (253L, long),
 (nan, float),
 (nan, float),
 (252L, long),
 (None, NoneType),
 (None, NoneType)]

As far as I have been able to determine, at least up to version 0.15.2 there is no way to avoid postprocessing of string values in situations like this.

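On modern pandas (0.24 and later) there is a cleaner way out of this bind: the nullable "Int64" extension dtype can hold 64-bit integers and missing values in the same column without falling back to float64. A sketch, assuming such a pandas version and mirroring the hex data above:

```python
import io
import pandas as pd

raw = "a,b\n0xff,0xfd\n0xfe,\n"

# Read everything as strings first so nothing is inferred as float64, then
# convert each column to the nullable Int64 dtype, which keeps integers
# exact alongside missing values (na_action="ignore" skips the NaN cells).
df = pd.read_csv(io.StringIO(raw), dtype=str)
for col in df.columns:
    df[col] = df[col].map(lambda x: int(x, 16), na_action="ignore").astype("Int64")
```

Missing entries come back as pd.NA rather than NaN, so no value is ever coerced to float.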