Python: mixed types when reading csv files. Causes, fixes and consequences

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/25488675/

Date: 2020-08-18 20:17:43  Source: igfitidea

Mixed types when reading csv files. Causes, fixes and consequences

python csv pandas

Asked by Amelio Vazquez-Reina

What exactly happens when Pandas issues this warning? Should I worry about it?

In [1]: read_csv(path_to_my_file)
/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/parsers.py:1139: 
DtypeWarning: Columns (4,13,29,51,56,57,58,63,87,96) have mixed types. Specify dtype option on import or set low_memory=False.              

  data = self._reader.read(nrows)

I assume that this means that Pandas is unable to infer the type from values on those columns. But if that is the case, what type does Pandas end up using for those columns?

Also, can the type always be recovered after the fact (after getting the warning), or are there cases where I may not be able to recover the original info correctly and should pre-specify the type?

Finally, how exactly does low_memory=False fix the problem?

Accepted answer by Robert Pollak

Revisiting mbatchkarov's link, low_memory is not deprecated. It is now documented:

low_memory: boolean, default True

Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)
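
As a rough illustration of the distinction the documentation draws, the sketch below uses a small made-up in-memory CSV (the column name value and the chunk size of 4 are assumptions for the example). With chunksize, read_csv returns an iterator of DataFrames, whereas low_memory only changes how the parser works internally and still yields a single DataFrame.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: a header line plus ten integer rows.
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

# chunksize makes read_csv return an iterator of DataFrames.
chunk_lengths = []
for chunk in pd.read_csv(StringIO(csv_text), chunksize=4):
    chunk_lengths.append(len(chunk))

# low_memory does not split the result: one DataFrame comes back.
whole = pd.read_csv(StringIO(csv_text), low_memory=True)

print(chunk_lengths)
print(len(whole))
```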

I have asked what "resulting in mixed type inference" means, and chris-b1 answered:

It is deterministic - types are consistently inferred based on what's in the data. That said, the internal chunksize is not a fixed number of rows, but instead bytes, so whether you get a mixed dtype warning or not can feel a bit random.

So, what type does Pandas end up using for those columns?

This is answered by the following self-contained example:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

type(df.loc[524287, '0'])
Out[50]: int

type(df.loc[524288, '0'])
Out[51]: str

The first chunk of the csv data contained only integers, so it was converted to int; the second chunk also contained a string, so all of its entries were kept as strings.
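
To see what such a column looks like without depending on where the parser happens to split the file, the sketch below builds a mixed column by hand (the values are made up; this imitates the result rather than reproducing the warning). Counting the Python type of each element is one way to check whether a column really holds mixed types.

```python
import pandas as pd

# Hand-built stand-in for a column that came back with mixed types:
# two values parsed as int, two left as str.
mixed = pd.Series([0, 1, "2", "a string"], dtype=object, name="0")

# Count how many elements of each Python type the column contains.
type_counts = mixed.map(lambda value: type(value).__name__).value_counts()

print(mixed.dtype)
print(type_counts)
```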

Can the type always be recovered after the fact (after getting the warning)?

I guess re-exporting to csv and re-reading with low_memory=False should do the job.
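
A minimal sketch of that round trip, assuming a hand-built frame that stands in for the result of a mixed-type read (the column name col and its values are hypothetical). One caveat worth checking on real data: a value that was already converted to int in the first read, such as 007 becoming 7, is written back out as 7, so the original text is not restored.

```python
from io import StringIO

import pandas as pd

# Stand-in for a frame read with mixed types: ints and a str in one column.
mixed = pd.DataFrame({"col": pd.Series([7, 8, "a string"], dtype=object)})

# Re-export to csv (an in-memory buffer here) and read it back in one pass.
buffer = StringIO()
mixed.to_csv(buffer, index=False)
buffer.seek(0)
recovered = pd.read_csv(buffer, low_memory=False)

# After the round trip every element has the same Python type.
types_after = set(recovered["col"].map(type))
print(types_after)
print(recovered["col"].tolist())
```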

How exactly does low_memory=False fix the problem?

It reads all of the file before deciding the type, therefore needing more memory.
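
A scaled-down variant of the example above (1,000 rows instead of 1,000,000, which is an assumption made to keep it quick) can be used to check this: with low_memory=False the column is typed in one pass, so the trailing string makes every value come back as str.

```python
from io import StringIO

import pandas as pd

# The first line ("0") becomes the header, as in the example above.
csv_text = "\n".join([str(x) for x in range(1000)] + ["a string"])

df = pd.read_csv(StringIO(csv_text), low_memory=False)

# Collect the distinct Python types found in the column.
value_types = set(df["0"].map(type))
print(value_types)
```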

Answer by mbatchkarov

low_memory is apparently kind of deprecated, so I wouldn't bother with it.

The warning means that some of the values in a column have one dtype (e.g. str), and some have a different dtype (e.g. float). I believe pandas uses the lowest common super type, which in the example I used would be object.

You should check your data, or post some of it here. In particular, look for missing values or inconsistently formatted int/float values. If you are certain your data is correct, then use the dtype parameter to help pandas out.
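
For instance, a per-column dtype mapping can be passed to read_csv. The sketch below uses made-up data and hypothetical column names (id, amount); declaring id as str keeps values such as 007 exactly as written instead of letting them be inferred as integers.

```python
from io import StringIO

import pandas as pd

# Hypothetical file: an id column mixing digit-only and text values.
csv_text = "id,amount\n007,1.5\n008,2\nabc,3.25\n"

# Declare the types up front so no inference is needed for these columns.
df = pd.read_csv(StringIO(csv_text), dtype={"id": str, "amount": float})

print(df["id"].tolist())
print(df["amount"].tolist())
```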
