pandas: iteratively get inferred dataframe types using chunksize

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/15555005/

Get inferred dataframe types iteratively using chunksize

Tags: python, type-conversion, pandas, hdfstore

Question by Zelazny7

How can I use pd.read_csv() to iteratively chunk through a file and retain the dtype and other meta-information as if I read in the entire dataset at once?

I need to read in a dataset that is too large to fit into memory. I would like to import the file using pd.read_csv and then immediately append the chunk into an HDFStore. However, the data type inference knows nothing about subsequent chunks.

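To make the failure mode concrete, here is a minimal sketch of the naive pattern (the file name 'data.csv', store name 'store.h5', and key 'df' are illustrative assumptions, not from the original post):

import pandas as pd

# Naive approach: append each chunk as it is read. HDFStore creates the
# table's column types from the first chunk, so a later chunk whose
# inferred dtypes differ (e.g. float64 where the table has int64) makes
# store.append() raise an exception.
with pd.HDFStore('store.h5') as store:
    for chunk in pd.read_csv('data.csv', chunksize=500):
        store.append('df', chunk)  # fails if a later chunk's dtypes differ
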
If the first chunk stored in the table contains only int and a subsequent chunk contains a float, an exception will be raised. So I need to first iterate through the dataframe using read_csv and retain the highest inferred type. In addition, for object types, I need to retain the maximum length, as these will be stored as strings in the table.

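As a side note on "highest inferred type": numpy dtype objects compare by safe castability, which is what makes this computable with an ordinary max(). A quick illustration (not part of the original post):

import numpy as np

# Safe-cast ordering: an int64 column that later sees floats must be
# promoted to float64; a column mixing numbers and strings ends up object.
assert np.dtype('int64') < np.dtype('float64') < np.dtype('object')
assert max(np.dtype('int64'), np.dtype('float64')) == np.dtype('float64')
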
Is there a pandonic way of retaining only this information without reading in the entire dataset?

Answer by Zelazny7

I didn't expect it to be this intuitive, otherwise I wouldn't have posted the question. But once again, pandas makes it a breeze. I'm keeping the question up, as this information might be useful to others working with large data:

In [1]: import pandas as pd

In [2]: chunker = pd.read_csv('DATASET.csv', chunksize=500, header=0)

# Store the dtypes of each chunk in a list and convert it to a dataframe:

In [3]: dtypes = pd.DataFrame([chunk.dtypes for chunk in chunker])

In [4]: dtypes.values[:5]
Out[4]:
array([[int64, int64, int64, object, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64]], dtype=object)

# Very cool that I can take the max of these data types and it will preserve the hierarchy:

In [5]: dtypes.max().values
Out[5]: array([int64, int64, int64, object, int64, int64, int64, int64], dtype=object)

# I can now store the above in a dictionary:

In [6]: types = dtypes.max().to_dict()

# And pass it to pd.read_csv for the second run:

In [7]: chunker = pd.read_csv('DATASET.csv', dtype=types, chunksize=500, header=0)
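
The answer above handles dtypes but not the other half of the question: the maximum string length of object columns, which HDFStore needs via the min_itemsize argument of append(). A minimal end-to-end sketch combining both passes (the file name 'DATASET.csv', store name 'store.h5', and key 'df' are assumptions):

import pandas as pd

# First pass: collect each chunk's inferred dtypes and, for object
# columns, the longest string seen in any chunk.
dtypes = []
max_lens = {}
for chunk in pd.read_csv('DATASET.csv', chunksize=500, header=0):
    dtypes.append(chunk.dtypes)
    for col in chunk.select_dtypes(include='object'):
        longest = chunk[col].astype(str).str.len().max()
        max_lens[col] = max(max_lens.get(col, 0), int(longest))

# Column-wise max promotes each column to the widest dtype seen.
types = pd.DataFrame(dtypes).max().to_dict()

# Second pass: re-read with the promoted dtypes and append every chunk,
# sizing the string columns so no later chunk overflows them.
with pd.HDFStore('store.h5') as store:
    for chunk in pd.read_csv('DATASET.csv', dtype=types, chunksize=500, header=0):
        store.append('df', chunk, min_itemsize=max_lens)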