What is the optimal chunksize in pandas read_csv to maximize speed?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35235010/



python, pandas, memory, io, chunks

Asked by ??????

I am using a 20GB (compressed) .csv file and I load a couple of columns from it using pandas pd.read_csv() with a chunksize=10,000 parameter.


However, this parameter is completely arbitrary and I wonder whether a simple formula could give me a better chunksize that would speed up the loading of the data.


Any ideas?


Answered by smci

There is no "optimal chunksize" [*]. chunksize only tells you the number of rows per chunk, not the memory-size of a single row, so it's meaningless to try to make a rule-of-thumb on that. ([*] although generally I've only ever seen chunksizes in the range 100..64K)


To get memory size, you'd have to convert that to a memory-size-per-chunk or -per-row...


by looking at your number of columns, their dtypes, and the size of each; use either df.describe(), or else for more in-depth memory usage, by column:


print('df Memory usage by column...')
print(df.memory_usage(index=False, deep=True) / df.shape[0])
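
For a rough rule of thumb, you can turn that per-row figure into a chunksize by dividing a memory budget by it. A minimal sketch, assuming an arbitrary ~200 MB budget per chunk and a hypothetical file name (neither comes from the original answer):

import pandas as pd

CSV_PATH = "big_file.csv.gz"           # hypothetical path - replace with your file
MEMORY_BUDGET_BYTES = 200 * 1024**2    # assumed budget: allow ~200 MB per chunk

# Estimate bytes per row from a small sample (deep=True also counts Python strings).
sample = pd.read_csv(CSV_PATH, nrows=20_000)
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

chunksize = max(1, int(MEMORY_BUDGET_BYTES / bytes_per_row))
print(f"~{bytes_per_row:.0f} bytes/row -> chunksize of roughly {chunksize} rows")
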
  • Make sure you're not blowing out all your free memory while reading the csv: use your OS (Unix top / Windows Task Manager / MacOS Activity Monitor / etc.) to see how much memory is being used.

  • One pitfall with pandas is that missing/NaN values, Python strs and objects take 32 or 48 bytes, instead of the expected 4 bytes for an np.int32 or 1 byte for an np.int8 column. Even one NaN value in an entire column will cause that memory blowup on the entire column, and the pandas.read_csv() dtypes, converters, na_values arguments will not prevent the np.nan, and will ignore the desired dtype(!). A workaround is to manually post-process each chunk before inserting it in the dataframe (see the sketch after this list).

  • And use all the standard pandas read_csv tricks, like:

    • specify dtypes for each column to reduce memory usage - absolutely avoid every entry being read as string, especially long unique strings like datetimes, which is terrible for memory usage
    • specify usecols if you only want to keep a subset of columns
    • use date/time-converters rather than pd.Categorical if you want to reduce from 48 bytes to 1 or 4.
    • read large files in chunks. And if you know upfront what you're going to impute NA/missing values with, if possible do as much of that filling as you process each chunk, instead of at the end. If you can't impute with the final value, you can probably at least replace it with a sentinel value like -1, 999, -Inf, etc., and later do the proper imputation.
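
Putting several of these tricks together, here is a minimal sketch of chunked loading with per-chunk post-processing. The file name, column names, sentinel value and dtypes below are assumptions for illustration, not something given in the question or answer:

import numpy as np
import pandas as pd

CSV_PATH = "big_file.csv.gz"                # hypothetical compressed csv
USECOLS = ["user_id", "country", "score"]   # hypothetical subset of columns to keep

chunks = []
for chunk in pd.read_csv(CSV_PATH, usecols=USECOLS, chunksize=100_000):
    # Post-process each chunk *before* accumulating: fill NAs with a sentinel
    # and downcast, so stray NaNs don't force float64/object dtype on the column.
    chunk["score"] = chunk["score"].fillna(-1).astype(np.int32)
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)

# Convert a repetitive string column to Categorical after the concat
# (roughly 1-4 bytes per row of integer codes instead of ~48 for a Python str).
df["country"] = df["country"].astype("category")
print(df.memory_usage(index=False, deep=True))

The Categorical conversion is done after the concat so that per-chunk category sets can't diverge, and the -1 sentinel can later be replaced by a proper imputation once the whole column has been loaded.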