What is the optimal chunksize in pandas read_csv to maximize speed?
Original question: http://stackoverflow.com/questions/35235010/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by ??????
I am using a 20GB (compressed) .csv file and I load a couple of columns from it using pandas pd.read_csv() with a chunksize=10,000 parameter.
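For reference, a minimal sketch of the kind of chunked load described above; the file path and column names are placeholders, not the real ones:

import pandas as pd

# Placeholder path and column names standing in for the real 20GB file.
total_rows = 0
for chunk in pd.read_csv("data.csv.gz", usecols=["col_a", "col_b"], chunksize=10_000):
    total_rows += len(chunk)  # replace with the real per-chunk work
print(total_rows)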
However, this parameter is completely arbitrary and I wonder whether a simple formula could give me a better chunksize that would speed up the loading of the data.
Any ideas?
Answered by smci
There is no "optimal chunksize" [*]. Because chunksize
only tells you the number of rowsper chunk, not the memory-size of a single row, hence it's meaningless to try to make a rule-of-thumb on that. ([*] although generally I've only ever seen chunksizes in the range 100..64K)
To get memory size, you'd have to convert that to a memory-size-per-chunk or -per-row...
by looking at your number of columns, their dtypes, and the size of each; use either df.describe(), or else for more in-depth memory usage, by column:
print('df Memory usage by column...')
print(df.memory_usage(index=False, deep=True) / df.shape[0])
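Building on that, one way to turn the per-row figure into a chunksize is to read a small sample, measure its memory footprint per row, and divide a memory budget you choose by that number. A rough sketch, assuming a hypothetical file path and an arbitrary 100MB-per-chunk budget:

import pandas as pd

CSV_PATH = "big_file.csv.gz"          # hypothetical path
TARGET_CHUNK_BYTES = 100 * 1024**2    # self-chosen ~100MB in-memory budget per chunk

# Measure the real memory footprint per row on a small sample.
sample = pd.read_csv(CSV_PATH, nrows=10_000)
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

# Derive a row count whose in-memory size is roughly the chosen budget.
chunksize = max(1, int(TARGET_CHUNK_BYTES / bytes_per_row))
print(f"~{bytes_per_row:.0f} bytes/row -> chunksize={chunksize}")

Keep in mind the NaN pitfall described below: a sample that happens to contain few missing values can underestimate the footprint of later chunks.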
- Make sure you're not blowing out all your free memory while reading the csv: use your OS (Unix top / Windows Task Manager / MacOS Activity Monitor / etc.) to see how much memory is being used.
- One pitfall with pandas is that missing/NaN values, Python strs and objects take 32 or 48 bytes, instead of the expected 4 bytes for np.int32 or 1 byte for np.int8 column. Even one NaN value in an entire column will cause that memory blowup on the entire column, and the pandas.read_csv() dtypes, converters, na_values arguments will not prevent the np.nan, and will ignore the desired dtype(!). A workaround is to manually post-process each chunk before inserting in the dataframe.
- And use all the standard pandas read_csv tricks, like (see the sketch after this list):
  - specify dtypes for each column to reduce memory usage - absolutely avoid every entry being read as string, especially long unique strings like datetimes, which is terrible for memory usage
  - specify usecols if you only want to keep a subset of columns
  - use date/time-converters rather than pd.Categorical if you want to reduce from 48 bytes to 1 or 4.
  - read large files in chunks. And if you know upfront what you're going to impute NA/missing values with, if possible do as much of that filling as you process each chunk, instead of at the end. If you can't impute with the final value, you probably at least can replace with a sentinel value like -1, 999, -Inf etc. and later you can do the proper imputation.
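As referenced in the list above, a combined sketch of those tricks (usecols, dtypes, date parsing, chunked reading, and per-chunk sentinel filling). The path, column names, dtypes and sentinel are placeholders, and it assumes "value" is an integer-like column that contains missing entries:

import pandas as pd

# All names below are placeholders -- substitute your real file and columns.
CSV_PATH = "big_file.csv.gz"
USECOLS = ["user_id", "event_time", "value"]
DTYPES = {"user_id": "int32"}            # narrow dtypes for NaN-free columns

pieces = []
for chunk in pd.read_csv(
    CSV_PATH,
    usecols=USECOLS,                     # keep only the columns you need
    dtype=DTYPES,
    parse_dates=["event_time"],          # avoid storing datetimes as long strings
    chunksize=50_000,                    # counts rows, not bytes
):
    # Post-process each chunk before keeping it: "value" arrives as float64
    # because of NaN, so fill with a sentinel now and downcast, rather than
    # doing one big pass at the end.
    chunk["value"] = chunk["value"].fillna(-1).astype("int32")
    pieces.append(chunk)

df = pd.concat(pieces, ignore_index=True)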