Optimal chunksize parameter in pandas.DataFrame.to_sql
Original question: http://stackoverflow.com/questions/35202981/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Kevin
Working with a large pandas DataFrame that needs to be dumped into a PostgreSQL table. From what I've read, it's not a good idea to dump it all at once (it was locking up the db), so instead I use the chunksize parameter. The answers here are helpful for workflow, but I'm asking specifically about how the value of chunksize affects performance.
In [5]: df.shape
Out[5]: (24594591, 4)
In [6]: df.to_sql('existing_table',
                  con=engine,
                  index=False,
                  if_exists='append',
                  chunksize=10000)
Is there a recommended default and is there a difference in performance when setting the parameter higher or lower? Assuming I have the memory to support a larger chunksize, will it execute faster?
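[Editor's note] Since the question is really "does a larger or smaller chunksize run faster for my table and database", a small timing sketch can answer it empirically. The connection URL, benchmark table name, sample size, and candidate chunk sizes below are illustrative assumptions, not part of the original question.

import time

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string; replace with your own
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

def time_chunksize(frame: pd.DataFrame, chunksize: int) -> float:
    """Load frame into a throwaway table with the given chunksize and return elapsed seconds."""
    start = time.perf_counter()
    frame.to_sql("chunksize_benchmark", con=engine, index=False,
                 if_exists="replace", chunksize=chunksize)
    return time.perf_counter() - start

# benchmark on a slice rather than all 24.5M rows
sample = df.sample(1_000_000)
for size in (1_000, 10_000, 100_000):
    print(size, time_chunksize(sample, size))

The best value tends to depend on row width, network latency, and the database configuration, so measuring on a representative slice is usually more reliable than a rule of thumb.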
Answered by Mohamed Amin Chairi
I tried it the other way around, going from SQL to CSV, and I noticed that the smaller the chunksize, the quicker the job was done. Adding additional CPUs to the job (multiprocessing) didn't change anything.
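[Editor's note] For reference, a rough sketch of the SQL-to-CSV direction this answer describes, using pandas' chunked read; the query, connection URL, chunk size, and output path are illustrative assumptions, not the answerer's actual code.

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string; replace with your own
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# read_sql with chunksize returns an iterator of DataFrames instead of one large frame
chunks = pd.read_sql("SELECT * FROM existing_table", con=engine, chunksize=5_000)

with open("dump.csv", "w", newline="") as f:
    for i, chunk in enumerate(chunks):
        # write the header only for the first chunk, then append the rest
        chunk.to_csv(f, index=False, header=(i == 0))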

