Optimal chunksize parameter in pandas.DataFrame.to_sql
Original question: http://stackoverflow.com/questions/35202981/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Kevin
Working with a large pandas DataFrame that needs to be dumped into a PostgreSQL table. From what I've read, it's not a good idea to dump it all at once (it was locking up the db), so instead I use the chunksize parameter. The answers here are helpful for workflow, but I'm asking specifically about how the value of chunksize affects performance.
In [5]: df.shape
Out[5]: (24594591, 4)
In [6]: df.to_sql('existing_table',
                  con=engine,
                  index=False,
                  if_exists='append',
                  chunksize=10000)
Is there a recommended default and is there a difference in performance when setting the parameter higher or lower? Assuming I have the memory to support a larger chunksize, will it execute faster?
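[Editor's note] Since the question is really "does a larger or smaller chunksize run faster for my table and database", a small timing sketch can answer it empirically. The connection URL, benchmark table name, sample size, and candidate chunk sizes below are illustrative assumptions, not part of the original question.

import time

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string; replace with your own
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

def time_chunksize(frame: pd.DataFrame, chunksize: int) -> float:
    """Load frame into a throwaway table with the given chunksize and return elapsed seconds."""
    start = time.perf_counter()
    frame.to_sql("chunksize_benchmark", con=engine, index=False,
                 if_exists="replace", chunksize=chunksize)
    return time.perf_counter() - start

# benchmark on a slice rather than all 24.5M rows
sample = df.sample(1_000_000)
for size in (1_000, 10_000, 100_000):
    print(size, time_chunksize(sample, size))

The best value tends to depend on row width, network latency, and the database configuration, so measuring on a representative slice is usually more reliable than a rule of thumb.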
Answered by Mohamed Amin Chairi
I tried it the other way around, going from SQL to CSV, and I noticed that the smaller the chunksize, the quicker the job was done. Adding additional CPUs to the job (multiprocessing) didn't change anything.
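[Editor's note] For reference, a rough sketch of the SQL-to-CSV direction this answer describes, using pandas' chunked read; the query, connection URL, chunk size, and output path are illustrative assumptions, not the answerer's actual code.

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string; replace with your own
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# read_sql with chunksize returns an iterator of DataFrames instead of one large frame
chunks = pd.read_sql("SELECT * FROM existing_table", con=engine, chunksize=5_000)

with open("dump.csv", "w", newline="") as f:
    for i, chunk in enumerate(chunks):
        # write the header only for the first chunk, then append the rest
        chunk.to_csv(f, index=False, header=(i == 0))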

