Python Pandas - Using to_sql to write large data frames in chunks
Declaration: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/24007762/
Asked by Krishan Gupta
I'm using Pandas' to_sql function to write to MySQL, which is timing out due to the large frame size (1M rows, 20 columns).
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
Is there a more official way to chunk through the data and write rows in blocks? I've written my own code, which seems to work. I'd prefer an official solution though. Thanks!
import pandas as pd
import sqlalchemy

def write_to_db(engine, frame, table_name, chunk_size):
    start_index = 0
    end_index = chunk_size if chunk_size < len(frame) else len(frame)
    # Convert NaN to None so the driver writes SQL NULL.
    frame = frame.where(pd.notnull(frame), None)
    if_exists_param = 'replace'
    while start_index != end_index:
        print("Writing rows %s through %s" % (start_index, end_index))
        frame.iloc[start_index:end_index, :].to_sql(con=engine, name=table_name, if_exists=if_exists_param)
        # Only the first chunk replaces the table; the rest append to it.
        if_exists_param = 'append'
        start_index = min(start_index + chunk_size, len(frame))
        end_index = min(end_index + chunk_size, len(frame))

engine = sqlalchemy.create_engine('mysql://...')  # database details omitted
write_to_db(engine, frame, 'retail_pendingcustomers', 20000)
Answered by joris
Update: this functionality has been merged into pandas master and will be released in 0.15 (probably end of September), thanks to @artemyk! See https://github.com/pydata/pandas/pull/8062
So starting from 0.15, you can specify the chunksize argument and, e.g., simply do:
df.to_sql('table', engine, chunksize=20000)
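For context, a minimal end-to-end sketch of the 0.15+ approach (the connection string and the example frame below are placeholders, not from the original post):

import pandas as pd
import sqlalchemy

# Placeholder connection string; substitute real credentials.
engine = sqlalchemy.create_engine('mysql://user:password@host/dbname')

# Stand-in for the 1M-row frame from the question.
df = pd.DataFrame({'a': list(range(100000)), 'b': 1.5})

# With chunksize set, to_sql writes the rows in batches of 20000
# instead of all at once, avoiding the timeout.
df.to_sql('retail_pendingcustomers', engine, if_exists='replace', chunksize=20000)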
Answered by nes
There is a beautiful, idiomatic chunks function provided in the answer to this question.
In your case you can use this function like this:
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l.iloc[i:i+n]

def write_to_db(engine, frame, table_name, chunk_size):
    for idx, chunk in enumerate(chunks(frame, chunk_size)):
        # Replace the table on the first chunk, then append the rest.
        if idx == 0:
            if_exists_param = 'replace'
        else:
            if_exists_param = 'append'
        chunk.to_sql(con=engine, name=table_name, if_exists=if_exists_param)
The only drawback is that it doesn't support slicing the second axis inside the iloc call; a variant that also accepts a column selection is sketched below.
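If you do need the second axis, a small variant works (an editor's sketch, not from the original answer; the cols parameter is hypothetical and takes integer column positions):

def chunks(frame, n, cols=None):
    """Yield successive n-row chunks of frame, optionally restricted
    to the columns at integer positions `cols`."""
    col_indexer = slice(None) if cols is None else cols
    for i in range(0, len(frame), n):
        yield frame.iloc[i:i+n, col_indexer]

# e.g. write only the first three columns, 20000 rows at a time:
# for chunk in chunks(frame, 20000, cols=[0, 1, 2]): ...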

