Speeding up Pandas to_sql()?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original question, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/41554963/
Speeding up Pandas to_sql()?
Asked by user1566200
I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:
df.to_sql('my_table', con, index=False)
It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.
If I try the method in Bulk Insert A Pandas DataFrame Using SQLAlchemy, then I get a "no attribute copy_from" error.

If I try the multithreading method from http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html, then I get a "QueuePool limit of size 5 overflow 10 reach, connection timed out" error.
Is there any easy way to speed up to_sql() to an MSSQL table? Either via BULK COPY or some other method, but entirely from within Python code?
Answered by Babu Arunachalam
I've used ctds to do a bulk insert, which is a lot faster with SQL Server. In the example below, df is the pandas DataFrame. The column sequence in the DataFrame is identical to the schema for mydb.
import ctds

# Bulk-insert the DataFrame rows; df's column order must match the target table's schema.
conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', df.to_records(index=False).tolist())
Answered by rohit singh
I had the same issue, so I applied SQLAlchemy with fast_executemany.
from sqlalchemy import event, create_engine

engine = create_engine('connection_string_with_database')

@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # use pyodbc's fast_executemany instead of the default executemany
Always make sure that this listener function is defined after the engine variable is created and before any cursor execute happens.
conn = engine.connect()
df.to_sql('table', con=conn, if_exists='append', index=False)  # for reference, see the pandas to_sql documentation
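A minimal sketch of an alternative route to the same optimization, not from the answer above: with SQLAlchemy 1.3+ and the mssql+pyodbc dialect, fast_executemany can be enabled directly when creating the engine, so no event listener is needed (the connection string below is a placeholder):

from sqlalchemy import create_engine

# Placeholder connection string; fast_executemany=True turns on the same
# pyodbc bulk parameter binding as the event listener above.
engine = create_engine('mssql+pyodbc://user:password@my_dsn?driver=ODBC+Driver+17+for+SQL+Server',
                       fast_executemany=True)
df.to_sql('table', con=engine, if_exists='append', index=False)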
Answered by Yuval Cohen Hadad
In pandas 0.24 you can use method='multi' with a chunksize of 1000, which is SQL Server's limit on rows per INSERT statement:
chunksize=1000, method='multi'
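A minimal sketch of that call, assuming the question's my_table and con; note that SQL Server also caps a single parameterized statement at 2100 parameters, so with wide tables a chunksize smaller than 1000 may be required:

# Sketch only: my_table and con are the question's table name and connection.
df.to_sql('my_table', con, index=False, chunksize=1000, method='multi')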
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
New in version 0.24.0.
The parameter method controls the SQL insertion clause used. Possible values are:
None: Uses standard SQL INSERT clause (one per row).
'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backends if the table contains many columns. For more information check the SQLAlchemy documentation.