Speeding up Pandas to_sql()?

Note: this page is a Chinese–English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41554963/


Tags: python, sql, performance, pandas, import

Asked by user1566200

I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:

df.to_sql('my_table', con, index=False)

It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.

  1. If I try the method in:

    Bulk Insert A Pandas DataFrame Using SQLAlchemy

    then I get a no attribute copy_from error.

  2. If I try the multithreading method from:

    http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html

    then I get a QueuePool limit of size 5 overflow 10 reached, connection timed out error.

Is there any easy way to speed up to_sql() when writing to an MSSQL table, either via BULK COPY or some other method, but entirely from within Python code?

Answered by Babu Arunachalam

I've used ctds to do a bulk insert, and it's a lot faster with SQL Server. In the example below, df is the pandas DataFrame. The column sequence in the DataFrame is identical to the schema for mydb.

import ctds

# Connect to SQL Server with ctds and load every row in a single bulk insert.
conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', df.to_records(index=False).tolist())
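
A hedged aside on data preparation: if df contains missing values, the raw records may carry NaN where the table expects NULL. A minimal sketch, assuming that is the failure mode, converts them to None before the bulk insert:

# Cast to object first so None survives, then swap NaN for None (sent as SQL NULL).
rows = df.astype(object).where(df.notna(), None).to_records(index=False).tolist()
conn.bulk_insert('table', rows)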

Answered by rohit singh

I had the same issue, so I applied SQLAlchemy with fast_executemany.

from sqlalchemy import event, create_engine

engine = create_engine('connection_string_with_database')

@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # switch pyodbc from executemany to fast_executemany

Always make sure that this listener function is defined after the engine variable is created and before any cursor executes.

df.to_sql('table', con=engine, if_exists='append', index=False)  # for reference, see the pandas to_sql documentation
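
A hedged alternative sketch, assuming SQLAlchemy 1.3+ with the mssql+pyodbc dialect (the connection URL below is a placeholder): the dialect accepts the flag directly in create_engine, so no event listener is needed.

from sqlalchemy import create_engine

# Placeholder URL; fast_executemany=True sets the pyodbc flag on every cursor for you.
engine = create_engine('mssql+pyodbc://user:password@my_dsn', fast_executemany=True)
df.to_sql('table', con=engine, if_exists='append', index=False)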

Answered by Yuval Cohen Hadad

In pandas 0.24 you can use method='multi' with a chunksize of 1000, which is the SQL Server limit on rows per INSERT statement:

df.to_sql('my_table', con, index=False, chunksize=1000, method='multi')

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method

New in version 0.24.0.

The parameter method controls the SQL insertion clause used. Possible values are:

None: Uses the standard SQL INSERT clause (one per row).

'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backends if the table contains many columns. For more information check the SQLAlchemy documentation.

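Beyond None and 'multi', the same pandas docs allow method to be a callable with signature (table, conn, keys, data_iter). A minimal sketch, assuming a pyodbc-backed SQLAlchemy connection (hence the '?' placeholders) and borrowing the fast_executemany flag from the earlier answer; insert_with_executemany is a hypothetical name:

# Hedged sketch of a custom insertion callable, per the pandas io docs signature.
def insert_with_executemany(table, conn, keys, data_iter):
    columns = ', '.join(keys)
    placeholders = ', '.join(['?'] * len(keys))  # '?' assumes the pyodbc paramstyle
    sql = f'INSERT INTO {table.name} ({columns}) VALUES ({placeholders})'
    cursor = conn.connection.cursor()  # drop down to the raw DBAPI cursor
    cursor.fast_executemany = True     # pyodbc-only flag, as in the answer above
    cursor.executemany(sql, list(data_iter))

df.to_sql('my_table', con, index=False, method=insert_with_executemany)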