Speeding up Pandas to_sql()?

Note: this page is a Chinese–English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41554963/


Tags: python, sql, performance, pandas, import

Asked by user1566200

I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:

df.to_sql('my_table', con, index=False)

It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.

  1. If I try the method in:

    Bulk Insert A Pandas DataFrame Using SQLAlchemy

    then I get a no attribute copy_from error.

  2. If I try the multithreading method from:

    http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html

    then I get a QueuePool limit of size 5 overflow 10 reached, connection timed out error.

Is there any easy way to speed up to_sql() when writing to an MSSQL table, either via BULK COPY or some other method, but entirely from within Python code?

Answered by Babu Arunachalam

I've used ctds to do a bulk insert, and it's a lot faster with SQL Server. In the example below, df is the pandas DataFrame. The column sequence in the DataFrame is identical to the schema for mydb.

import ctds

# Connect to SQL Server with ctds and load every row in a single bulk insert.
conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', df.to_records(index=False).tolist())
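
A hedged aside on data preparation: if df contains missing values, the raw records may carry NaN where the table expects NULL. A minimal sketch, assuming that is the failure mode, converts them to None before the bulk insert:

# Cast to object first so None survives, then swap NaN for None (sent as SQL NULL).
rows = df.astype(object).where(df.notna(), None).to_records(index=False).tolist()
conn.bulk_insert('table', rows)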

Answered by rohit singh

I had the same issue, so I applied SQLAlchemy with fast_executemany.

from sqlalchemy import event, create_engine

engine = create_engine('connection_string_with_database')

@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # switch pyodbc from executemany to fast_executemany

Always make sure that this listener function is defined after the engine variable is created and before any cursor executes.

df.to_sql('table', con=engine, if_exists='append', index=False)  # for reference, see the pandas to_sql documentation
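
A hedged alternative sketch, assuming SQLAlchemy 1.3+ with the mssql+pyodbc dialect (the connection URL below is a placeholder): the dialect accepts the flag directly in create_engine, so no event listener is needed.

from sqlalchemy import create_engine

# Placeholder URL; fast_executemany=True sets the pyodbc flag on every cursor for you.
engine = create_engine('mssql+pyodbc://user:password@my_dsn', fast_executemany=True)
df.to_sql('table', con=engine, if_exists='append', index=False)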

Answered by Yuval Cohen Hadad

In pandas 0.24 you can use method='multi' with a chunksize of 1000, which is the SQL Server limit on rows per INSERT statement:

df.to_sql('my_table', con, index=False, chunksize=1000, method='multi')

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method

New in version 0.24.0.

The parameter method controls the SQL insertion clause used. Possible values are:

None: Uses the standard SQL INSERT clause (one per row).

'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backends if the table contains many columns. For more information check the SQLAlchemy documentation.

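Beyond None and 'multi', the same pandas docs allow method to be a callable with signature (table, conn, keys, data_iter). A minimal sketch, assuming a pyodbc-backed SQLAlchemy connection (hence the '?' placeholders) and borrowing the fast_executemany flag from the earlier answer; insert_with_executemany is a hypothetical name:

# Hedged sketch of a custom insertion callable, per the pandas io docs signature.
def insert_with_executemany(table, conn, keys, data_iter):
    columns = ', '.join(keys)
    placeholders = ', '.join(['?'] * len(keys))  # '?' assumes the pyodbc paramstyle
    sql = f'INSERT INTO {table.name} ({columns}) VALUES ({placeholders})'
    cursor = conn.connection.cursor()  # drop down to the raw DBAPI cursor
    cursor.fast_executemany = True     # pyodbc-only flag, as in the answer above
    cursor.executemany(sql, list(data_iter))

df.to_sql('my_table', con, index=False, method=insert_with_executemany)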