Python: speeding up pandas.DataFrame.to_sql with pyODBC's fast_executemany
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/48006551/
Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC
Asked by J.K.
I would like to send a large pandas.DataFrame
to a remote server running MS SQL. The way I do it now is by converting a data_frame
object to a list of tuples and then sending it away with pyODBC's executemany()
function. It goes something like this:
import pyodbc as pdb
list_of_tuples = convert_df(data_frame)
connection = pdb.connect(cnxn_str)
cursor = connection.cursor()
cursor.fast_executemany = True
cursor.executemany(sql_statement, list_of_tuples)
connection.commit()
cursor.close()
connection.close()
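(The convert_df helper is not shown in the question; a plausible, purely hypothetical sketch of such a helper might be:)
# hypothetical helper - the question does not show convert_df's body
def convert_df(data_frame):
    # one tuple per row, column values in order, index excluded
    return list(data_frame.itertuples(index=False, name=None))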
I then started to wonder if things can be sped up (or at least made more readable) by using the data_frame.to_sql()
method. I came up with the following solution:
import sqlalchemy as sa
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % cnxn_str)
data_frame.to_sql(table_name, engine, index=False)
Now the code is more readable, but the upload is at least 150 times slower...
Is there a way to flip the fast_executemany
flag when using SQLAlchemy?
I am using pandas-0.20.3, pyODBC-4.0.21 and sqlalchemy-1.1.13.
Accepted answer by J.K.
After contacting the developers of SQLAlchemy, a way to solve this problem has emerged. Many thanks to them for the great work!
One has to use a cursor execution event and check if the executemany
flag has been raised. If that is indeed the case, switch the fast_executemany
option on. For example:
from sqlalchemy import event
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
if executemany:
cursor.fast_executemany = True
More information on execution events can be found here.
UPDATE: Support for pyodbc's fast_executemany
was added in SQLAlchemy 1.3.0, so this hack is no longer necessary.
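For SQLAlchemy 1.3.0 and later, a minimal sketch of the built-in option looks like the following (the connection string details are placeholders to adapt to your own server):
import sqlalchemy as sa
from urllib.parse import quote_plus

# placeholder DSN-less connection string - adjust driver/server/credentials to your setup
cnxn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_server;DATABASE=my_db;UID=my_user;PWD=my_password"
)
engine = sa.create_engine(
    "mssql+pyodbc:///?odbc_connect=" + quote_plus(cnxn_str),
    fast_executemany=True,  # available since SQLAlchemy 1.3.0
)
# data_frame.to_sql(table_name, engine, index=False)  # now uses pyodbc's fast path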
Answered by hetspookjee
EDIT (2019-03-08): Gord Thompson commented below with good news from the SQLAlchemy update logs: since SQLAlchemy 1.3.0, released 2019-03-04, SQLAlchemy now supports engine = create_engine(sqlalchemy_url, fast_executemany=True)
for the mssql+pyodbc
dialect. I.e., it is no longer necessary to define a function and use @event.listens_for(engine, 'before_cursor_execute').
This means the function below can be removed and only the flag needs to be set in the create_engine statement - while still retaining the speed-up.
Original Post:
Just made an account to post this. I wanted to comment beneath the above thread as it's a follow-up to the already provided answer. The solution above worked for me with the Version 17 SQL driver, writing to a Microsoft SQL storage from an Ubuntu-based install.
The complete code I used to speed things up significantly (talking >100x speed-up) is below. This is a turn-key snippet, provided that you alter the connection string with your relevant details. To the poster above, thank you very much for the solution, as I had already been looking for this for quite some time.
import pandas as pd
import numpy as np
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus
conn = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=IP_ADDRESS;DATABASE=DataLake;UID=USER;PWD=PASS"
quoted = quote_plus(conn)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = create_engine(new_con)
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
print("FUNC call")
if executemany:
cursor.fast_executemany = True
table_name = 'fast_executemany_test'
df = pd.DataFrame(np.random.random((10**4, 100)))
s = time.time()
df.to_sql(table_name, engine, if_exists = 'replace', chunksize = None)
print(time.time() - s)
Based on the comments below I wanted to take some time to explain some limitations of the pandas to_sql implementation and the way the query is handled. As far as I know, there are two things that might cause the MemoryError to be raised:
1) Assuming you're writing to a remote SQL storage. When you try to write a large pandas DataFrame with the to_sql method, it converts the entire dataframe into a list of values. This transformation takes up way more RAM than the original DataFrame does (on top of it, as the old DataFrame still remains present in RAM). This list is provided to the final executemany call for your ODBC connector. I think the ODBC connector has some trouble handling such large queries. A way to solve this is to provide the to_sql method a chunksize argument (10**5 seems to be around optimal, giving about 600 mbit/s (!) write speeds on a 2 CPU 7GB RAM MSSQL Storage application from Azure - can't recommend Azure btw). So the first limitation, being the query size, can be circumvented by providing a chunksize argument. However, this won't enable you to write a DataFrame of size 10**7 or larger (at least not on the VM I am working with, which has ~55 GB RAM); that is issue nr 2.
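For example, a minimal sketch of the chunksize workaround for the first limitation (the table name is illustrative and engine is assumed to be the SQLAlchemy engine created earlier):
# write in batches of 100,000 rows instead of one huge executemany call
df.to_sql(
    "my_large_table",   # hypothetical table name
    engine,
    if_exists="append",
    index=False,
    chunksize=10**5,
)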
That second limitation can be circumvented by breaking up the DataFrame with np.split
(into DataFrame chunks of 10**6 rows), which can then be written away iteratively. I will try to make a pull request when I have a solution ready for the to_sql
method in the core of pandas itself, so you won't have to do this pre-breaking up every time. Anyhow, I ended up writing a function similar (not turn-key) to the following:
import math

import pandas as pd
import numpy as np

def write_df_to_sql(df, **kwargs):
    """Write a DataFrame to SQL in chunks of roughly 10**6 rows."""
    # np.array_split (unlike np.split) tolerates row counts that do not divide evenly
    n_chunks = max(1, math.ceil(df.shape[0] / 10**6))
    for chunk in np.array_split(df, n_chunks):
        chunk.to_sql(**kwargs)
    return True
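A hypothetical call would simply forward DataFrame.to_sql keyword arguments, e.g.:
# hypothetical usage - `big_df` and `engine` are assumed to exist
write_df_to_sql(big_df, name="my_large_table", con=engine, if_exists="append", index=False)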
A more complete example of the above snippet can be viewed here: https://gitlab.com/timelord/timelord/blob/master/timelord/utils/connector.py
It's a class I wrote that incorporates the patch and eases some of the necessary overhead that comes with setting up connections with SQL. I still have to write some documentation. I was also planning on contributing the patch to pandas itself, but haven't yet found a nice way to do so.
I hope this helps.
Answered by Emmanuel
I ran into the same problem, but using PostgreSQL. They have just released pandas version 0.24.0, and there is a new parameter in the to_sql
function called method
which solved my problem.
from sqlalchemy import create_engine
engine = create_engine(your_options)
data_frame.to_sql(table_name, engine, method="multi")
Upload speed is 100x faster for me.
I also recommend setting the chunksize
parameter if you are going to send lots of data.
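A sketch combining both parameters (the values are illustrative; tune chunksize to your row width and your server's parameter limits):
data_frame.to_sql(
    table_name,
    engine,
    index=False,
    method="multi",   # multi-values INSERT statements (pandas >= 0.24.0)
    chunksize=1000,   # rows per INSERT statement
)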
Answered by Pylander
I just wanted to post this full example as an additional, high-performance option for those who can use the new turbodbc library: http://turbodbc.readthedocs.io/en/latest/
There clearly are many options in flux between pandas .to_sql(), triggering fast_executemany through sqlalchemy, using pyodbc directly with tuples/lists/etc., or even trying BULK UPLOAD with flat files.
Hopefully, the following might make life a bit more pleasant as functionality evolves in the current pandas project or includes something like turbodbc integration in the future.
import pandas as pd
import numpy as np
from turbodbc import connect, make_options
from io import StringIO
test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''
df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])
options = make_options(parameter_sets_to_buffer=1000)
conn = connect(driver='{SQL Server}', server='server_nm', database='db_nm', turbodbc_options=options)
cursor = conn.cursor()
test_query = '''DROP TABLE IF EXISTS [db_name].[schema].[test]
CREATE TABLE [db_name].[schema].[test]
(
id int NULL,
transaction_dt datetime NULL,
units int NULL,
measures float NULL
)
INSERT INTO [db_name].[schema].[test] (id,transaction_dt,units,measures)
VALUES (?,?,?,?) '''
cursor.executemanycolumns(test_query, [df_test['id'].values, df_test['transaction_dt'].values, df_test['units'].values, df_test['measures'].values])
conn.commit()
turbodbc should be VERY fast in many use cases (particularly with numpy arrays). Please observe how straightforward it is to pass the underlying numpy arrays from the dataframe columns as parameters to the query directly. I also believe this helps prevent the creation of intermediate objects that spike memory consumption excessively. Hope this is helpful!
Answered by Ilja Everilä
It seems that Pandas 0.23.0 and 0.24.0 use multi values inserts with PyODBC, which prevents fast executemany from helping – a single INSERT ... VALUES ...
statement is emitted per chunk. The multi values insert chunks are an improvement over the old slow executemany default, but at least in simple tests the fast executemany method still prevails, not to mention no need for manual chunksize
calculations, as is required with multi values inserts. Forcing the old behaviour can be done by monkeypatching, if no configuration option is provided in the future:
import pandas.io.sql
def insert_statement(self, data, conn):
return self.table.insert(), data
pandas.io.sql.SQLTable.insert_statement = insert_statement
The future is here and at least in the master
branch the insert method can be controlled using the keyword argument method=
of to_sql()
. It defaults to None
, which forces the executemany method. Passing method='multi'
results in using the multi values insert. It can even be used to implement DBMS specific approaches, such as Postgresql COPY
.
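As a sketch of such a DBMS-specific approach, the pattern documented for pandas' method= callable looks roughly like this for PostgreSQL COPY (assumes psycopg2; the callable receives a pandas SQLTable, a SQLAlchemy connection, the column names, and an iterator of row tuples):
import csv
from io import StringIO

def psql_copy_insert(table, conn, keys, data_iter):
    """Sketch of a custom to_sql insertion method using PostgreSQL COPY."""
    dbapi_conn = conn.connection          # raw DBAPI connection behind SQLAlchemy
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ", ".join('"{}"'.format(k) for k in keys)
        table_name = "{}.{}".format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert("COPY {} ({}) FROM STDIN WITH CSV".format(table_name, columns), buf)

# hypothetical usage:
# df.to_sql("my_table", engine, method=psql_copy_insert, index=False)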
Answered by erickfis
As pointed out by @Pylander
Turbodbc is the best choice for data ingestion, by far!
I got so excited about it that I wrote a 'blog' on it on my GitHub and Medium: please check https://medium.com/@erickfis/etl-process-with-turbodbc-1d19ed71510e
for a working example and comparison with pandas.to_sql
Long story short,
with turbodbc I've got 10000 lines (77 columns) in 3 seconds
with pandas.to_sql I've got the same 10000 lines (77 columns) in 198 seconds...
And here is what I'm doing in full detail
The imports:
import sqlalchemy
import pandas as pd
import numpy as np
import turbodbc
import time
Load and treat some data - Substitute my sample.pkl for yours:
df = pd.read_pickle('sample.pkl')
df.columns = df.columns.str.strip() # remove white spaces around column names
df = df.applymap(str.strip) # remove white spaces around values
df = df.replace('', np.nan) # map nans, to drop NAs rows and columns later
df = df.dropna(how='all', axis=0) # remove rows containing only NAs
df = df.dropna(how='all', axis=1) # remove columns containing only NAs
df = df.replace(np.nan, 'NA') # turbodbc hates null values...
Create the table using sqlAlchemy
Unfortunately, turbodbc requires a lot of overhead and a lot of manual SQL labor, both for creating the tables and for inserting data into them.
Fortunately, Python is pure joy and we can automate this process of writing sql code.
The first step is creating the table which will receive our data. However, creating the table by manually writing SQL code can be problematic if your table has more than a few columns. In my case, very often the tables have 240 columns!
This is where sqlAlchemy and pandas can still help us: pandas is bad at writing a large number of rows (10000 in this example), but what about just 6 rows, the head of the table? This way, we automate the process of creating the tables.
Create sqlAlchemy connection:
mydb = 'someDB'
def make_con(db):
"""Connect to a specified db."""
database_connection = sqlalchemy.create_engine(
'mssql+pymssql://{0}:{1}@{2}/{3}'.format(
myuser, mypassword,
myhost, db
)
)
return database_connection
pd_connection = make_con(mydb)
Create table on SQL Server
Using pandas + sqlAlchemy, but just to prepare room for turbodbc as previously mentioned. Please note the df.head() here: we are using pandas + sqlAlchemy to insert only 6 rows of our data. This will run pretty fast and is being done to automate the table creation.
table = 'testing'
df.head().to_sql(table, con=pd_connection, index=False)
Now that the table is already in place, let's get serious here.
Turbodbc connection:
def turbo_conn(mydb):
"""Connect to a specified db - turbo."""
database_connection = turbodbc.connect(
driver='ODBC Driver 17 for SQL Server',
server=myhost,
database=mydb,
uid=myuser,
pwd=mypassword
)
return database_connection
Preparing SQL commands and data for turbodbc. Let's get creative and automate this code generation:
def turbo_write(mydb, df, table):
    """Use turbodbc to insert data into sql."""
    start = time.time()
    # a turbodbc connection is needed below; open one via the turbo_conn helper defined above
    connection = turbo_conn(mydb)
# preparing columns
colunas = '('
colunas += ', '.join(df.columns)
colunas += ')'
# preparing value place holders
val_place_holder = ['?' for col in df.columns]
sql_val = '('
sql_val += ', '.join(val_place_holder)
sql_val += ')'
# writing sql query for turbodbc
sql = f"""
INSERT INTO {mydb}.dbo.{table} {colunas}
VALUES {sql_val}
"""
# writing array of values for turbodbc
valores_df = [df[col].values for col in df.columns]
# cleans the previous head insert
with connection.cursor() as cursor:
cursor.execute(f"delete from {mydb}.dbo.{table}")
connection.commit()
# inserts data, for real
with connection.cursor() as cursor:
try:
cursor.executemanycolumns(sql, valores_df)
connection.commit()
except Exception:
connection.rollback()
print('something went wrong')
stop = time.time() - start
return print(f'finished in {stop} seconds')
Writing data using turbodbc - I've got 10000 lines (77 columns) in 3 seconds:
turbo_write(mydb, df.sample(10000), table)
Pandas method comparison - I've got the same 10000 lines (77 columns) in 198 seconds…
table = 'pd_testing'
def pandas_comparisson(df, table):
"""Load data using pandas."""
start = time.time()
df.to_sql(table, con=pd_connection, index=False)
stop = time.time() - start
return print(f'finished in {stop} seconds')
pandas_comparisson(df.sample(10000), table)
Environment and conditions
Python 3.6.7 :: Anaconda, Inc.
TURBODBC version 3.0.0
sqlAlchemy version 1.2.12
pandas version 0.23.4
Microsoft SQL Server 2014
user with bulk operations privileges
Please check https://erickfis.github.io/loose-code/ for updates in this code!
Answered by Gord Thompson
SQL Server INSERT performance: pyodbc vs. turbodbc
When using to_sql
to upload a pandas DataFrame to SQL Server, turbodbc will definitely be faster than pyodbc without fast_executemany
. However, with fast_executemany
enabled for pyodbc, both approaches yield essentially the same performance.
Test environments:
[venv1_pyodbc]
pyodbc 2.0.25
[venv2_turbodbc]
turbodbc 3.0.0
sqlalchemy-turbodbc 0.1.0
[common to both]
Python 3.6.4 64-bit on Windows
SQLAlchemy 1.3.0b1
pandas 0.23.4
numpy 1.15.4
Test code:
# imports needed for this snippet
import time
import pandas as pd
from sqlalchemy import create_engine

# for pyodbc
engine = create_engine('mssql+pyodbc://sa:whatever@SQL_panorama', fast_executemany=True)
# for turbodbc
# engine = create_engine('mssql+turbodbc://sa:whatever@SQL_panorama')
# test data
num_rows = 10000
num_cols = 100
df = pd.DataFrame(
[[f'row{x:04}col{y:03}' for y in range(num_cols)] for x in range(num_rows)],
columns=[f'col{y:03}' for y in range(num_cols)]
)
t0 = time.time()
df.to_sql("sqlalchemy_test", engine, if_exists='replace', index=None)
print(f"pandas wrote {num_rows} rows in {(time.time() - t0):0.1f} seconds")
Tests were run twelve (12) times for each environment, discarding the single best and worst times for each. Results (in seconds):
rank pyodbc turbodbc
---- ------ --------
1 22.8 27.5
2 23.4 28.1
3 24.6 28.2
4 25.2 28.5
5 25.7 29.3
6 26.9 29.9
7 27.0 31.4
8 30.1 32.1
9 33.6 32.5
10 39.8 32.9
---- ------ --------
average 27.9 30.0
Answered by Azamat Bekkhozha
Just wanted to add to @J.K.'s answer.
If you are using this approach:
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
if executemany:
cursor.fast_executemany = True
And you are getting this error:
"sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('HY010', '[HY010] [Microsoft][SQL Server Native Client 11.0]Function sequence error (0) (SQLParamData)') [SQL: 'INSERT INTO ... (...) VALUES (?, ?)'] [parameters: ((..., ...), (..., ...)] (Background on this error at: http://sqlalche.me/e/dbapi)"
"sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('HY010', '[HY010] [Microsoft][SQL Server Native Client 11.0] 函数序列错误 (0) (SQLParamData)') [SQL: 'INSERT INTO .. . (...) VALUES (?, ?)'] [参数: ((..., ...), (..., ...)] (此错误的背景: http://sqlalche .me/e/dbapi)"
Encode your string values like this: 'yourStringValue'.encode('ascii')
This will solve your problem.
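A sketch of applying that suggestion to every string column of a DataFrame before the upload (the column-selection logic here is my own assumption, not part of the original answer):
# encode string values to ASCII bytes, mirroring 'yourStringValue'.encode('ascii')
for col in df.select_dtypes(include=["object"]).columns:
    df[col] = df[col].map(lambda v: v.encode("ascii") if isinstance(v, str) else v)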
Answered by Rajat Seth
I just modified the engine line, which helped me speed up the insertion 100 times.
Old Code -
import json
import maya
import time
import pandas
import pyodbc
import pandas as pd
from sqlalchemy import create_engine
retry_count = 0
retry_flag = True
hostInfoDf = pandas.read_excel('test.xlsx', sheet_name='test')
print("Read Ok")
engine = create_engine("mssql+pyodbc://server_name/db_name?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server")
while retry_flag and retry_count < 5:
try:
df.to_sql("table_name",con=engine,if_exists="replace",index=False,chunksize=5000,schema="dbo")
retry_flag = False
except:
retry_count = retry_count + 1
time.sleep(30)
Modified engine line -
From -
engine = create_engine("mssql+pyodbc://server_name/db_name?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server")
to -
engine = create_engine("mssql+pyodbc://server_name/db_name?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server", fast_executemany=True)
Ask me any query related to Python-to-SQL connectivity, and I will be happy to help you.