python pandas to_sql with sqlalchemy: how to speed up exporting to MS SQL?

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/29706278/

Asked by Pythonista anonymous
I have a dataframe with about 155,000 rows and 12 columns. If I export it to csv with dataframe.to_csv, the output is an 11 MB file (which is produced instantly).
If, however, I export to a Microsoft SQL Server with the to_sql method, it takes between 5 and 6 minutes! No columns are text: only int, float, bool and dates. I have seen cases where ODBC drivers set nvarchar(max) and this slows down the data transfer, but it cannot be the case here.
Any suggestions on how to speed up the export process? Taking 6 minutes to export 11 MB of data makes the ODBC connection practically unusable.
Thanks!
My code is:
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"
engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)
conn = engine.connect()
metadata = MetaData(conn)
my_data_frame.to_sql(TableName, engine)
Answered by firelynx
The DataFrame.to_sql method generates INSERT statements for your ODBC connector, which the ODBC connector then treats as regular inserts.

When this is slow, it is not the fault of pandas.

Saving the output of the DataFrame.to_sql method to a file and then replaying that file over an ODBC connector would take the same amount of time.
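One quick way to check this: create the engine with echo=True so SQLAlchemy logs every statement it sends, and watch to_sql emit plain INSERTs. A small sketch, assuming my_data_frame from the question; the table name here is illustrative only.

# echo=True makes SQLAlchemy log each statement, so you can see that
# to_sql sends ordinary INSERTs (via executemany) to the driver.
from sqlalchemy import create_engine

engine_echo = create_engine('mssql+pyodbc://myserver/mydatabase', echo=True)
my_data_frame.head(100).to_sql('mytable_echo_test', engine_echo, index=False)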
The proper way of bulk importing data into a database is to generate a CSV file and then use a load command, which in the MS flavour of SQL databases is called BULK INSERT.
For example:
BULK INSERT mydatabase.myschema.mytable
FROM 'mydatadump.csv';
The syntax reference is as follows:
BULK INSERT
   [ database_name . [ schema_name ] . | schema_name . ] [ table_name | view_name ]
      FROM 'data_file'
      [ WITH
         (
            [ [ , ] BATCHSIZE = batch_size ]
            [ [ , ] CHECK_CONSTRAINTS ]
            [ [ , ] CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
            [ [ , ] DATAFILETYPE =
               { 'char' | 'native' | 'widechar' | 'widenative' } ]
            [ [ , ] FIELDTERMINATOR = 'field_terminator' ]
            [ [ , ] FIRSTROW = first_row ]
            [ [ , ] FIRE_TRIGGERS ]
            [ [ , ] FORMATFILE = 'format_file_path' ]
            [ [ , ] KEEPIDENTITY ]
            [ [ , ] KEEPNULLS ]
            [ [ , ] KILOBYTES_PER_BATCH = kilobytes_per_batch ]
            [ [ , ] LASTROW = last_row ]
            [ [ , ] MAXERRORS = max_errors ]
            [ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ]
            [ [ , ] ROWS_PER_BATCH = rows_per_batch ]
            [ [ , ] ROWTERMINATOR = 'row_terminator' ]
            [ [ , ] TABLOCK ]
            [ [ , ] ERRORFILE = 'file_name' ]
         )]
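Illustratively, such a workflow might look like this from Python. This is a minimal sketch, assuming the dump file is written somewhere the SQL Server service account can read and that the target table already exists; the path, table name and options are assumptions, not part of the original answer.

# Sketch: dump the DataFrame to a flat file, then have SQL Server load it in bulk.
# Assumes my_data_frame and the mssql+pyodbc connection from the question.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('mssql+pyodbc://myserver/mydatabase')

# 1. Write a tab-delimited dump with no header or index.
my_data_frame.to_csv(r'C:\bulk\mydatadump.csv', sep='\t', header=False, index=False)

# 2. Load it in one bulk operation; engine.begin() commits on exit.
with engine.begin() as conn:
    conn.execute(text(r"""
        BULK INSERT mydatabase.dbo.mytable
        FROM 'C:\bulk\mydatadump.csv'
        WITH (FIELDTERMINATOR = '\t', TABLOCK)
    """))
# If the load complains about row endings, adjust ROWTERMINATOR
# (e.g. '0x0a' for LF-only files).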
Answered by citynorman
You can use d6tstack, which has fast pandas-to-SQL functionality because it uses native DB import commands. It supports MS SQL, Postgres and MySQL.
uri_psql = 'postgresql+psycopg2://usr:pwd@localhost/db'
d6tstack.utils.pd_to_psql(df, uri_psql, 'table')
uri_mssql = 'mssql+pymssql://usr:pwd@localhost/db'
d6tstack.utils.pd_to_mssql(df, uri_mssql, 'table', 'schema') # experimental
It is also useful for importing multiple CSVs with data schema changes and/or preprocessing with pandas before writing to the db; see further down in the examples notebook.
d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'),
    apply_after_read=apply_fun).to_psql_combine(uri_psql, 'table')
Answered by Roee Anuar
Based on this answer - Aseem.
You can use the copy_from method to simulate a bulk load with a cursor object. This was tested on Postgres; try it with your DB:
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
from io import StringIO   # on Python 2: from StringIO import StringIO

ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)  # don't forget to add a password if needed

my_data_frame.head(0).to_sql(TableName, engine, if_exists='replace', index=False)  # create an empty table - just for structure
conn = engine.raw_connection()
cur = conn.cursor()
output = StringIO()
my_data_frame.to_csv(output, sep='\t', header=False, index=False)  # a CSV that will be used for the bulk load
output.seek(0)
cur.copy_from(output, TableName, null="")  # null values become ''; copy_from is a psycopg2 (Postgres) cursor method
conn.commit()
cur.close()
conn.close()
Answered by Joe
My solution to this problem is below, in case it helps anyone. From what I've read, the pandas to_sql method loads one record at a time.
You can make a bulk insert statement that loads 1000 lines and commits that transaction instead of committing a single row each time. This increases the speed massively.
import pandas as pd
from sqlalchemy import create_engine
import pymssql
import os

connect_string = [your connection string]
engine = create_engine(connect_string, echo=False)
connection = engine.raw_connection()
cursor = connection.cursor()

def load_data(report_name):
    # my report_name variable is also my sql server table name, so I use it to build the table name string
    sql_table_name = 'AR_' + str(report_name)
    global chunk  # to QC chunks that fail for some reason

    for chunk in pd.read_csv(report_full_path_new, chunksize=1000):
        chunk.replace('\'', '\'\'', inplace=True, regex=True)  # double up single quotes in the data to escape them in the SQL statement
        chunk.fillna('NULL', inplace=True)

        my_data = str(chunk.to_records(index=False).tolist())  # convert data to string
        my_data = my_data[1:-1]  # clean up the ends
        my_data = my_data.replace('\"', '\'').replace('\'NULL\'', 'NULL')  # convert blanks to NULLs

        sql_table_name = [your sql server table name]

        sql = """
        INSERT INTO {0}
        VALUES {1}
        """.format(sql_table_name, my_data)

        cursor.execute(sql)
        # you must call commit() to persist your data if you don't set autocommit to True
        connection.commit()
Answered by NemesisMF
I recently had the same problem and feel like adding an answer to this for others. to_sql seems to send an INSERT query for every row, which makes it really slow. But since 0.24.0 there is a method parameter in pandas.to_sql() where you can define your own insertion function, or just use method='multi' to tell pandas to pass multiple rows in a single INSERT query, which makes it a lot faster.
Note that your database may have a parameter limit. In that case you also have to define a chunksize.
So the solution should simply look like this:
my_data_frame.to_sql(TableName, engine, chunksize=<yourParameterLimit>, method='multi')
If you do not know your database parameter limit, just try it without the chunksize parameter. It will either run or give you an error telling you your limit.
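For SQL Server in particular, the driver-side cap is 2100 parameters per statement, so a workable chunksize can be estimated from the column count. A rough sketch, assuming the TableName and engine from the question (the 2100 figure is SQL Server's limit; other databases differ):

# With method='multi', every value in a chunk becomes one bound parameter,
# so keep rows_per_chunk * params_per_row below the 2100 cap.
params_per_row = len(my_data_frame.columns) + 1   # +1 for the index, which to_sql writes by default
safe_chunksize = 2100 // params_per_row - 1       # small safety margin
my_data_frame.to_sql(TableName, engine, chunksize=safe_chunksize, method='multi')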
Answered by glenn6452
You can use this: what makes it faster is the method parameter of pandas to_sql. I hope this helps.
The result of this, in my experience, went from a seemingly infinite run time down to 8 seconds.
import time
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv('test.csv')
conn = create_engine(<connection_string>)
start_time = time.time()
df.to_sql('table_name', conn, method='multi', index=False, if_exists='replace')
print("--- %s seconds ---" % (time.time() - start_time))
Answered by anatoly
I was running out of time and memory (more than 18 GB allocated for a DataFrame loaded from a 120 MB CSV) with this line:
df.to_sql('my_table', engine, if_exists='replace', method='multi', dtype={"text_field": db.String(64), "text_field2": db.String(128), "intfield1": db.Integer(), "intfield2": db.Integer(), "floatfield": db.Float()})
Here is the code that helped me to import and track progress of insertions at the same time:
import sqlalchemy as db

engine = db.create_engine('mysql://user:password@localhost:3306/database_name', echo=False)
connection = engine.connect()
metadata = db.MetaData()

my_table = db.Table('my_table', metadata,
                    db.Column('text_field', db.String(64), index=True),
                    db.Column('text_field2', db.String(128), index=True),
                    db.Column('intfield1', db.Integer()),
                    db.Column('intfield2', db.Integer()),
                    db.Column('floatfield', db.Float())
                    )
metadata.create_all(engine)

kw_dict = df.reset_index().sort_values(by="intfield2", ascending=False).to_dict(orient="records")

batch_size = 10000
for batch_start in range(0, len(kw_dict), batch_size):
    print("Inserting {}-{}".format(batch_start, batch_start + batch_size))
    connection.execute(my_table.insert(), kw_dict[batch_start:batch_start + batch_size])