python pandas to_sql with sqlalchemy: how to speed up exporting to MS SQL?

Disclaimer: the content below is from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/29706278/

python pandas to_sql with sqlalchemy : how to speed up exporting to MS SQL?

python, sql, pandas, sqlalchemy, pyodbc

Asked by Pythonista anonymous

I have a dataframe with about 155,000 rows and 12 columns. If I export it to csv with dataframe.to_csv, the output is an 11 MB file (which is produced instantly).

If, however, I export to a Microsoft SQL Server with the to_sql method, it takes between 5 and 6 minutes! No columns are text: only int, float, bool and dates. I have seen cases where ODBC drivers set nvarchar(max) and this slows down the data transfer, but it cannot be the case here.

Any suggestions on how to speed up the export process? Taking 6 minutes to export 11 MB of data makes the ODBC connection practically unusable.

Thanks!

My code is:

import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)
conn = engine.connect()

metadata = MetaData(conn)

my_data_frame.to_sql(TableName, engine)
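
(For reference: if nvarchar(max) columns ever were the culprit, explicit column types can be pinned through to_sql's dtype argument. A minimal sketch, with purely illustrative column names and types:)

import sqlalchemy.types as sqltypes

# Hypothetical column-to-type mapping -- replace with the dataframe's real columns.
explicit_types = {
    "some_int_col": sqltypes.Integer(),
    "some_float_col": sqltypes.Float(),
    "some_date_col": sqltypes.DateTime(),
}

my_data_frame.to_sql(TableName, engine, if_exists='replace', index=False, dtype=explicit_types)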

Answered by firelynx

The DataFrame.to_sql method generates insert statements and sends them to your ODBC connector, which then treats them as regular inserts.

When this is slow, it is not the fault of pandas.

Saving the output of the DataFrame.to_sql method to a file and then replaying that file over an ODBC connector would take the same amount of time.

The proper way to bulk import data into a database is to generate a csv file and then use a load command, which in the MS flavour of SQL databases is called BULK INSERT.

For example:

BULK INSERT mydatabase.myschema.mytable
FROM 'mydatadump.csv';

The syntax reference is as follows:

BULK INSERT 
   [ database_name . [ schema_name ] . | schema_name . ] [ table_name | view_name ] 
      FROM 'data_file' 
     [ WITH 
    ( 
   [ [ , ] BATCHSIZE = batch_size ] 
   [ [ , ] CHECK_CONSTRAINTS ] 
   [ [ , ] CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ] 
   [ [ , ] DATAFILETYPE = 
      { 'char' | 'native'| 'widechar' | 'widenative' } ] 
   [ [ , ] FIELDTERMINATOR = 'field_terminator' ] 
   [ [ , ] FIRSTROW = first_row ] 
   [ [ , ] FIRE_TRIGGERS ] 
   [ [ , ] FORMATFILE = 'format_file_path' ] 
   [ [ , ] KEEPIDENTITY ] 
   [ [ , ] KEEPNULLS ] 
   [ [ , ] KILOBYTES_PER_BATCH = kilobytes_per_batch ] 
   [ [ , ] LASTROW = last_row ] 
   [ [ , ] MAXERRORS = max_errors ] 
   [ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ] 
   [ [ , ] ROWS_PER_BATCH = rows_per_batch ] 
   [ [ , ] ROWTERMINATOR = 'row_terminator' ] 
   [ [ , ] TABLOCK ] 
   [ [ , ] ERRORFILE = 'file_name' ] 
    )] 
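
A minimal end-to-end sketch of this approach from Python, reusing the names from the question: the file path and the dbo schema are assumptions, the target table is assumed to already exist with matching columns, and the path must be readable by the SQL Server service itself, not just by the Python process.

import pandas as pd
from sqlalchemy import create_engine

ServerName = "myserver"
Database = "mydatabase"
engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)

# Dump the dataframe to a flat file. BULK INSERT reads the file server-side,
# so this (hypothetical) path must be visible to the SQL Server instance.
csv_path = r'C:\temp\mydatadump.csv'
csv_text = my_data_frame.to_csv(sep=',', header=False, index=False)
with open(csv_path, 'w', newline='') as f:
    f.write(csv_text.replace('\r\n', '\n'))  # normalize to LF so it matches ROWTERMINATOR below

# The target table (mydatabase.dbo.mytable here) must already exist with matching columns.
sql = (
    "BULK INSERT mydatabase.dbo.mytable "
    "FROM '{0}' "
    "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a')"
).format(csv_path)

conn = engine.raw_connection()
try:
    conn.cursor().execute(sql)
    conn.commit()
finally:
    conn.close()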

Answered by citynorman

You can use d6tstack, which has fast pandas-to-SQL functionality because it uses native DB import commands. It supports MS SQL, Postgres and MySQL.

import d6tstack.utils

uri_psql = 'postgresql+psycopg2://usr:pwd@localhost/db'
d6tstack.utils.pd_to_psql(df, uri_psql, 'table')
uri_mssql = 'mssql+pymssql://usr:pwd@localhost/db'
d6tstack.utils.pd_to_mssql(df, uri_mssql, 'table', 'schema')  # experimental

It is also useful for importing multiple CSVs with data schema changes and/or preprocessing them with pandas before writing to the db; see further down in the examples notebook.

d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), 
    apply_after_read=apply_fun).to_psql_combine(uri_psql, 'table')

Answered by Roee Anuar

Based on this answer by Aseem.

You can use the copy_from method to simulate a bulk load with a cursor object. This was tested on Postgres; try it with your DB:

import pandas as pd
from sqlalchemy import create_engine
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)  # don't forget to add a password if needed

my_data_frame.head(0).to_sql(TableName, engine, if_exists='replace', index=False)  # create an empty table - just for structure
conn = engine.raw_connection()
cur = conn.cursor()
output = StringIO()
my_data_frame.to_csv(output, sep='\t', header=False, index=False)  # an in-memory CSV that will be used for the bulk load
output.seek(0)
cur.copy_from(output, TableName, null="")  # empty fields are loaded as NULL; note copy_from is a psycopg2 (Postgres) cursor method
conn.commit()
cur.close()
conn.close()

Answered by Joe

My solution to this problem is below, if it helps anyone. From what I've read, the pandas to_sql method loads one record at a time.

You can make a bulk insert statement that loads 1000 lines and commits that transaction instead of committing a single row each time. This increases the speed massively.

import pandas as pd
from sqlalchemy import create_engine
import pymssql

connect_string  = [your connection string]
engine = create_engine(connect_string, echo=False)
connection = engine.raw_connection()
cursor = connection.cursor()

def load_data(report_name):
    # my report_name variable is also my sql server table name, so I use it to build the table name string
    sql_table_name = 'AR_' + str(report_name)
    global chunk  # to QC chunks that fail for some reason
    for chunk in pd.read_csv(report_full_path_new, chunksize=1000):
        chunk.replace('\'', '\'\'', inplace=True, regex=True)  # escape single quotes in the data by doubling them
        chunk.fillna('NULL', inplace=True)

        my_data = str(chunk.to_records(index=False).tolist())  # convert the chunk to one long string of value tuples
        my_data = my_data[1:-1]  # strip the enclosing brackets
        my_data = my_data.replace('\"', '\'').replace('\'NULL\'', 'NULL')  # swap double quotes for single quotes and unquote NULL markers

        sql = """
        INSERT INTO {0}
        VALUES {1}
        """.format(sql_table_name, my_data)

        cursor.execute(sql)
        # you must call commit() to persist your data if you don't set autocommit to True
        connection.commit()

Answered by NemesisMF

I recently had the same problem and would like to add an answer for others. to_sql seems to send an INSERT query for every row, which makes it really slow. But since 0.24.0 there is a method parameter in pandas.to_sql() where you can define your own insertion function, or just use method='multi' to tell pandas to pass multiple rows in a single INSERT query, which makes it a lot faster.

Note that your database may have a parameter limit. In that case you also have to define a chunksize.

So the solution should simply look like this:

my_data_frame.to_sql(TableName, engine, chunksize=<yourParameterLimit>, method='multi')

If you do not know your database's parameter limit, just try it without the chunksize parameter. It will either run or give you an error telling you the limit.

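For SQL Server in particular, the driver caps a single statement at roughly 2,100 parameters, and method='multi' binds one parameter per cell, so one way to pick a safe chunksize is to derive it from the column count. A minimal sketch, reusing my_data_frame, TableName and engine from the question:

# SQL Server allows roughly 2,100 parameters per statement and method='multi'
# uses one parameter per cell, so rows per chunk is bounded by the column count.
max_params = 2100
n_cols = len(my_data_frame.columns) + 1        # +1 because to_sql also writes the index by default
chunksize = max(1, max_params // n_cols - 1)   # stay comfortably under the cap

my_data_frame.to_sql(TableName, engine, chunksize=chunksize, method='multi')
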
Answered by glenn6452

You can use this: what makes it faster is the method parameter of pandas to_sql. I hope this helps.

In my experience, this went from taking seemingly forever to 8 seconds.

import time
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv('test.csv')

conn = create_engine(<connection_string>)

start_time = time.time()
df.to_sql('table_name', conn, method='multi', index=False, if_exists='replace')
print("--- %s seconds ---" % (time.time() - start_time))

Answered by anatoly

I was running out of time and memory (more than 18 GB allocated for a DataFrame loaded from a 120 MB CSV) with this line:

df.to_sql('my_table', engine, if_exists='replace', method='multi', dtype={"text_field": db.String(64), "text_field2": db.String(128), "intfield1": db.Integer(), "intfield2": db.Integer(), "floatfield": db.Float()})

Here is the code that helped me to import and track progress of insertions at the same time:

import sqlalchemy as db
engine = db.create_engine('mysql://user:password@localhost:3306/database_name', echo=False)
connection = engine.connect()
metadata = db.MetaData()

my_table = db.Table('my_table', metadata,
              db.Column('text_field', db.String(64), index=True),
              db.Column('text_field2', db.String(128), index=True),
              db.Column('intfield1', db.Integer()),
              db.Column('intfield2', db.Integer()),
              db.Column('floatfield', db.Float())
             )
metadata.create_all(engine)
kw_dict = df.reset_index().sort_values(by="intfield2", ascending=False).to_dict(orient="records")

batch_size=10000
for batch_start in range(0, len(kw_dict), batch_size):
    print("Inserting {}-{}".format(batch_start, batch_start + batch_size))
    connection.execute(my_table.insert(), kw_dict[batch_start:batch_start + batch_size])