Speed up to_sql() when writing Pandas DataFrame to Oracle database using SqlAlchemy and cx_Oracle

Disclaimer: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/42727990/

Tags: oracle, performance, pandas, dataframe, sqlalchemy

Asked by breezymri

Using the pandas DataFrame's to_sql method, I can write a small number of rows to a table in an Oracle database pretty easily:

from sqlalchemy import create_engine
import cx_Oracle

# TNS descriptor for the target database (host and service name are placeholders)
dsn_tns = "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<host>)(PORT=1521))\
       (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=<servicename>)))"
pwd = input('Please type in password:')
engine = create_engine('oracle+cx_oracle://myusername:' + pwd + '@%s' % dsn_tns)
df.to_sql('test_table', engine.connect(), if_exists='replace')

But with any regular-sized dataframe (mine has 60k rows, not so big), the code became unusable, as it never finished in the time I was willing to wait (definitely more than 10 min). I've googled and searched quite a few times, and the closest solution was the answer given by ansonw in this question. But that one was about MySQL, not Oracle. As Ziggy Eunicien pointed out, it did not work for Oracle. Any ideas?

EDIT

Here's a sample of rows in the dataframe:

id         name           premium      created_date  init_p   term_number  uprate  value  score       group  action_reason
160442353  LDP: Review    1295.619617  2014-01-20    1130.75  1            7       -42    236.328243  6      pass
164623435  TRU: Referral  453.224880   2014-05-20    0.00     11           NaN     -55    38.783290   1      suppress

and here are the data types for the df:

id               int64
name             object
premium          float64
created_date     object
init_p           float64
term_number      float64
uprate           float64
value            float64
score            float64
group            int64
action_reason    object

Answered by MaxU

Pandas + SQLAlchemy by default save all object (string) columns as CLOB in Oracle DB, which makes insertion extremely slow.
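
You can see which columns will be affected before writing anything. A minimal sketch, assuming df is the DataFrame from the question:

object_cols = df.columns[df.dtypes == 'object'].tolist()
print(object_cols)   # these columns would be mapped to CLOB by default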

Here are some tests:

import pandas as pd
import cx_Oracle
from sqlalchemy import types, create_engine

#######################################################
### DB connection strings config
#######################################################
tns = """
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = my-db-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = my_service_name)
    )
  )
"""

usr = "test"
pwd = "my_oracle_password"

engine = create_engine('oracle+cx_oracle://%s:%s@%s' % (usr, pwd, tns))

# sample DF [shape: `(2000, 11)`]
# i took your 2 rows DF and replicated it: `df = pd.concat([df]* 10**3, ignore_index=True)`
df = pd.read_csv('/path/to/file.csv')

DF info:

In [61]: df.shape
Out[61]: (2000, 11)

In [62]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 11 columns):
id               2000 non-null int64
name             2000 non-null object
premium          2000 non-null float64
created_date     2000 non-null datetime64[ns]
init_p           2000 non-null float64
term_number      2000 non-null int64
uprate           1000 non-null float64
value            2000 non-null int64
score            2000 non-null float64
group            2000 non-null int64
action_reason    2000 non-null object
dtypes: datetime64[ns](1), float64(4), int64(4), object(2)
memory usage: 172.0+ KB

Let's check how long it will take to store it in the Oracle DB:

In [57]: df.shape
Out[57]: (2000, 11)

In [58]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace')
1 loop, best of 1: 16 s per loop

In Oracle DB (pay attention to the CLOBs):

AAA> desc test.test_table
 Name                            Null?    Type
 ------------------------------- -------- ------------------
 ID                                       NUMBER(19)
 NAME                                     CLOB        #  !!!
 PREMIUM                                  FLOAT(126)
 CREATED_DATE                             DATE
 INIT_P                                   FLOAT(126)
 TERM_NUMBER                              NUMBER(19)
 UPRATE                                   FLOAT(126)
 VALUE                                    NUMBER(19)
 SCORE                                    FLOAT(126)
 group                                    NUMBER(19)
 ACTION_REASON                            CLOB        #  !!!

Now let's instruct pandas to save all object columns as VARCHAR data types:

In [59]: dtyp = {c:types.VARCHAR(df[c].str.len().max())
    ...:         for c in df.columns[df.dtypes == 'object'].tolist()}
    ...:

In [60]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp)
1 loop, best of 1: 335 ms per loop

This time it was approximately 48 times faster.

Check in Oracle DB:

 AAA> desc test.test_table
 Name                          Null?    Type
 ----------------------------- -------- ---------------------
 ID                                     NUMBER(19)
 NAME                                   VARCHAR2(13 CHAR)        #  !!!
 PREMIUM                                FLOAT(126)
 CREATED_DATE                           DATE
 INIT_P                                 FLOAT(126)
 TERM_NUMBER                            NUMBER(19)
 UPRATE                                 FLOAT(126)
 VALUE                                  NUMBER(19)
 SCORE                                  FLOAT(126)
 group                                  NUMBER(19)
 ACTION_REASON                          VARCHAR2(8 CHAR)        #  !!!
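
One caveat worth noting (my addition, not part of the original answer): df[c].str.len().max() returns NaN for an object column that contains no strings at all, and VARCHAR(NaN) is not a valid column type. A defensive sketch with an assumed fallback length:

import math
from sqlalchemy import types

DEFAULT_LEN = 255   # assumed fallback length for columns with no strings

dtyp = {}
for c in df.columns[df.dtypes == 'object']:
    max_len = df[c].str.len().max()
    # use the fallback when the column had nothing to measure
    dtyp[c] = types.VARCHAR(DEFAULT_LEN if math.isnan(max_len) else int(max_len))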

Let's test it with a 200,000-row DF:

In [69]: df.shape
Out[69]: (200000, 11)

In [70]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp, chunksize=10**4)
1 loop, best of 1: 4.68 s per loop

It took ~5 seconds for the 200K-row DF in my test (not the fastest) environment.

Conclusion: use the following trick to explicitly specify a dtype for all DF columns of object dtype when saving DataFrames to Oracle DB. Otherwise they will be saved as the CLOB data type, which requires special treatment and makes inserts very slow:

dtyp = {c:types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes == 'object'].tolist()}

df.to_sql(..., dtype=dtyp)
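
Putting it all together, a minimal end-to-end sketch based on the tests above; the dtyp mapping and chunksize are taken from the answer, while the connection string, file path, and table name are placeholders to adapt to your environment:

import pandas as pd
from sqlalchemy import types, create_engine

# Placeholder connection string -- substitute your own user/password/DSN
engine = create_engine('oracle+cx_oracle://myusername:mypassword@my_dsn')

df = pd.read_csv('/path/to/file.csv')

# Map every object (string) column to VARCHAR sized to its longest value,
# so pandas does not fall back to CLOB
dtyp = {c: types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes == 'object'].tolist()}

# Write in batches of 10,000 rows
df.to_sql('test_table', engine, index=False, if_exists='replace',
          dtype=dtyp, chunksize=10**4)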