pandas: Why would I get a memory error when using fast_executemany on a tiny df?

Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/49887717/


Why would I get a memory error with fast_executemany on a tiny df?

python, sql-server, pandas, sqlalchemy, pyodbc

Asked by Martin Bobak

I was looking for ways to speed up pushing a dataframe to SQL Server and stumbled upon an approach here. This approach blew me away in terms of speed. Using the normal to_sql took almost 2 hours, while this script pushed a 100k row x 100 column df in 12.54 seconds.

So after testing the code below with a sample df, I attempted to use a df that had many different datatypes (int, string, float, Boolean). However, I was sad to see a memory error. So I started reducing the size of my df to see what the limitations were. I noticed that if my df had any strings then I wasn't able to load it to SQL Server. I am having trouble isolating the issue further. The script below is taken from the question in the link; however, I added a tiny df with strings. Any suggestions on how to rectify this issue would be great!

import pandas as pd
import numpy as np
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus
import pyodbc

conn =  "DRIVER={SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
quoted = quote_plus(conn)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = create_engine(new_con)


@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    print("FUNC call")
    if executemany:
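        # turning on fast_executemany lets pyodbc send the whole parameter batch at once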
        cursor.fast_executemany = True


table_name = 'fast_executemany_test'
df1 = pd.DataFrame({'col1':['tyrefdg','ertyreg','efdgfdg'],
                   'col2':['tydfggfdgrefdg','erdfgfdgfdgfdgtyreg','edfgfdgdfgdffdgfdg']
                   })



s = time.time()
df1.to_sql(table_name, engine, if_exists = 'replace', chunksize = None)
print(time.time() - s)

Answered by Gord Thompson

I was able to reproduce your issue using pyodbc 4.0.23. The MemoryError was related to your use of the ancient

DRIVER={SQL Server}

Further testing using

DRIVER=ODBC Driver 11 for SQL Server

also failed, with

Function sequence error (0) (SQLParamData)

which was related to an existing pyodbc issue on GitHub. I posted my findings here.

That issue is still under investigation. In the meantime you might be able to proceed by

  • using a newer ODBC driver like DRIVER=ODBC Driver 13 for SQL Server, and
  • running pip install pyodbc==4.0.22 to use an earlier version of pyodbc (a sketch combining both workarounds follows below).
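
For reference, here is a minimal sketch of the same upload with those workarounds applied. The newer driver name, the SERVER_IP/DB_NAME/USER_ID/PWD placeholders, and the pyodbc.drivers() check are assumptions carried over from the question or added for illustration, not tested values, so adjust them to whatever is installed on your machine.

import pandas as pd
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus
import pyodbc

# Optional sanity check: list the ODBC drivers pyodbc can see,
# so a newer one than the legacy "SQL Server" driver can be chosen.
print(pyodbc.drivers())

# Placeholder connection details copied from the question -- replace with real values.
conn = "DRIVER={ODBC Driver 13 for SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quote_plus(conn)))

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        # fast_executemany only applies to bulk (executemany) statements
        cursor.fast_executemany = True

df1 = pd.DataFrame({'col1': ['tyrefdg', 'ertyreg', 'efdgfdg'],
                    'col2': ['tydfggfdgrefdg', 'erdfgfdgfdgfdgtyreg', 'edfgfdgdfgdffdgfdg']})

s = time.time()
df1.to_sql('fast_executemany_test', engine, if_exists='replace', chunksize=None)
print(time.time() - s)

If the newer driver alone does not resolve the error, downgrading with pip install pyodbc==4.0.22 (the second bullet above) is the other workaround suggested here.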