pandas: How do I convert a pandas DataFrame for insertion via an executemany() statement?

Question

Asked by Colin O'Brien

I have a fairly big pandas DataFrame (50 or so headers and a few hundred thousand rows of data) and I'm looking to transfer this data to a database using the ceODBC module. Previously I was using pyodbc with a simple execute statement in a for loop, but this was taking ridiculously long (1000 records per 10 minutes).

I'm now trying a new module and want to use executemany(), although I'm not quite sure what is meant by "sequence of parameters" in:

    cursor.executemany("""insert into table.name(a, b, c, d, e)
    values(?, ?, ?, ?, ?)""", sequence_of_parameters)

Should it look like one flat list that runs through each header in turn, like this:

    ['asdas', '1', '2014-12-01', 'true', 'asdasd', 'asdas', '2',
     '2014-12-02', 'true', 'asfasd', 'asdfs', '3', '2014-12-03', 'false', 'asdasd']

where this is an example of three rows,

or what format is needed?

As a related question, how can I convert a regular pandas DataFrame to this format?

Thanks!

Answer 1

Accepted answer by Colin O'Brien

I managed to figure this out in the end. If you have a pandas DataFrame that you want to write to a database using ceODBC, which is the module I used, the code is as follows.

With all_data as the DataFrame, map the values to strings and store each row as a tuple in a list of tuples:

for r in all_data.columns.values:
    all_data[r] = all_data[r].map(str)
    all_data[r] = all_data[r].map(str.strip)   
tuples = [tuple(x) for x in all_data.values]

In the list of tuples, change all null-value markers, which were captured as strings in the conversion above, into None so that they reach the database as real nulls. This was an issue for me; it might not be for you.

string_list = ['NaT', 'nan', 'NaN', 'None']

def remove_wrong_nulls(x):
    for r in range(len(x)):
        for i,e in enumerate(tuples):
            for j,k in enumerate(e):
                if k == x[r]:
                    temp=list(tuples[i])
                    temp[j]=None
                    tuples[i]=tuple(temp)

remove_wrong_nulls(string_list)
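
The same cleanup can also be done in a single pass over the rows. The following is only a sketch, not the code the answer used: the function name `replace_null_strings` and the sample rows are made up for illustration, and it builds a new list instead of modifying `tuples` in place.

```python
# Strings that str() produces for missing values (same markers as string_list).
NULL_STRINGS = {"NaT", "nan", "NaN", "None"}

def replace_null_strings(rows, null_strings=NULL_STRINGS):
    """Return new row tuples with null-marker strings swapped for None."""
    return [
        tuple(None if value in null_strings else value for value in row)
        for row in rows
    ]

# Made-up sample rows standing in for the `tuples` list built earlier.
sample = [("a", "nan", "2014-12-01"), ("NaT", "b", "None")]
cleaned = replace_null_strings(sample)
```

As with the original function, a genuine text value that happens to equal one of the markers (for example the literal string 'None') would also be turned into a null.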

Create a connection to the database:

cnxn=ceODBC.connect('DRIVER={SOMEODBCDRIVER};DBCName=XXXXXXXXXXX;UID=XXXXXXX;PWD=XXXXXXX;QUIETMODE=YES;', autocommit=False)
cursor = cnxn.cursor()

Define a function that splits the list of tuples into chunks of 1000 rows, producing new_list, a list of lists of tuples. This was necessary for me because the target database could not accept an SQL query larger than 1 MB.

def chunks(l, n):
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]

new_list = chunks(tuples, 1000)

Define your query:

query = """insert into XXXXXXXXXXXX("XXXXXXXXXX", "XXXXXXXXX", "XXXXXXXXXXX") values(?,?,?)"""

Loop over new_list, which holds the tuples in groups of 1000, and call executemany on each group. Then commit and close the connection, and that's it :)

for chunk in new_list:
    cursor.executemany(query, chunk)
cnxn.commit()
cnxn.close()
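
To see the chunked insert run end to end without a database server, here is a minimal sketch that swaps ceODBC for the standard-library `sqlite3` module. That swap is an assumption made only so the example is runnable; the table name `target`, its columns, and the generated rows are made up. It also rolls back if any batch fails, which the code above does not do.

```python
import sqlite3

def chunks(seq, n):
    """Yield successive slices of seq holding at most n items each."""
    n = max(1, n)
    for start in range(0, len(seq), n):
        yield seq[start:start + n]

# Made-up data: 2500 rows of three columns, with one null in the first row.
tuples = [(str(i), "x", None if i == 0 else "y") for i in range(2500)]

cnxn = sqlite3.connect(":memory:")  # stand-in for ceODBC.connect(...)
cursor = cnxn.cursor()
cursor.execute("create table target (a text, b text, c text)")
query = "insert into target (a, b, c) values (?, ?, ?)"

batch_sizes = []
try:
    for batch in chunks(tuples, 1000):
        cursor.executemany(query, batch)
        batch_sizes.append(len(batch))
    cnxn.commit()      # commit once, after every batch has succeeded
except Exception:
    cnxn.rollback()    # discard the partial insert if any batch fails
    raise

total = cursor.execute("select count(*) from target").fetchone()[0]
nulls = cursor.execute("select count(*) from target where c is null").fetchone()[0]
cnxn.close()
```

Passing None in a row tuple is what produces a database null here, which is why the string markers had to be replaced beforehand.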

Answer 2

Answered by ansen

You can try this:

cursor.executemany(sql_str, your_dataframe.values.tolist())
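
As a rough illustration of the shape executemany() expects, here is a sketch that uses the standard-library `sqlite3` module as a stand-in for an ODBC driver (an assumption, chosen so the example runs anywhere; the table `demo` and its columns are made up). The `rows` literal mimics what `your_dataframe.values.tolist()` would return: one inner list per row, not one flat list.

```python
import sqlite3

# Stand-in for your_dataframe.values.tolist(): one inner sequence per row.
rows = [
    ["asdas", 1, "2014-12-01", True, "asdasd"],
    ["asdas", 2, "2014-12-02", True, "asfasd"],
    ["asdfs", 3, "2014-12-03", False, "asdasd"],
]

conn = sqlite3.connect(":memory:")  # hypothetical in-memory database
cur = conn.cursor()
cur.execute("create table demo (a text, b integer, c text, d integer, e text)")

# One "?" per column; the statement is executed once per inner sequence.
cur.executemany("insert into demo (a, b, c, d, e) values (?, ?, ?, ?, ?)", rows)
conn.commit()

inserted = cur.execute("select count(*) from demo").fetchone()[0]
conn.close()
```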

Hope it helps.

Answer 3

Answered by Victor Uriarte

It might be a little late to answer this question, but maybe it can still help someone. executemany() is not implemented by many ODBC drivers. One that does have it is MySQL. When the documentation refers to a sequence of parameters, it means something like:

parameters=[{'name':'Jorge', 'age':22, 'sex':'M'}, 
            {'name':'Karen', 'age':25, 'sex':'F'}, 
            {'name':'James', 'age':29, 'sex':'M'}]

and the query statement would look something like:

SQL = "INSERT IGNORE INTO WORKERS (NAME, AGE, SEX) VALUES (%(name)s, %(age)s, %(sex)s)"

It looks like you got there already. A couple of things I want to point out in case they help, though: pandas has a to_sql function that inserts into a database if you give it the connection object, and it can chunk the data as well.

To quickly create a sequence of parameters from a pandas DataFrame, I found the following two methods helpful:

# creates list of dict, list of parameters
# REF: https://groups.google.com/forum/#!topic/pydata/qna3Z3WmVpM
parameters = [df.iloc[line, :].to_dict() for line in range(len(df))]

# Cleaner Way
parameters = df.to_dict(orient='records')
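
To show how a list of dicts pairs with named placeholders, here is a small sketch using the standard-library `sqlite3` module instead of a MySQL driver. This is an assumption made so the example is runnable; the `workers` table is created just for the demonstration, and the `parameters` literal is written out by hand in the shape that `df.to_dict(orient='records')` returns.

```python
import sqlite3

# Shape returned by df.to_dict(orient='records'): one dict per row.
parameters = [
    {"name": "Jorge", "age": 22, "sex": "M"},
    {"name": "Karen", "age": 25, "sex": "F"},
    {"name": "James", "age": 29, "sex": "M"},
]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table workers (name text, age integer, sex text)")

# sqlite3 spells named placeholders ":name"; the query above uses "%(name)s".
cur.executemany(
    "insert into workers (name, age, sex) values (:name, :age, :sex)",
    parameters,
)
conn.commit()

ages = [row[0] for row in cur.execute("select age from workers order by age")]
conn.close()
```

The placeholder spelling depends on the driver's paramstyle, so the same list of dicts can be reused while only the query text changes.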
