Python Pandas update SQL

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/31988322/

Date: 2020-08-19 10:50:27  Source: igfitidea

Pandas update SQL

Tags: python, postgresql, pandas

Asked by darkpool

Is there any way to do an SQL update-where from a dataframe without iterating through each line? I have a PostgreSQL database, and to update a table in the db from a dataframe I would use psycopg2 and do something like:


con = psycopg2.connect(database='mydb', user='abc', password='xyz')
cur = con.cursor()

for index, row in df.iterrows():
    sql = 'update table set column = %s where column = %s'
    cur.execute(sql, (row['whatver'], row['something']))
con.commit()
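One way to cut down the per-row overhead without changing the logic is to build the parameter list once and issue a single `executemany` call. A minimal runnable sketch, using Python's built-in `sqlite3` as a stand-in for the PostgreSQL connection (the same DB-API pattern applies to psycopg2, with `%s` placeholders instead of `?`; the table and column names are made up):

```python
import sqlite3

import pandas as pd

# Hypothetical data standing in for the dataframe in the question.
df = pd.DataFrame({"something": ["a", "b"], "whatever": [10, 20]})

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (key TEXT, val INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [("a", 0), ("b", 0), ("c", 9)])

# Send the whole batch in one call instead of one cur.execute() per row.
# int()/str() avoid numpy scalar types, which the sqlite3 driver cannot bind.
params = [(int(w), str(s)) for s, w in zip(df["something"], df["whatever"])]
con.executemany("UPDATE t SET val = ? WHERE key = ?", params)
con.commit()

print(con.execute("SELECT key, val FROM t ORDER BY key").fetchall())
# [('a', 10), ('b', 20), ('c', 9)]
```

With psycopg2 specifically, `psycopg2.extras.execute_batch` (or `execute_values`) batches statements on the wire and is usually much faster than a bare `executemany` over a network connection.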

But on the other hand, if I'm either reading a table from SQL or writing an entire dataframe to SQL (with no update-where), then I would just use pandas and SQLAlchemy. Something like:


engine = create_engine('postgresql+psycopg2://user:pswd@mydb')
df.to_sql('table', engine, if_exists='append')

It's great just having a one-liner using to_sql. Isn't there something similar to do an update-where from pandas to PostgreSQL? Or is the only way to do it by iterating through each row like I've done above? Isn't iterating through each row an inefficient way to do it?


Answered by firelynx

I have so far not seen a case where the pandas SQL connector can be used in any scalable way to update database data. It may have seemed like a good idea to build one, but really, for operational work it just does not scale.


What I would recommend is to dump your entire dataframe as CSV using


df.to_csv('filename.csv', encoding='utf-8')

Then load the CSV into the database using COPY for PostgreSQL or LOAD DATA INFILE for MySQL.
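For PostgreSQL the bulk load might look like the following (the file path and table name are placeholders; note also that `to_csv` writes the dataframe index as an extra first column unless you pass `index=False`):

```sql
-- Server-side load: the file must be readable by the postgres server process
COPY your_table FROM '/path/to/filename.csv' WITH (FORMAT csv, HEADER true);

-- Client-side alternative in psql, when the file lives on your machine:
-- \copy your_table FROM 'filename.csv' WITH (FORMAT csv, HEADER true)
```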


If you do not make other changes to the table in question while the data is being manipulated by pandas, you can just load into the table.


If there are concurrency issues, you will have to load the data into a staging table that you then use to update your primary table from.


In the latter case, your primary table needs a datetime column recording when it was last modified, so you can determine whether your pandas changes are the newest or whether the database changes should remain.
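The timestamp-guarded update from a staging table might be sketched like this (the table names and the `col1`/`updated_at` columns are hypothetical):

```sql
UPDATE final_table AS f
SET    col1       = s.col1,
       updated_at = s.updated_at
FROM   staging_table AS s
WHERE  f.id = s.id
  AND  s.updated_at > f.updated_at;  -- keep whichever version is newer
```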


Answered by nabaz

I was wondering: why don't you update the df first based on your condition, and then store the df back to the database? You could use if_exists='replace' to store it in the same table.
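A runnable sketch of that read-modify-replace round trip, using an in-memory SQLite database in place of PostgreSQL (the table and column names are made up for illustration):

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
pd.DataFrame({"id": [1, 2], "col": ["old", "old"]}).to_sql("t", con, index=False)

df = pd.read_sql("SELECT * FROM t", con)
df.loc[df["id"] == 2, "col"] = "new"                    # apply the update in pandas
df.to_sql("t", con, if_exists="replace", index=False)   # write back over the same table

print(pd.read_sql("SELECT col FROM t ORDER BY id", con)["col"].tolist())
# ['old', 'new']
```

Be aware that if_exists="replace" drops and recreates the table, so any indexes, constraints, or permissions on the original table are lost: fine for small scratch tables, risky for a production schema.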


Answered by jeffery_the_wind

It looks like you are using some external data stored in df for the conditions on updating your database table. If it is possible, why not just do a one-line SQL update?


If you are working with a smallish database (where loading the whole dataset into a Python dataframe object isn't going to kill you), then you can definitely update the dataframe conditionally after loading it with read_sql. Then you can use the keyword argument if_exists="replace" to replace the DB table with the new, updated table.


import pandas

# engine as created with create_engine() above
df = pandas.read_sql("select * from your_table;", engine)

# update information (update your_table set column = 'new value' where column = 'old value')
# still may need to iterate for many old value/new value pairs
df.loc[df['column'] == "old value", "column"] = "new value"

# send data back to sql, replacing the existing table
df.to_sql("your_table", engine, if_exists="replace")

Pandas is a powerful tool, and limited SQL support was just a small feature at first. As time goes by, people are trying to use pandas as their only database interface software. I don't think pandas was ever meant to be an end-all for database interaction, but there are a lot of people working on new features all the time. See: https://github.com/pandas-dev/pandas/issues


Answered by Parfait

Consider a temp table which would be an exact replica of your final table, cleaned out with each run:


engine = create_engine('postgresql+psycopg2://user:pswd@mydb')
df.to_sql('temp_table', engine, if_exists='replace')

sql = """
    UPDATE final_table AS f
    SET col1 = t.col1
    FROM temp_table AS t
    WHERE f.id = t.id
"""

from sqlalchemy import text

with engine.begin() as conn:     # TRANSACTION
    conn.execute(text(sql))      # wrap raw SQL in text() for SQLAlchemy 1.4+/2.0

Answered by GavinBelson

If you need to update a pandas dataframe based on multiple conditions, to simulate SQL's:


UPDATE table SET C = 75 WHERE A > 7 AND B > 69

You can simply use .loc:


>>> df
      A   B    C
0     2  40  800
1     1  90  600
2     6  80  700
3  1998  70   55
4     1  90  300
5     7  80  700
6     4  20  300
7  1998  20    2
8     7  10  100
9  1998  60    2

>>> df.loc[(df['A'] > 7) & (df['B'] > 69) , 'C'] = 75

This will set C = 75 wherever A > 7 and B > 69.
