Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/33001585/

pandas DataFrame concat / update ("upsert")?

python, pandas

Asked by embeepea

I am looking for an elegant way to append all the rows from one DataFrame to another DataFrame (both DataFrames having the same index and column structure), but in cases where the same index value appears in both DataFrames, use the row from the second data frame.

So, for example, if I start with:

df1:
                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'A2'   'B2'
    '2015-10-03'  'A3'   'B3'

df2:
                    A      B
    date
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

I would like the result to be:

                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'
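
For reference, the example frames and the desired result can be built as follows. This is a minimal sketch that assumes the dates are plain strings used as an index named `date`, as the quoted values suggest; the variable name `expected` is introduced here only for illustration.

```python
import pandas as pd

# Assumption: 'date' is a string index, as the quoted values in the question suggest.
df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

# The desired "upsert" result: rows from df2 win on shared dates,
# and dates found only in df2 are appended.
expected = pd.DataFrame(
    {"A": ["A1", "a1", "a2", "a3"], "B": ["B1", "b1", "b2", "b3"]},
    index=pd.Index(
        ["2015-10-01", "2015-10-02", "2015-10-03", "2015-10-04"], name="date"
    ),
)
```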

This is analogous to what I think is called "upsert" in some SQL systems: a combination of update and insert, in the sense that each row from df2 is either (a) used to update an existing row in df1 if the row key already exists in df1, or (b) inserted into df1 at the end if the row key does not already exist.

I have come up with the following:

(pd.concat([df1, df2])     # concat the two DataFrames
    .reset_index()        # turn 'date' into a regular column
    .groupby('date')      # group rows by values in the 'date' column
    .tail(1)              # take the last row in each group
    .set_index('date'))   # restore 'date' as the index

This seems to work, but it relies on the rows within each groupby group always keeping the same order as in the original DataFrames, which I haven't verified, and it seems displeasingly convoluted.

Does anyone have any ideas for a more straightforward solution?

Answered by Alexander

One solution is to concatenate df1 with the new rows in df2 (i.e. those whose index does not match any index in df1). Then update the values with those from df2.

df = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
df.update(df2)

>>> df
             A   B
2015-10-01  A1  B1
2015-10-02  a1  b1
2015-10-03  a2  b2
2015-10-04  a3  b3

EDIT: Per the suggestion of @chrisb, this can be simplified further as follows:

pd.concat([df1[~df1.index.isin(df2.index)], df2])

Thanks Chris!
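
To check that the two variants agree on the sample data from the question, here is a self-contained sketch. The function names `upsert_update` and `upsert_concat` are hypothetical, introduced only for this illustration, and the dates are assumed to be plain strings.

```python
import pandas as pd

# Sample data from the question (assumption: string dates as the index).
df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)


def upsert_update(left, right):
    # Append the rows of `right` whose index is new, then overwrite shared rows.
    out = pd.concat([left, right[~right.index.isin(left.index)]])
    out.update(right)  # update() modifies `out` in place and returns None
    return out


def upsert_concat(left, right):
    # Drop the rows of `left` that `right` replaces, then append all of `right`.
    return pd.concat([left[~left.index.isin(right.index)], right])


result_a = upsert_update(df1, df2)
result_b = upsert_concat(df1, df2)
print(result_a)
print(result_b)
```

On this data both variants yield the same rows in the same order. In general the row order can differ: the update variant leaves an updated row at its position in df1, while the simplified variant moves it to wherever it sits in df2, so a `sort_index()` afterwards may be useful if order matters.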

Answered by MisterMonk

In addition to the correct answer, watch out for columns that do not exist in both dataframes:

    df1 = pd.DataFrame([['test',1, True], ['test2',2, True]]).set_index(0)
    df2 = pd.DataFrame([['test2',4], ['test3',3]]).set_index(0)

If you just use the simplified one-line solution from the accepted answer as-is, you get:

    >>>     1   2
    0       
    test    1   True
    test2   4   NaN
    test3   3   NaN

But if you are expecting the following output:

    >>>     1   2
    0       
    test    1   True
    test2   4   True
    test3   3   NaN

Just change the statement to:

    df1 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
    df1.update(df2)
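
The difference can be reproduced with a short, self-contained sketch using the two frames defined in this answer. The names `simple` and `merged` are introduced here only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([['test', 1, True], ['test2', 2, True]]).set_index(0)
df2 = pd.DataFrame([['test2', 4], ['test3', 3]]).set_index(0)

# Simplified variant: the whole 'test2' row is taken from df2,
# which has no column 2, so that cell becomes NaN.
simple = pd.concat([df1[~df1.index.isin(df2.index)], df2])

# concat + update: only the cells present in df2 overwrite df1,
# so column 2 of 'test2' keeps its value from df1.
merged = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
merged.update(df2)

print(simple)
print(merged)
```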