pandas DataFrame concat/update ("upsert")?

Question

Asked by embeepea

I am looking for an elegant way to append all the rows from one DataFrame to another DataFrame (both DataFrames having the same index and column structure), but in cases where the same index value appears in both DataFrames, use the row from the second data frame.

So, for example, if I start with:

df1:
                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'A2'   'B2'
    '2015-10-03'  'A3'   'B3'

df2:
                    A      B
    date
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

I would like the result to be:

                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

This is analogous to what I think is called "upsert" in some SQL systems: a combination of update and insert, in the sense that each row from df2 is either (a) used to update an existing row in df1 if the row key already exists in df1, or (b) inserted into df1 at the end if the row key does not already exist.

I have come up with the following:

(pd.concat([df1, df2])     # concat the two DataFrames
    .reset_index()         # turn 'date' into a regular column
    .groupby('date')       # group rows by values in the 'date' column
    .tail(1)               # take the last row in each group
    .set_index('date'))    # restore 'date' as the index

which seems to work, but it relies on the rows within each groupby group keeping the same order as in the original DataFrames, which I haven't verified, and it seems displeasingly convoluted.

Does anyone have any ideas for a more straightforward solution?

Answer 1

Answered by Alexander

One solution is to concatenate df1 with the new rows in df2 (i.e. the rows whose index does not appear in df1), then update the values with those from df2.

df = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
df.update(df2)

>>> df
             A   B
2015-10-01  A1  B1
2015-10-02  a1  b1
2015-10-03  a2  b2
2015-10-04  a3  b3
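For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the two frames from the question and applies the concat-then-update approach. The names `new_rows` and `result` are my own, and the dates are kept as plain strings, as they appear in the question.

```python
import pandas as pd

# Rebuild the frames from the question; 'date' is the index in both.
df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

# Insert step: append only the df2 rows whose key is not yet in df1.
new_rows = df2[~df2.index.isin(df1.index)]
result = pd.concat([df1, new_rows])

# Update step: overwrite matching cells with df2's values (in place).
result.update(df2)

print(result)
```

One thing to keep in mind: `DataFrame.update` modifies the frame in place and only copies non-NA values, so a NaN in df2 would not overwrite an existing value in df1.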

EDIT: Per the suggestion of @chrisb, this can be simplified further as follows:

pd.concat([df1[~df1.index.isin(df2.index)], df2])

Thanks Chris!
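As a rough, self-contained sketch of the simplified one-liner on the question's data (the names `kept` and `upserted` are mine, not part of the answer):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

# Keep only the df1 rows that df2 does not replace ...
kept = df1[~df1.index.isin(df2.index)]

# ... then append all of df2, which supplies both updates and inserts.
upserted = pd.concat([kept, df2])

# Optional: the replaced rows now sit where df2 put them, after the
# surviving df1 rows, so sort if the original key order matters.
upserted = upserted.sort_index()

print(upserted)
```

This variant replaces whole rows rather than individual cells, so it behaves the same as the concat-then-update version only when both frames have the same columns.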

Answer 2

Answered by MisterMonk

In addition to the correct answer, watch out for columns that exist in only one of the two DataFrames:

    df1 = pd.DataFrame([['test',1, True], ['test2',2, True]]).set_index(0)
    df2 = pd.DataFrame([['test2',4], ['test3',3]]).set_index(0)

If you just use the simplified one-line solution from the answer above as-is, you get:

    >>>     1   2
    0       
    test    1   True
    test2   4   NaN
    test3   3   NaN

But if you are expecting the following output:

    >>>     1   2
    0       
    test    1   True
    test2   4   True
    test3   3   NaN

Just change the statement to:

    df1 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
    df1.update(df2)
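To see the two behaviours side by side, here is a small sketch using the same frames. The names `replaced` and `merged` are mine; the point is only that whole-row replacement drops df1's extra column for updated rows, while concat plus `update` keeps it.

```python
import pandas as pd

# df1 has an extra column (2) that df2 lacks.
df1 = pd.DataFrame([["test", 1, True], ["test2", 2, True]]).set_index(0)
df2 = pd.DataFrame([["test2", 4], ["test3", 3]]).set_index(0)

# Whole-row replacement: 'test2' comes entirely from df2,
# so its value in column 2 becomes NaN.
replaced = pd.concat([df1[~df1.index.isin(df2.index)], df2])

# Cell-wise update: only the columns present in df2 are overwritten,
# so 'test2' keeps its original value in column 2.
merged = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
merged.update(df2)

print(replaced)
print(merged)
```

Which result is right depends on whether a row in df2 is meant to replace the old row completely or only to patch the columns it carries.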
