Notice: this page is derived from a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/23937433/

Time: 2020-09-13 22:05:44  Source: igfitidea

Efficiently joining two dataframes based on multiple levels of a multiindex

Tags: python, join, pandas

Asked by Caleb

I frequently have a dataframe with a large multiindex, and a secondary DataFrame with a multiindex that is a subset of the larger one. The secondary dataframe is usually some kind of lookup table. I often want to add the columns from the lookup table to the larger dataframe. The primary DataFrame is often very large, so I want to do this efficiently.

Here is an imaginary example, where I want to join df2 to df1:

In [11]: arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
   ....:           ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
   ....:           ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]

In [12]: tuples = list(zip(*arrays))

In [13]: index = pd.MultiIndex.from_tuples(tuples, names=['Body', 'Season','Item'])

In [14]: df1 = pd.DataFrame(np.random.randn(8,2), index=index,columns=['A','B'])

In [15]: df1
Out[15]:
                          A         B
Body Season Item
sun  summer one   -0.121588  0.272774
     winter one    0.233562 -2.005623
            two   -1.034642  0.315065
moon summer one    0.184548  0.820873
            two    0.838290  0.495047
            three  0.450813 -2.040089
     winter one   -1.149993 -0.498148
            two    2.406824 -2.031849

[8 rows x 2 columns]


In [16]: index2= pd.MultiIndex.from_tuples([('sun','summer'),('sun','winter'),('moon','summer'),('moon','winter')],names=['Body','Season'])

In [17]: df2 = pd.DataFrame(['Good','Bad','Ugly','Confused'],index=index2,columns = ['Mood'])

In [18]: df2
Out[18]:
                 Mood
Body Season
sun  summer      Good
     winter       Bad
moon summer      Ugly
     winter  Confused

[4 rows x 1 columns]

Now, suppose I want to add the columns from df2 to df1. This line is the only way I could find to do the job:

In [19]: df1 = df1.reset_index().join(df2,on=['Body','Season']).set_index(df1.index.names)

In [20]: df1
Out[20]:
                          A         B      Mood
Body Season Item
sun  summer one   -0.121588  0.272774      Good
     winter one    0.233562 -2.005623       Bad
            two   -1.034642  0.315065       Bad
moon summer one    0.184548  0.820873      Ugly
            two    0.838290  0.495047      Ugly
            three  0.450813 -2.040089      Ugly
     winter one   -1.149993 -0.498148  Confused
            two    2.406824 -2.031849  Confused

[8 rows x 3 columns]

It works, but there are two problems with this method. First, the line is ugly. Needing to reset the index, then recreate the multiindex, makes this simple operation seem needlessly complicated. Second, if I understand correctly, every time I run reset_index() and set_index(), a copy of the dataframe is created. I am often working with very large dataframes, and this seems very inefficient.
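For anyone who wants to reproduce the session above as a script, here is a minimal, self-contained sketch of the same reset_index/join/set_index round trip. The imports, the seeded random generator, and the variable name `joined` are additions for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd

# Three-level index for the large frame (same labels as in the question).
arrays = [
    ['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
    ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
    ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two'],
]
index = pd.MultiIndex.from_arrays(arrays, names=['Body', 'Season', 'Item'])

# Seeded generator so repeated runs give the same numbers (an assumption;
# the question used np.random.randn without a seed).
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((8, 2)), index=index, columns=['A', 'B'])

# Two-level lookup table whose index is a subset of df1's index levels.
index2 = pd.MultiIndex.from_tuples(
    [('sun', 'summer'), ('sun', 'winter'), ('moon', 'summer'), ('moon', 'winter')],
    names=['Body', 'Season'])
df2 = pd.DataFrame(['Good', 'Bad', 'Ugly', 'Confused'], index=index2, columns=['Mood'])

# Move the index levels to columns, join on the shared levels,
# then rebuild the original three-level index.
joined = df1.reset_index().join(df2, on=['Body', 'Season']).set_index(df1.index.names)
print(joined)
```

Assigning to a new name rather than overwriting df1 keeps the original frame available for comparison.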

Is there a better way to do this?

Accepted answer by Jeff

This is not implemented internally at the moment, but your solution is the recommended one; see here, as well as the issue.

You can simply wrap this in a function if you want to make it look nicer. reset_index/set_index do copy (though you can pass an inplace=True argument if you want); it IS truly inplace, as these are just changing the index attribute.

You could patch in a nice function like:

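def merge_multi(self, df, on):
    # Move the index levels to columns, join on them, then restore the original index.
    return self.reset_index().join(df, on=on).set_index(self.index.names)

# Attach the helper to DataFrame so it can be called as a method.
pd.DataFrame.merge_multi = merge_multi

df1.merge_multi(df2, on=['Body', 'Season'])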

However, merging by definition creates new data, so not sure how much this will actually save you.
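To illustrate the point that the merge produces a new object, here is a small sketch of the same helper written as a plain function instead of a patched method. The argument names `left` and `right` and the three-row sample frames are hypothetical, chosen only to keep the example short.

```python
import pandas as pd

def merge_multi(left, right, on):
    """Join `right` onto `left` using a subset of left's index levels."""
    return left.reset_index().join(right, on=on).set_index(left.index.names)

# Hypothetical sample data: a three-level frame and a two-level lookup table.
left = pd.DataFrame(
    {'A': [1.0, 2.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [('sun', 'summer', 'one'), ('sun', 'winter', 'one'), ('moon', 'summer', 'one')],
        names=['Body', 'Season', 'Item']))
right = pd.DataFrame(
    {'Mood': ['Good', 'Bad', 'Ugly']},
    index=pd.MultiIndex.from_tuples(
        [('sun', 'summer'), ('sun', 'winter'), ('moon', 'summer')],
        names=['Body', 'Season']))

result = merge_multi(left, right, on=['Body', 'Season'])
print(result)

# The input frame is not modified; the joined data lives in a new DataFrame.
print(list(left.columns))
```

A plain function avoids modifying the DataFrame class globally, which may be preferable in shared code.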

A better method is to build up smaller frames, then do a larger merge. You also might want to do something like this

Answer by Gio

I know this is old, but join in pandas 1.0.3 (and at least since 0.24) allows merging of MultiIndex dataframes with partially matching indices.

Following your example:

df1 = df1.join(df2, on=['Body','Season'])
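A self-contained sketch of that direct join follows, assuming a pandas version in which `on` may name index levels of the calling frame (the answer above reports 1.0.3, and at least 0.24). The integer column `A` and the variable name `direct` are illustrative and not from the original post.

```python
import pandas as pd

# Large frame indexed by three levels.
df1 = pd.DataFrame(
    {'A': range(8)},
    index=pd.MultiIndex.from_arrays(
        [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
         ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
         ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']],
        names=['Body', 'Season', 'Item']))

# Lookup table indexed by the first two of those levels.
df2 = pd.DataFrame(
    {'Mood': ['Good', 'Bad', 'Ugly', 'Confused']},
    index=pd.MultiIndex.from_tuples(
        [('sun', 'summer'), ('sun', 'winter'), ('moon', 'summer'), ('moon', 'winter')],
        names=['Body', 'Season']))

# 'Body' and 'Season' are index levels of df1 here, not columns;
# join matches them against df2's MultiIndex without resetting the index.
direct = df1.join(df2, on=['Body', 'Season'])
print(direct)
```

With the default left join, the result keeps df1's three-level index, so no reset_index/set_index round trip is needed.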