Efficiently joining two dataframes based on multiple levels of a multiindex
Note: this page is adapted from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/23937433/
Asked by Caleb
I frequently have a dataframe with a large multiindex, and a secondary DataFrame with a multiindex that is a subset of the larger one. The secondary dataframe is usually some kind of lookup table. I often want to add the columns from the lookup table to the larger dataframe. The primary DataFrame is often very large, so I want to do this efficiently.
Here is an imaginary example, where I want to join df2 to df1:
In [10]: import numpy as np
   ....: import pandas as pd
In [11]: arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
   ....:           ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
   ....:           ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]
In [12]: tuples = list(zip(*arrays))
In [13]: index = pd.MultiIndex.from_tuples(tuples, names=['Body', 'Season', 'Item'])
In [14]: df1 = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [15]: df1
Out[15]:
                           A          B
Body Season Item
sun  summer one    -0.121588   0.272774
     winter one     0.233562  -2.005623
            two    -1.034642   0.315065
moon summer one     0.184548   0.820873
            two     0.838290   0.495047
            three   0.450813  -2.040089
     winter one    -1.149993  -0.498148
            two     2.406824  -2.031849
[8 rows x 2 columns]
In [16]: index2= pd.MultiIndex.from_tuples([('sun','summer'),('sun','winter'),('moon','summer'),('moon','winter')],names=['Body','Season'])
In [17]: df2 = pd.DataFrame(['Good','Bad','Ugly','Confused'],index=index2,columns = ['Mood'])
In [18]: df2
Out[18]:
                 Mood
Body Season
sun  summer      Good
     winter       Bad
moon summer      Ugly
     winter  Confused
[4 rows x 1 columns]
Now, suppose I want to add the columns from df2 to df1. This line is the only way I could find to do the job:
In [19]: df1 = df1.reset_index().join(df2,on=['Body','Season']).set_index(df1.index.names)
In [20]: df1
Out[20]:
                           A          B      Mood
Body Season Item
sun  summer one    -0.121588   0.272774      Good
     winter one     0.233562  -2.005623       Bad
            two    -1.034642   0.315065       Bad
moon summer one     0.184548   0.820873      Ugly
            two     0.838290   0.495047      Ugly
            three   0.450813  -2.040089      Ugly
     winter one    -1.149993  -0.498148  Confused
            two     2.406824  -2.031849  Confused
[8 rows x 3 columns]
It works, but there are two problems with this method. First, the line is ugly. Needing to reset the index, then recreate the multiindex, makes this simple operation seem needlessly complicated. Second, if I understand correctly, every time I run reset_index() and set_index(), a copy of the dataframe is created. I am often working with very large dataframes, and this seems very inefficient.
Is there a better way to do this?
Accepted answer by Jeff
This is not implemented internally at the moment, but your solution is the recommended one; see here, as well as the issue.
You can simply wrap this in a function if you want to make it look nicer. reset_index/set_index do copy (though you can pass an inplace=True argument if you want); it IS truly inplace as these are just changing the index attribute.
You could patch in a nice function like:
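from pandas import DataFrame

def merge_multi(self, df, on):
    # Flatten the index, join on the shared levels, then restore the original MultiIndex
    return self.reset_index().join(df, on=on).set_index(self.index.names)

DataFrame.merge_multi = merge_multi
df1.merge_multi(df2, on=['Body', 'Season'])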
However, merging by definition creates new data, so I'm not sure how much this will actually save you.
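For readers who would rather not attach a method to DataFrame, the same round trip can be written as a plain function. This is only a minimal sketch: `join_on_levels` and the small frames below are hypothetical names and made-up data, not part of pandas or of the original question.

```python
import pandas as pd


def join_on_levels(left, right, on):
    """Join `right` onto `left` using the index levels named in `on`.

    Hypothetical helper (not a pandas API): it performs the same
    reset_index / join / set_index round trip as merge_multi,
    but without monkey-patching DataFrame.
    """
    names = list(left.index.names)
    return left.reset_index().join(right, on=on).set_index(names)


# Small made-up frames, just to exercise the helper.
index = pd.MultiIndex.from_tuples(
    [("sun", "summer", "one"), ("sun", "winter", "one"), ("moon", "summer", "two")],
    names=["Body", "Season", "Item"],
)
left = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=index)

lookup_index = pd.MultiIndex.from_tuples(
    [("sun", "summer"), ("sun", "winter"), ("moon", "summer")],
    names=["Body", "Season"],
)
lookup = pd.DataFrame({"Mood": ["Good", "Bad", "Ugly"]}, index=lookup_index)

result = join_on_levels(left, lookup, on=["Body", "Season"])
print(result)
```

The helper returns a new frame and leaves `left` untouched, so the caller decides whether to rebind the name.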
A better method is to build up smaller frames, then do a larger merge. You also might want to do something like this
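One possible reading of "build up smaller frames, then do a larger merge" is to combine several small lookup tables first and then join the result onto the large frame once. The sketch below illustrates that reading only; the `temps` table and all values are invented for illustration.

```python
import pandas as pd

keys = pd.MultiIndex.from_tuples(
    [("sun", "summer"), ("sun", "winter"), ("moon", "summer"), ("moon", "winter")],
    names=["Body", "Season"],
)
# Two small lookup tables sharing the same (Body, Season) index (made-up data).
moods = pd.DataFrame({"Mood": ["Good", "Bad", "Ugly", "Confused"]}, index=keys)
temps = pd.DataFrame({"Temp": [30, 5, 20, -10]}, index=keys)

# Step 1: combine the small frames; they are tiny, so this is cheap.
lookup = moods.join(temps)

# Step 2: a single join against the (here, pretend-large) main frame.
big_index = pd.MultiIndex.from_tuples(
    [
        ("sun", "summer", "one"), ("sun", "winter", "one"), ("sun", "winter", "two"),
        ("moon", "summer", "one"), ("moon", "summer", "two"), ("moon", "summer", "three"),
        ("moon", "winter", "one"), ("moon", "winter", "two"),
    ],
    names=["Body", "Season", "Item"],
)
big = pd.DataFrame({"A": range(8)}, index=big_index)

combined = (
    big.reset_index()
    .join(lookup, on=["Body", "Season"])
    .set_index(list(big.index.names))
)
print(combined)
```

The large frame goes through the reset/join/set_index round trip once, however many lookup columns are added.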
Answer by Gio
I know this is old, but join in pandas 1.0.3 (and at least since 0.24) allows merging of MultiIndex dataframes with partially matching indices.
Following your example:
df1 = df1.join(df2, on=['Body','Season'])
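Put together as a self-contained script, the direct join might look like the sketch below. It assumes a pandas version in which `on` accepts index level names (the answer reports 0.24 and later). The seeded random data is illustrative, and the round-trip version from the question is included only for comparison.

```python
import numpy as np
import pandas as pd

arrays = [
    ["sun", "sun", "sun", "moon", "moon", "moon", "moon", "moon"],
    ["summer", "winter", "winter", "summer", "summer", "summer", "winter", "winter"],
    ["one", "one", "two", "one", "two", "three", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["Body", "Season", "Item"])
rng = np.random.default_rng(0)  # seeded so the example is repeatable
df1 = pd.DataFrame(rng.standard_normal((8, 2)), index=index, columns=["A", "B"])

index2 = pd.MultiIndex.from_tuples(
    [("sun", "summer"), ("sun", "winter"), ("moon", "summer"), ("moon", "winter")],
    names=["Body", "Season"],
)
df2 = pd.DataFrame({"Mood": ["Good", "Bad", "Ugly", "Confused"]}, index=index2)

# Direct join: 'Body' and 'Season' are index levels of df1, matched against df2's index.
joined = df1.join(df2, on=["Body", "Season"])

# Round trip from the question, for comparison.
roundtrip = (
    df1.reset_index()
    .join(df2, on=["Body", "Season"])
    .set_index(list(df1.index.names))
)

print(joined)
```

Both approaches attach the same `Mood` values to each row; the direct join simply skips the explicit reset and rebuild of the index.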

