pandas: efficiently joining two DataFrames on multiple levels of a MultiIndex

Question

Asked by Caleb

I frequently have a dataframe with a large multiindex, and a secondary DataFrame with a multiindex that is a subset of the larger one. The secondary dataframe is usually some kind of lookup table. I often want to add the columns from the lookup table to the larger dataframe. The primary DataFrame is often very large, so I want to do this efficiently.

Here is an imaginary example, where I want to join df2 to df1:

In [11]: arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
   ....:           ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
   ....:           ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]

In [12]: tuples = list(zip(*arrays))

In [13]: index = pd.MultiIndex.from_tuples(tuples, names=['Body', 'Season','Item'])

In [14]: df1 = pd.DataFrame(np.random.randn(8,2), index=index,columns=['A','B'])

In [15]: df1
Out[15]:
                          A         B
Body Season Item
sun  summer one   -0.121588  0.272774
     winter one    0.233562 -2.005623
            two   -1.034642  0.315065
moon summer one    0.184548  0.820873
            two    0.838290  0.495047
            three  0.450813 -2.040089
     winter one   -1.149993 -0.498148
            two    2.406824 -2.031849

[8 rows x 2 columns]


In [16]: index2= pd.MultiIndex.from_tuples([('sun','summer'),('sun','winter'),('moon','summer'),('moon','winter')],names=['Body','Season'])

In [17]: df2 = pd.DataFrame(['Good','Bad','Ugly','Confused'],index=index2,columns = ['Mood'])

In [18]: df2
Out[18]:
                 Mood
Body Season
sun  summer      Good
     winter       Bad
moon summer      Ugly
     winter  Confused

[4 rows x 1 columns]

Now, suppose I want to add the columns from df2 to df1. This line is the only way I could find to do the job:

In [19]: df1 = df1.reset_index().join(df2,on=['Body','Season']).set_index(df1.index.names)

In [20]: df1
Out[20]:
                          A         B      Mood
Body Season Item
sun  summer one   -0.121588  0.272774      Good
     winter one    0.233562 -2.005623       Bad
            two   -1.034642  0.315065       Bad
moon summer one    0.184548  0.820873      Ugly
            two    0.838290  0.495047      Ugly
            three  0.450813 -2.040089      Ugly
     winter one   -1.149993 -0.498148  Confused
            two    2.406824 -2.031849  Confused

[8 rows x 3 columns]

It works, but there are two problems with this method. First, the line is ugly. Needing to reset the index, then recreate the multiindex, makes this simple operation seem needlessly complicated. Second, if I understand correctly, every time I run reset_index() and set_index(), a copy of the dataframe is created. I am often working with very large dataframes, and this seems very inefficient.

Is there a better way to do this?

Answer 1

Accepted answer by Jeff

This is not implemented internally at the moment, but your solution is the recommended one; see here, as well as the issue.

You can simply wrap this in a function if you want to make it look nicer. reset_index/set_index do copy (though you can pass an inplace=True argument if you want); it IS truly inplace, as these are just changing the index attribute.

You could patch in a nice function like:

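def merge_multi(self, df, on):
    # Move the index levels to columns, join on them, then restore the original MultiIndex.
    return self.reset_index().join(df, on=on).set_index(self.index.names)

pd.DataFrame.merge_multi = merge_multi  # attach the helper to DataFrame

df1.merge_multi(df2, on=['Body', 'Season'])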
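
If patching DataFrame feels too invasive, the same round trip can live in a plain function. This is a minimal sketch, assuming a hypothetical helper name join_on_levels (not part of pandas) and small made-up frames rather than the ones from the question:

```python
import numpy as np
import pandas as pd


def join_on_levels(left, right, on):
    """Hypothetical helper (not a pandas API): left-join `right` onto `left`
    using the index levels named in `on`, keeping `left`'s MultiIndex."""
    names = list(left.index.names)
    # Index levels become columns, join matches them against right's index,
    # then the original levels are set back as the index.
    return left.reset_index().join(right, on=on).set_index(names)


# Made-up data, for illustration only.
left = pd.DataFrame(
    {"A": np.arange(4.0)},
    index=pd.MultiIndex.from_tuples(
        [("sun", "summer", "one"), ("sun", "winter", "one"),
         ("moon", "summer", "one"), ("moon", "summer", "two")],
        names=["Body", "Season", "Item"],
    ),
)
right = pd.DataFrame(
    {"Mood": ["Good", "Bad", "Ugly"]},
    index=pd.MultiIndex.from_tuples(
        [("sun", "summer"), ("sun", "winter"), ("moon", "summer")],
        names=["Body", "Season"],
    ),
)

result = join_on_levels(left, right, on=["Body", "Season"])
print(result)
```

The function returns a new frame and leaves both inputs untouched, so it does nothing to avoid the copies the question asks about; it only tidies the call site.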

However, merging by definition creates new data, so I'm not sure how much this will actually save you.

A better method is to build up smaller frames, then do a larger merge. You also might want to do something like this.

Answer 2

Answer by Gio

I know this is old, but join in pandas 1.0.3 (and at least since 0.24) allows merging MultiIndex DataFrames with partially matching indices.

Following your example:

df1 = df1.join(df2, on=['Body','Season'])
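
To check this on your own pandas version, here is a self-contained sketch that rebuilds frames shaped like the ones in the question and compares the direct join with the reset_index round trip. The values in A and B are random, so only the Mood column and the index are compared. It assumes a pandas version in which `on` may name index levels of the caller, as described above.

```python
import numpy as np
import pandas as pd

arrays = [
    ["sun", "sun", "sun", "moon", "moon", "moon", "moon", "moon"],
    ["summer", "winter", "winter", "summer", "summer", "summer", "winter", "winter"],
    ["one", "one", "two", "one", "two", "three", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["Body", "Season", "Item"])
df1 = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

index2 = pd.MultiIndex.from_tuples(
    [("sun", "summer"), ("sun", "winter"), ("moon", "summer"), ("moon", "winter")],
    names=["Body", "Season"],
)
df2 = pd.DataFrame(["Good", "Bad", "Ugly", "Confused"], index=index2, columns=["Mood"])

# Direct join: `on` names index levels of df1, matched against df2's index.
direct = df1.join(df2, on=["Body", "Season"])

# The round trip from the question, for comparison.
round_trip = (
    df1.reset_index()
    .join(df2, on=["Body", "Season"])
    .set_index(df1.index.names)
)

print(direct)
print(direct["Mood"].tolist() == round_trip["Mood"].tolist())
```

Whether the direct join is actually faster on a very large frame is worth timing on your own data; this sketch only checks that the two approaches agree.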
