Notice: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/17242970/

Multi-Index Sorting in Pandas

Tags: python, sorting, pandas, multi-index

Asked by Keeth

I have a multi-index DataFrame created via a groupby operation. I'm trying to do a compound sort using several levels of the index, but I can't seem to find a sort function that does what I need.

The initial dataset looks something like this (daily sales counts of various products):

         Date Manufacturer Product Name Product Launch Date  Sales
0  2013-01-01        Apple         iPod          2001-10-23     12
1  2013-01-01        Apple         iPad          2010-04-03     13
2  2013-01-01      Samsung       Galaxy          2009-04-27     14
3  2013-01-01      Samsung   Galaxy Tab          2010-09-02     15
4  2013-01-02        Apple         iPod          2001-10-23     22
5  2013-01-02        Apple         iPad          2010-04-03     17
6  2013-01-02      Samsung       Galaxy          2009-04-27     10
7  2013-01-02      Samsung   Galaxy Tab          2010-09-02      7

I use groupby to get a sum over the date range:

> grouped = df.groupby(['Manufacturer', 'Product Name', 'Product Launch Date']).sum()
                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPad         2010-04-03              30
             iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

So far so good!

Now the last thing I want to do is sort each manufacturer's products by launch date, but keep them grouped hierarchically under Manufacturer - here's all I am trying to do:

                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

When I try sortlevel() I lose the nice per-company hierarchy I had before:

> grouped.sortlevel('Product Launch Date')
                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
Apple        iPad         2010-04-03              30
Samsung      Galaxy Tab   2010-09-02              22

sort() and sort_index() just fail:

grouped.sort(['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'

grouped.sort_index(by=['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'

Seems like a simple operation, but I can't quite figure it out.

I'm not tied to using a MultiIndex for this, but since that's what groupby() returns, that's what I've been working with.

BTW the code to produce the initial DataFrame is:

from pandas import DataFrame

# One row per product per day; 'Sales' is the daily sales count.
data = {
  'Date': ['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01', '2013-01-02', '2013-01-02', '2013-01-02', '2013-01-02'],
  'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung', 'Apple', 'Apple', 'Samsung', 'Samsung'],
  'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab', 'iPod', 'iPad', 'Galaxy', 'Galaxy Tab'],
  'Product Launch Date': ['2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02', '2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02'],
  'Sales': [12, 13, 14, 15, 22, 17, 10, 7]
}
df = DataFrame(data, columns=['Date', 'Manufacturer', 'Product Name', 'Product Launch Date', 'Sales'])

Accepted answer by Andy Hayden

A hack would be to change the order of the levels:

In [11]: g
Out[11]:
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPad         2010-04-03              30
             iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

In [12]: g.index = g.index.swaplevel(1, 2)

Then call sortlevel, which (as you've found) sorts the MultiIndex levels in order:

In [13]: g = g.sortlevel()

And swap back:

In [14]: g.index = g.index.swaplevel(1, 2)

In [15]: g
Out[15]:
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

I'm of the opinion that sortlevel should not sort the remaining labels in order, so I will create a GitHub issue. :) That said, it's worth mentioning the doc note about "the need for sortedness".

Note: you could avoid the first swaplevel by changing the order of the levels in the initial groupby:

g = df.groupby(['Manufacturer', 'Product Launch Date', 'Product Name']).sum()
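As far as I know, DataFrame.sortlevel() is no longer available in current pandas releases, so the swap, sort, swap-back idea above might be sketched with swaplevel() and sort_index() instead. This is only an illustration: the frame is rebuilt from the question's data, and selecting only the 'Sales' column before summing is my own assumption to keep the result to a single column.

```python
import pandas as pd

# Rebuild the question's data (the 'Date' column is omitted for brevity).
df = pd.DataFrame({
    "Manufacturer": ["Apple", "Apple", "Samsung", "Samsung"] * 2,
    "Product Name": ["iPod", "iPad", "Galaxy", "Galaxy Tab"] * 2,
    "Product Launch Date": ["2001-10-23", "2010-04-03", "2009-04-27", "2010-09-02"] * 2,
    "Sales": [12, 13, 14, 15, 22, 17, 10, 7],
})
grouped = df.groupby(
    ["Manufacturer", "Product Name", "Product Launch Date"]
)[["Sales"]].sum()

# Move the launch date to the second position, sort all levels
# lexicographically, then restore the original level order.
g = grouped.swaplevel(1, 2).sort_index().swaplevel(1, 2)
print(g)
```

The launch dates here are ISO-formatted strings, so sorting them as text happens to match chronological order.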

Answer by Jim

This one-liner works for me:

In [1]: grouped.sortlevel(["Manufacturer","Product Launch Date"], sort_remaining=False)

                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

Note this works too:

grouped.sortlevel([0,2], sort_remaining=False)

This wouldn't have worked when you originally posted over two years ago, because sortlevel by default sorted on ALL index levels, which mucked up your company hierarchy. sort_remaining, which disables that behavior, was added last year. Here's the commit link for reference: https://github.com/pydata/pandas/commit/3ad64b11e8e4bef47e3767f1d31cc26e39593277
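For readers on a pandas version where sortlevel() is gone, the same call can probably be written with sort_index(), which also accepts level and sort_remaining. The sketch below is an assumption-laden illustration rather than the author's code: the frame is rebuilt from the question's data and only the 'Sales' column is summed.

```python
import pandas as pd

# Rebuild the question's data (the 'Date' column is omitted for brevity).
df = pd.DataFrame({
    "Manufacturer": ["Apple", "Apple", "Samsung", "Samsung"] * 2,
    "Product Name": ["iPod", "iPad", "Galaxy", "Galaxy Tab"] * 2,
    "Product Launch Date": ["2001-10-23", "2010-04-03", "2009-04-27", "2010-09-02"] * 2,
    "Sales": [12, 13, 14, 15, 22, 17, 10, 7],
})
grouped = df.groupby(
    ["Manufacturer", "Product Name", "Product Launch Date"]
)[["Sales"]].sum()

# Sort by level names ...
by_name = grouped.sort_index(
    level=["Manufacturer", "Product Launch Date"], sort_remaining=False
)
# ... or by level positions (0 = Manufacturer, 2 = Product Launch Date).
by_position = grouped.sort_index(level=[0, 2], sort_remaining=False)
print(by_name)
```

Both forms refer to the same two levels, so they should produce the same ordering.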

Answer by fpersyn

To sort a MultiIndex by its "index columns" (a.k.a. levels), use the .sort_index() method and set its level argument. To sort by multiple levels, pass a list of level names in the order they should be applied.

This should give you the DataFrame you need:

df.groupby(['Manufacturer', 'Product Name', 'Product Launch Date']).sum().sort_index(level=['Manufacturer','Product Launch Date'])
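A fuller, self-contained sketch of that pipeline follows. Two details are my own assumptions and not part of the answer: the launch dates are parsed with pd.to_datetime so the level sorts chronologically rather than as text, and only the 'Sales' column is selected before summing so the string 'Date' column is left out of the aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2013-01-01"] * 4 + ["2013-01-02"] * 4,
    "Manufacturer": ["Apple", "Apple", "Samsung", "Samsung"] * 2,
    "Product Name": ["iPod", "iPad", "Galaxy", "Galaxy Tab"] * 2,
    "Product Launch Date": ["2001-10-23", "2010-04-03", "2009-04-27", "2010-09-02"] * 2,
    "Sales": [12, 13, 14, 15, 22, 17, 10, 7],
})

# Assumption: parse the launch dates so they sort as dates, not strings.
df["Product Launch Date"] = pd.to_datetime(df["Product Launch Date"])

result = (
    df.groupby(["Manufacturer", "Product Name", "Product Launch Date"])[["Sales"]]
      .sum()
      # Sort by manufacturer first, then by launch date within each manufacturer.
      .sort_index(level=["Manufacturer", "Product Launch Date"])
)
print(result)
```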

Answer by Xavi

If you want to avoid multiple swaps within a very deep MultiIndex, you could also try this:

  1. Slicing by level X (by list comprehension + .loc + IndexSlice)
  2. Sort the desired level (sortlevel(2))
  3. Concatenate every group of level X indexes

Here is the code:

import pandas as pd
idx = pd.IndexSlice
g = pd.concat([grouped.loc[idx[i,:,:],:].sortlevel(2) for i in grouped.index.levels[0]])
g
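The same slice, sort, concatenate steps might look like this on a pandas version without sortlevel(), using sort_index(level=2) in its place. It is a sketch under assumptions: the frame is rebuilt from the question's data, only 'Sales' is summed, and the name `pieces` is introduced purely for illustration.

```python
import pandas as pd

# Rebuild the question's data (the 'Date' column is omitted for brevity).
df = pd.DataFrame({
    "Manufacturer": ["Apple", "Apple", "Samsung", "Samsung"] * 2,
    "Product Name": ["iPod", "iPad", "Galaxy", "Galaxy Tab"] * 2,
    "Product Launch Date": ["2001-10-23", "2010-04-03", "2009-04-27", "2010-09-02"] * 2,
    "Sales": [12, 13, 14, 15, 22, 17, 10, 7],
})
grouped = df.groupby(
    ["Manufacturer", "Product Name", "Product Launch Date"]
)[["Sales"]].sum()

idx = pd.IndexSlice
# 1. Slice out each manufacturer (level 0),
# 2. sort that slice by launch date (level 2),
# 3. and concatenate the sorted slices back together.
pieces = [
    grouped.loc[idx[manufacturer, :, :], :].sort_index(level=2)
    for manufacturer in grouped.index.levels[0]
]
g = pd.concat(pieces)
print(g)
```

Slicing with .loc and IndexSlice relies on the index being lexsorted, which holds here because groupby sorts its keys by default.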

Answer by David Hollett

If you are not concerned about preserving the index (I often prefer an arbitrary integer index), you can just use the following one-liner:

grouped.reset_index().sort(["Manufacturer","Product Launch Date"])
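DataFrame.sort() was removed from later pandas releases; sort_values() is its replacement for sorting by columns. A minimal sketch of the same idea, assuming the question's data and summing only the 'Sales' column (the name `flat` and the final reset_index(drop=True) are my additions, the latter to get a clean integer index):

```python
import pandas as pd

# Rebuild the question's data (the 'Date' column is omitted for brevity).
df = pd.DataFrame({
    "Manufacturer": ["Apple", "Apple", "Samsung", "Samsung"] * 2,
    "Product Name": ["iPod", "iPad", "Galaxy", "Galaxy Tab"] * 2,
    "Product Launch Date": ["2001-10-23", "2010-04-03", "2009-04-27", "2010-09-02"] * 2,
    "Sales": [12, 13, 14, 15, 22, 17, 10, 7],
})
grouped = df.groupby(
    ["Manufacturer", "Product Name", "Product Launch Date"]
)[["Sales"]].sum()

flat = (
    grouped.reset_index()  # turn the index levels back into ordinary columns
           .sort_values(["Manufacturer", "Product Launch Date"])
           .reset_index(drop=True)  # renumber rows 0..n-1 after sorting
)
print(flat)
```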