Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/22987015/

Slice pandas DataFrame by MultiIndex level or sublevel

python pandas

Asked by LondonRob

Inspired by this answer and the lack of an easy answer to this question, I found myself writing a little syntactic sugar to make it easier to filter by MultiIndex level.

import pandas as pd


def _filter_series(x, level_name, filter_by):
    """
    Filter a pd.Series or pd.DataFrame x by `filter_by` on the MultiIndex level
    `level_name`

    Uses `pd.Index.get_level_values()` in the background. `filter_by` is either
    a string or an iterable.
    """
    if not isinstance(x, (pd.Series, pd.DataFrame)):
        raise TypeError("Not a pandas object")

    # Wrap a single string so that isin() receives a list of labels
    if isinstance(filter_by, str):
        filter_by = [filter_by]

    # Boolean mask: True where the label on `level_name` is in `filter_by`
    mask = x.index.get_level_values(level_name).isin(filter_by)
    return x[mask]
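
For illustration, the core of this helper is a boolean mask built from a single index level. Here is a minimal sketch on a small made-up frame; the level names `outer` and `inner` and the values are assumptions for the example, not taken from the question.

```python
import pandas as pd

# Hypothetical two-level index, used only for illustration.
index = pd.MultiIndex.from_product(
    [["a", "b"], ["x", "y", "z"]], names=["outer", "inner"]
)
df = pd.DataFrame({"value": range(len(index))}, index=index)

# Boolean mask: True where the 'inner' label is one of the wanted values.
mask = df.index.get_level_values("inner").isin(["x", "z"])
filtered = df[mask]

print(filtered)
```

The mask has one entry per row, so the same pattern should work unchanged on a Series that shares the index.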

But if I know the pandas development team (and I'm starting to, slowly!) there's already a nice way to do this, and I just don't know what it is yet!

Am I right?

Answer by Jeff

This is very easy using the new multi-index slicers in master/0.14 (releasing soon), see here

There is an open issue to make this syntactically easier (it's not hard to do), see here. For example, something like df.loc[{'third': ['C1', 'C3']}] seems reasonable to me.

Here's how you can do it (requires master/0.14):

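In [1]: import numpy as np
   ...: from pandas import DataFrame, MultiIndex

In [2]: def mklbl(prefix, n):
   ...:     return ["%s%s" % (prefix, i) for i in range(n)]
   ...: 

In [11]: index = MultiIndex.from_product([mklbl('A', 4),
    ...:                                   mklbl('B', 2),
    ...:                                   mklbl('C', 4),
    ...:                                   mklbl('D', 2)],
    ...:                                  names=['first', 'second', 'third', 'fourth'])

In [12]: columns = ['value']

In [13]: # sort_index() replaces sortlevel(), which was removed in later pandas releases
    ...: df = DataFrame(np.arange(len(index) * len(columns)).reshape((len(index), len(columns))),
    ...:                index=index, columns=columns).sort_index()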

In [14]: df
Out[14]: 
                           value
first second third fourth       
A0    B0     C0    D0          0
                   D1          1
             C1    D0          2
                   D1          3
             C2    D0          4
                   D1          5
             C3    D0          6
                   D1          7
      B1     C0    D0          8
                   D1          9
             C1    D0         10
                   D1         11
             C2    D0         12
                   D1         13
             C3    D0         14
                   D1         15
A1    B0     C0    D0         16
                   D1         17
             C1    D0         18
                   D1         19
             C2    D0         20
                   D1         21
             C3    D0         22
                   D1         23
      B1     C0    D0         24
                   D1         25
             C1    D0         26
                   D1         27
             C2    D0         28
                   D1         29
             C3    D0         30
                   D1         31
A2    B0     C0    D0         32
                   D1         33
             C1    D0         34
                   D1         35
             C2    D0         36
                   D1         37
             C3    D0         38
                   D1         39
      B1     C0    D0         40
                   D1         41
             C1    D0         42
                   D1         43
             C2    D0         44
                   D1         45
             C3    D0         46
                   D1         47
A3    B0     C0    D0         48
                   D1         49
             C1    D0         50
                   D1         51
             C2    D0         52
                   D1         53
             C3    D0         54
                   D1         55
      B1     C0    D0         56
                   D1         57
             C1    D0         58
                   D1         59
                             ...

[64 rows x 1 columns]

Create an indexer across all of the levels, selecting all entries

In [15]: indexer = [slice(None)]*len(df.index.names)

Make the level we care about only have the entries we care about

In [16]: indexer[df.index.names.index('third')] = ['C1','C3']

Select it (it's important that this is a tuple!)

In [18]: df.loc[tuple(indexer),:]
Out[18]: 
                           value
first second third fourth       
A0    B0     C1    D0          2
                   D1          3
             C3    D0          6
                   D1          7
      B1     C1    D0         10
                   D1         11
             C3    D0         14
                   D1         15
A1    B0     C1    D0         18
                   D1         19
             C3    D0         22
                   D1         23
      B1     C1    D0         26
                   D1         27
             C3    D0         30
                   D1         31
A2    B0     C1    D0         34
                   D1         35
             C3    D0         38
                   D1         39
      B1     C1    D0         42
                   D1         43
             C3    D0         46
                   D1         47
A3    B0     C1    D0         50
                   D1         51
             C3    D0         54
                   D1         55
      B1     C1    D0         58
                   D1         59
             C3    D0         62
                   D1         63

[32 rows x 1 columns]
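
As a hedged aside, the hand-built tuple of per-level selectors can also be spelled with `pd.IndexSlice`, which is part of the same multi-index slicer feature. The sketch below rebuilds the example frame (using `sort_index()` and the `pd.` namespace, which are choices made here rather than in the answer) and checks that both spellings select the same rows.

```python
import numpy as np
import pandas as pd


def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]


index = pd.MultiIndex.from_product(
    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)],
    names=["first", "second", "third", "fourth"],
)
df = pd.DataFrame({"value": np.arange(len(index))}, index=index).sort_index()

# Per-level indexer built by hand: select everything, then narrow 'third'.
indexer = [slice(None)] * df.index.nlevels
indexer[df.index.names.index("third")] = ["C1", "C3"]
by_tuple = df.loc[tuple(indexer), :]

# The same selection written positionally with pd.IndexSlice.
idx = pd.IndexSlice
by_slicer = df.loc[idx[:, :, ["C1", "C3"], :], :]

print(by_slicer.head())
```

The hand-built list has the advantage that the level is looked up by name, while the `IndexSlice` form requires knowing the position of the level.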

Answer by Pietro Battiston

I actually upvoted joris's answer... but unfortunately the refactoring he mentions did not happen in 0.14 and is not happening in 0.17 either. So for the moment let me suggest a quick and dirty solution (obviously derived from Jeff's):

import pandas as pd


def filter_by(df, constraints):
    """Filter MultiIndex by sublevels."""
    # One entry per index level: the constraint if given, else select everything
    indexer = tuple(constraints.get(name, slice(None)) for name in df.index.names)
    return df.loc[indexer] if df.ndim == 1 else df.loc[indexer, :]

# Attach as a method to both Series and DataFrame (monkey-patching)
pd.Series.filter_by = filter_by
pd.DataFrame.filter_by = filter_by

... to be used as

df.filter_by({'level_name' : value})

where value can indeed be a single value, but also a list, a slice...

(untested with Panels and higher dimension elements, but I do expect it to work)
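
To make the "single value, list, or slice" point concrete, here is a small sketch of the same idea written as a plain function instead of a monkey-patched method. The name `filter_by_levels`, the level names and the data are all hypothetical and chosen only for this example; the slice case assumes the index is sorted.

```python
import pandas as pd


def filter_by_levels(obj, constraints):
    """Select rows of a Series or DataFrame by MultiIndex level name."""
    # One selector per level: the given constraint, or "everything".
    indexer = tuple(constraints.get(name, slice(None)) for name in obj.index.names)
    if obj.ndim == 1:
        return obj.loc[indexer]
    return obj.loc[indexer, :]


# Hypothetical sorted two-level index.
index = pd.MultiIndex.from_product(
    [["a", "b"], ["x", "y", "z"]], names=["outer", "inner"]
)
df = pd.DataFrame({"value": range(len(index))}, index=index)

by_list = filter_by_levels(df, {"inner": ["x", "z"]})         # list of labels
by_scalar = filter_by_levels(df, {"outer": "b"})              # single label
by_slice = filter_by_levels(df, {"inner": slice("x", "y")})   # label slice, inclusive
from_series = filter_by_levels(df["value"], {"inner": ["y"]})  # works on a Series too

print(by_list)
```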

Answer by joris

You have the filter method, which can do things like this. E.g. with the example that was asked in the linked SO question:

In [188]: df.filter(like='0630', axis=0)
Out[188]: 
                      sales        cogs    net_pft
STK_ID RPT_Date                                   
876    20060630   857483000   729541000   67157200
       20070630  1146245000  1050808000  113468500
       20080630  1932470000  1777010000  133756300
2254   20070630   501221000   289167000  118012200

The filter method is being refactored at the moment (for the upcoming 0.14), and a level keyword will be added (because currently you can run into problems if the same labels appear in different levels of the index).
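
Until such a keyword exists, one way to sidestep the ambiguity between levels is to match against a single level explicitly. The following is only a sketch of that idea, not the refactored filter method; the sales figures are made up, and only the index layout mirrors the example in this answer.

```python
import pandas as pd

# Made-up figures; only the index layout follows the example in this answer.
index = pd.MultiIndex.from_tuples(
    [
        (876, 20060630),
        (876, 20061231),
        (876, 20070630),
        (2254, 20070630),
        (2254, 20071231),
    ],
    names=["STK_ID", "RPT_Date"],
)
df = pd.DataFrame({"sales": [10, 20, 30, 40, 50]}, index=index)

# Substring match restricted to the RPT_Date level, so a label in STK_ID
# that happens to contain the same digits can never match by accident.
dates = df.index.get_level_values("RPT_Date").astype(str)
midyear = df[dates.str.contains("0630", regex=False)]

print(midyear)
```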
