Original question: http://stackoverflow.com/questions/22987015/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Slice pandas DataFrame by MultiIndex level or sublevel
Asked by LondonRob
Inspired by this answer and the lack of an easy answer to this question, I found myself writing a little syntactic sugar to make it easier to filter by MultiIndex level.
import pandas as pd

def _filter_series(x, level_name, filter_by):
    """
    Filter a pd.Series or pd.DataFrame x by `filter_by` on the MultiIndex level
    `level_name`

    Uses `pd.Index.get_level_values()` in the background. `filter_by` is either
    a string or an iterable.
    """
    if isinstance(x, (pd.Series, pd.DataFrame)):
        if isinstance(filter_by, str):
            # Wrap a single label so that isin() receives a list-like.
            filter_by = [filter_by]
        index = x.index.get_level_values(level_name).isin(filter_by)
        return x[index]
    else:
        print("Not a pandas object")
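For reference, here is a minimal sketch of the mask that the helper builds, run on a small toy Series. The level names and labels are made up for illustration and do not come from the question.

```python
import pandas as pd

# Toy data: level names and labels are illustrative only.
index = pd.MultiIndex.from_product(
    [["A0", "A1"], ["C0", "C1", "C2"]], names=["first", "third"]
)
s = pd.Series(range(len(index)), index=index)

# Boolean mask computed from a single index level.
mask = s.index.get_level_values("third").isin(["C1", "C2"])

# Keep only the rows whose "third" label is C1 or C2.
filtered = s[mask]
print(filtered)
```

Since `isin` expects a list-like argument, the helper wraps a bare string in a list before calling it.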
But if I know the pandas development team (and I'm starting to, slowly!) there's already a nice way to do this, and I just don't know what it is yet!
Am I right?
Answered by Jeff
This is very easy using the new multi-index slicers in master/0.14 (releasing soon), see here
There is an open issue to make this syntactically easier (it's not hard to do), see here. E.g., something like df.loc[{ 'third' : ['C1','C3'] }] is, I think, reasonable.
Here's how you can do it (requires master/0.14):
In [2]: def mklbl(prefix,n):
   ...:     return ["%s%s" % (prefix,i) for i in range(n)]
   ...:
In [11]: index = MultiIndex.from_product([mklbl('A',4),
                                          mklbl('B',2),
                                          mklbl('C',4),
                                          mklbl('D',2)],names=['first','second','third','fourth'])
In [12]: columns = ['value']
In [13]: df = DataFrame(np.arange(len(index)*len(columns)).reshape((len(index),len(columns))),index=index,columns=columns).sortlevel()
In [14]: df
Out[14]:
                           value
first second third fourth
A0    B0     C0    D0          0
                   D1          1
             C1    D0          2
                   D1          3
             C2    D0          4
                   D1          5
             C3    D0          6
                   D1          7
      B1     C0    D0          8
                   D1          9
             C1    D0         10
                   D1         11
             C2    D0         12
                   D1         13
             C3    D0         14
                   D1         15
A1    B0     C0    D0         16
                   D1         17
             C1    D0         18
                   D1         19
             C2    D0         20
                   D1         21
             C3    D0         22
                   D1         23
      B1     C0    D0         24
                   D1         25
             C1    D0         26
                   D1         27
             C2    D0         28
                   D1         29
             C3    D0         30
                   D1         31
A2    B0     C0    D0         32
                   D1         33
             C1    D0         34
                   D1         35
             C2    D0         36
                   D1         37
             C3    D0         38
                   D1         39
      B1     C0    D0         40
                   D1         41
             C1    D0         42
                   D1         43
             C2    D0         44
                   D1         45
             C3    D0         46
                   D1         47
A3    B0     C0    D0         48
                   D1         49
             C1    D0         50
                   D1         51
             C2    D0         52
                   D1         53
             C3    D0         54
                   D1         55
      B1     C0    D0         56
                   D1         57
             C1    D0         58
                   D1         59
...
[64 rows x 1 columns]
Create an indexer across all of the levels, selecting all entries
In [15]: indexer = [slice(None)]*len(df.index.names)
Make the level we care about only have the entries we care about
In [16]: indexer[df.index.names.index('third')] = ['C1','C3']
Select it (it's important that this is a tuple!)
In [18]: df.loc[tuple(indexer),:]
Out[18]:
                           value
first second third fourth
A0    B0     C1    D0          2
                   D1          3
             C3    D0          6
                   D1          7
      B1     C1    D0         10
                   D1         11
             C3    D0         14
                   D1         15
A1    B0     C1    D0         18
                   D1         19
             C3    D0         22
                   D1         23
      B1     C1    D0         26
                   D1         27
             C3    D0         30
                   D1         31
A2    B0     C1    D0         34
                   D1         35
             C3    D0         38
                   D1         39
      B1     C1    D0         42
                   D1         43
             C3    D0         46
                   D1         47
A3    B0     C1    D0         50
                   D1         51
             C3    D0         54
                   D1         55
      B1     C1    D0         58
                   D1         59
             C3    D0         62
                   D1         63
[32 rows x 1 columns]
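The same selection can be sketched as a standalone script, assuming a recent pandas. Two things here differ from the session above and are my own substitutions: `sort_index()` stands in for `sortlevel()`, which was removed in later pandas releases, and `pd.IndexSlice` is used as a shorthand for writing the tuple of slices by hand.

```python
import numpy as np
import pandas as pd

def mklbl(prefix, n):
    # Build labels such as A0, A1, ... (same helper as in the session above).
    return ["%s%s" % (prefix, i) for i in range(n)]

index = pd.MultiIndex.from_product(
    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)],
    names=["first", "second", "third", "fourth"],
)
# sort_index() replaces sortlevel() here; slicers need a sorted index.
df = pd.DataFrame({"value": np.arange(len(index))}, index=index).sort_index()

# One entry per index level; ":" means "take everything on this level".
idx = pd.IndexSlice
selected = df.loc[idx[:, :, ["C1", "C3"], :], :]

# The manual indexer from the answer, built the same way as above.
indexer = [slice(None)] * len(df.index.names)
indexer[df.index.names.index("third")] = ["C1", "C3"]
manual = df.loc[tuple(indexer), :]

print(selected.shape)
```

Both forms hand `.loc` a tuple with one entry per level, so they should select the same rows.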
Answered by Pietro Battiston
I actually upvoted joris's answer... but unfortunately the refactoring he mentions did not happen in 0.14, and has not happened in 0.17 either. So for the moment let me suggest a quick and dirty solution (obviously derived from Jeff's):
import pandas as pd

def filter_by(df, constraints):
    """Filter MultiIndex by sublevels."""
    indexer = [constraints[name] if name in constraints else slice(None)
               for name in df.index.names]
    return df.loc[tuple(indexer)] if len(df.shape) == 1 else df.loc[tuple(indexer),]

pd.Series.filter_by = filter_by
pd.DataFrame.filter_by = filter_by
... to be used as
df.filter_by({'level_name' : value})
where value can indeed be a single value, but also a list, a slice...
(untested with Panels and higher dimension elements, but I do expect it to work)
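As a rough illustration, here is a sketch of the same idea written as a plain function instead of a method patched onto the pandas classes. It only handles DataFrames, and the toy frame and its level names are assumptions made up for this example.

```python
import pandas as pd

def filter_by(df, constraints):
    """Filter a DataFrame by MultiIndex level names (DataFrame-only sketch)."""
    # Levels without a constraint get slice(None), i.e. "select everything".
    indexer = [constraints.get(name, slice(None)) for name in df.index.names]
    return df.loc[tuple(indexer), :]

# Toy frame: labels and level names are illustrative only.
index = pd.MultiIndex.from_product(
    [["A0", "A1"], ["B0", "B1"], ["C0", "C1"]],
    names=["first", "second", "third"],
)
df = pd.DataFrame({"value": range(len(index))}, index=index).sort_index()

# A list constraint on one level.
by_list = filter_by(df, {"third": ["C1"]})

# A slice on one level combined with a list on another.
by_slice = filter_by(df, {"first": slice("A0", "A0"), "second": ["B1"]})

print(by_list)
print(by_slice)
```

Keeping it a free function avoids modifying `pd.DataFrame` globally, at the cost of the method-call syntax shown above.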
Answered by joris
You have the filter method that can do things like this. E.g., with the example that was asked in the linked SO question:
In [188]: df.filter(like='0630', axis=0)
Out[188]:
                      sales        cogs    net_pft
STK_ID RPT_Date
876    20060630   857483000   729541000   67157200
       20070630  1146245000  1050808000  113468500
       20080630  1932470000  1777010000  133756300
2254   20070630   501221000   289167000  118012200
The filter method is being refactored at the moment (in the upcoming 0.14), and a level keyword will be added (because now you can have a problem if the same labels appear in different levels of the index).
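To illustrate the problem with labels appearing in different levels, here is a sketch that restricts the substring match to one named level by building the mask by hand. It does not use any `level` keyword of `filter`. The data is made up for the example, and the first STK_ID deliberately contains "0630" to show the clash.

```python
import pandas as pd

# Made-up toy data: the first STK_ID contains "0630" on purpose.
index = pd.MultiIndex.from_tuples(
    [(600630, 20061231), (876, 20060630), (876, 20061231), (2254, 20070630)],
    names=["STK_ID", "RPT_Date"],
)
df = pd.DataFrame({"sales": [1, 2, 3, 4]}, index=index)

# Look only at the RPT_Date level, so STK_ID 600630 cannot match by accident.
dates = df.index.get_level_values("RPT_Date")
mask = ["0630" in str(d) for d in dates]
by_level = df.loc[mask]

print(by_level)
```

Only the rows whose RPT_Date contains "0630" are kept, regardless of what the STK_ID looks like.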

