Original source: http://stackoverflow.com/questions/12224778/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to filter by sub-level index in Pandas
Asked by bigbug
I have a DataFrame 'df' which has a multilevel index (STK_ID, RPT_Date):
                        sales          cogs     net_pft
STK_ID RPT_Date
000876 20060331           NaN           NaN         NaN
       20060630     857483000     729541000    67157200
       20060930    1063590000     925140000    50807000
       20061231     853960000     737660000    51574000
       20070331   -2695245000   -2305078000  -167642500
       20070630    1146245000    1050808000   113468500
       20070930    1327970000    1204800000    84337000
       20071231    1439140000    1331870000    53398000
       20080331   -3135240000   -2798090000  -248054300
       20080630    1932470000    1777010000   133756300
       20080930    1873240000    1733660000    92099000
002254 20061231  -16169620000  -15332705000  -508333200
       20070331    -763844000    -703460000    -1538000
       20070630     501221000     289167000   118012200
       20070930     460483000     274026000    95967000
How can I write a command to filter the rows whose 'RPT_Date' contains '0630' (the Q2 report)? The result should be:
                      sales        cogs    net_pft
STK_ID RPT_Date
000876 20060630   857483000   729541000   67157200
       20070630  1146245000  1050808000  113468500
       20080630  1932470000  1777010000  133756300
002254 20070630   501221000   289167000  118012200
I am trying to use df[df['RPT_Date'].str.contains('0630')], but Pandas refuses to work because 'RPT_Date' is not a column but a sub-level index.
Thanks for your tips ...
Answered by Garrett
To use the "str.*" methods on a column, you could reset the index, filter rows with a column "str.*" method call, and re-create the index.
In [72]: x = df.reset_index(); x[x.RPT_Date.str.endswith("0630")].set_index(['STK_ID', 'RPT_Date'])
Out[72]:
                      sales        cogs    net_pft
STK_ID RPT_Date
000876 20060630   857483000   729541000   67157200
       20070630  1146245000  1050808000  113468500
       20080630  1932470000  1777010000  133756300
002254 20070630   501221000   289167000  118012200
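The reset-and-restore steps can also be sketched as a self-contained script. The frame here is a small stand-in built from five rows of the question's data (sales column only), and it assumes both index levels are stored as strings, which the use of str.endswith implies. The names flat, mask and q2 are illustrative and do not come from the original answer.

```python
import pandas as pd

# Stand-in for the question's frame: five rows, sales column only.
# Assumption: both index levels hold strings, not integers.
index = pd.MultiIndex.from_tuples(
    [
        ("000876", "20060630"),
        ("000876", "20060930"),
        ("000876", "20070630"),
        ("002254", "20070331"),
        ("002254", "20070630"),
    ],
    names=["STK_ID", "RPT_Date"],
)
df = pd.DataFrame(
    {"sales": [857483000, 1063590000, 1146245000, -763844000, 501221000]},
    index=index,
)

# 1. Move both index levels into ordinary columns.
flat = df.reset_index()

# 2. Build a boolean mask with a column-level string method.
mask = flat["RPT_Date"].str.endswith("0630")

# 3. Keep the matching rows and rebuild the two-level index.
q2 = flat[mask].set_index(["STK_ID", "RPT_Date"])

print(q2)
```

Splitting the one-liner into three named steps makes it easier to inspect the intermediate mask when the filter does not return what you expect.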
However, this approach is not particularly fast.
In [73]: timeit x = df.reset_index(); x[x.RPT_Date.str.endswith("0630")].set_index(['STK_ID', 'RPT_Date'])
1000 loops, best of 3: 1.78 ms per loop
Another approach builds on the fact that a MultiIndex object behaves much like a list of tuples.
In [75]: df.index
Out[75]:
MultiIndex
[('000876', '20060331') ('000876', '20060630') ('000876', '20060930')
('000876', '20061231') ('000876', '20070331') ('000876', '20070630')
('000876', '20070930') ('000876', '20071231') ('000876', '20080331')
('000876', '20080630') ('000876', '20080930') ('002254', '20061231')
('002254', '20070331') ('002254', '20070630') ('002254', '20070930')]
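A minimal sketch of what "behaves like a list of tuples" means in practice, using three of the index entries shown above. The variable names second_key and report_dates are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("000876", "20060331"), ("000876", "20060630"), ("002254", "20070630")],
    names=["STK_ID", "RPT_Date"],
)

# Positional access returns one (STK_ID, RPT_Date) tuple.
second_key = index[1]

# Iterating yields the same tuples, so ordinary tuple indexing
# reaches the sub-level value of every entry.
report_dates = [key[1] for key in index]

print(second_key)
print(report_dates)
```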
Building on that, you can create a boolean array from a MultiIndex with df.index.map() and use the result to filter the frame.
In [76]: df[df.index.map(lambda x: x[1].endswith("0630"))]
Out[76]:
                      sales        cogs    net_pft
STK_ID RPT_Date
000876 20060630   857483000   729541000   67157200
       20070630  1146245000  1050808000  113468500
       20080630  1932470000  1777010000  133756300
002254 20070630   501221000   289167000  118012200
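The same filter as a standalone sketch, again on a five-row stand-in for the question's frame with string index levels assumed. One detail is an addition of this sketch, not part of the original answer: the result of df.index.map() is passed through np.asarray(..., dtype=bool) so that the mask is a plain NumPy boolean array regardless of what dtype the mapped Index ends up with.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame: five rows, sales column only.
# Assumption: both index levels hold strings, not integers.
index = pd.MultiIndex.from_tuples(
    [
        ("000876", "20060630"),
        ("000876", "20060930"),
        ("000876", "20070630"),
        ("002254", "20070331"),
        ("002254", "20070630"),
    ],
    names=["STK_ID", "RPT_Date"],
)
df = pd.DataFrame(
    {"sales": [857483000, 1063590000, 1146245000, -763844000, 501221000]},
    index=index,
)

# Each index entry arrives as a (STK_ID, RPT_Date) tuple, so key[1] is the
# report date. Converting to a NumPy bool array is this sketch's own choice.
mask = np.asarray(df.index.map(lambda key: key[1].endswith("0630")), dtype=bool)

# Boolean indexing keeps the original MultiIndex; nothing has to be rebuilt.
q2 = df[mask]

print(q2)
```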
This is also quite a bit faster.
In [77]: timeit df[df.index.map(lambda x: x[1].endswith("0630"))]
1000 loops, best of 3: 240 us per loop
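To repeat the comparison outside IPython, the two approaches can be timed with the standard-library timeit module. This is only a sketch: the helper names via_reset_index and via_index_map are hypothetical, the frame is a five-row stand-in with string index levels assumed, and the absolute numbers (and possibly the ratio) will differ from the figures quoted above depending on the machine, the pandas version and the size of the frame.

```python
import timeit

import numpy as np
import pandas as pd

# Stand-in for the question's frame: five rows, sales column only.
index = pd.MultiIndex.from_tuples(
    [
        ("000876", "20060630"),
        ("000876", "20060930"),
        ("000876", "20070630"),
        ("002254", "20070331"),
        ("002254", "20070630"),
    ],
    names=["STK_ID", "RPT_Date"],
)
df = pd.DataFrame(
    {"sales": [857483000, 1063590000, 1146245000, -763844000, 501221000]},
    index=index,
)


def via_reset_index(frame):
    """Filter by flattening the index, masking a column, and re-indexing."""
    flat = frame.reset_index()
    kept = flat[flat["RPT_Date"].str.endswith("0630")]
    return kept.set_index(["STK_ID", "RPT_Date"])


def via_index_map(frame):
    """Filter by mapping over the index tuples directly."""
    mask = np.asarray(
        frame.index.map(lambda key: key[1].endswith("0630")), dtype=bool
    )
    return frame[mask]


# Total wall-clock seconds for 200 calls of each approach.
reset_seconds = timeit.timeit(lambda: via_reset_index(df), number=200)
map_seconds = timeit.timeit(lambda: via_index_map(df), number=200)

print(f"reset_index: {reset_seconds / 200 * 1e6:.0f} us per call")
print(f"index.map:   {map_seconds / 200 * 1e6:.0f} us per call")
```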

