pandas 查询pandas MultiIndex的正确方法

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/36101873/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 00:54:01  来源:igfitidea点击:

The right way to query a pandas MultiIndex

pandas

提问by cjm2671

I've got a huge dataframe (13 million rows) which stocks and stock prices in. I've indexed them using MultiIndex(['stock', 'date']), where datehas been parsed as a DateTime.

我有一个巨大的数据框(1300 万行),其中包含股票和股票价格。我使用 将它们编入索引MultiIndex(['stock', 'date']),其中date已被解析为DateTime.

This means I can select price data based upon stock easily df.loc['AAPL'], and by specific day df.loc['AAPL'].loc['2015-05-05'].

这意味着我可以轻松地根据股票选择df.loc['AAPL']特定日期的价格数据df.loc['AAPL'].loc['2015-05-05']

What I want to know is, what is the fastest and most elegant accessor for getting all stock prices on a particular day (i.e. if the indices were reversed).

我想知道的是,获取特定日期所有股票价格的最快和最优雅的访问器是什么(即,如果指数反转)。

Currently I'm using df.xs("2015-05-05", level=1)- is that the right way? Is there a better/cleaner/faster way?

目前我正在使用df.xs("2015-05-05", level=1)- 这是正确的方法吗?有没有更好/更清洁/更快的方法?

采纳答案by JohnE

I think what you did is fine, but there are alternative ways also.

我认为你所做的很好,但也有其他方法。

>>> df = pd.DataFrame({ 
              'stock':np.repeat( ['AAPL','GOOG','YHOO'], 3 ),
              'date':np.tile( pd.date_range('5/5/2015', periods=3, freq='D'), 3 ),
              'price':(np.random.randn(9).cumsum() + 10) })

>>> df = df.set_index(['stock','date'])

                      price
stock date                 
AAPL  2015-05-05   8.538459
      2015-05-06   9.330140
      2015-05-07   8.968898
GOOG  2015-05-05   8.964389
      2015-05-06   9.828230
      2015-05-07   9.992985
YHOO  2015-05-05   9.929548
      2015-05-06   9.330295
      2015-05-07  10.676468

A slightly more standard way of using loc twice

使用 loc 两次的稍微更标准的方法

>>> df.loc['AAPL'].loc['2015-05-05']

would be to do

会做

>>> df.loc['AAPL','2015-05-05']

price    8.538459
Name: (AAPL, 2015-05-05 00:00:00), dtype: float64

And instead of xsyou could use an IndexSlice. I think for 2 levels xsis easier, but IndexSlice might be better past 2 levels.

而不是xs您可以使用 IndexSlice。我认为 2 个级别xs更容易,但 IndexSlice 可能会在 2 个级别后更好。

>>> idx=pd.IndexSlice

>>> df.loc[ idx[:,'2015-05-05'], : ]

                     price
stock date                
AAPL  2015-05-05  8.538459
GOOG  2015-05-05  8.964389
YHOO  2015-05-05  9.929548

And to be honest, I think the absolute easiest way here is use either date or stock (or neither) as index and then most selections are very straightforward. For example, if you remove the index completely you can effortlessly select by date:

老实说,我认为这里最简单的方法是使用日期或股票(或两者都不是)作为索引,然后大多数选择都非常简单。例如,如果您完全删除索引,您可以毫不费力地按日期选择:

>>> df = df.reset_index()
>>> df[ df['date']=='2015-05-05' ]

   index stock       date      price
0      0  AAPL 2015-05-05   8.538459
3      3  GOOG 2015-05-05   8.964389
6      6  YHOO 2015-05-05   9.929548

Doing some quickie timings with 3 stocks and 3000 dates (=9000 rows), I found that a simple boolean selection (no index) was about 35% faster than xs, and xs was about 35% faster than using IndexSlice. But see Jeff's comment below, you should expect the boolean selection to perform relative worse with more rows.

对 3 只股票和 3000 个日期(=9000 行)进行一些快速计时,我发现简单的布尔选择(无索引)比 xs 快约 35%,而 xs 比使用 IndexSlice 快约 35%。但是请参阅下面杰夫的评论,您应该期望布尔选择在行数更多时表现相对较差。

Of course, the best thing for you to do is test on your own data and see how it comes out.

当然,您最好的做法是测试您自己的数据,看看结果如何。