Pandas: how does IndexSlice work?
Original question: http://stackoverflow.com/questions/44087637/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow (not to this page).
Asked by Cheng
I am following this tutorial: GitHub Link
If you scroll down to the section titled "Exercise: Select the most-reviewd beers" (Ctrl+F for that exact text), you will find the following code to select the most-reviewed beers:
top_beers = df['beer_id'].value_counts().head(10).index
reviews.loc[pd.IndexSlice[:, top_beers], ['beer_name', 'beer_style']]
My question is about the way IndexSlice is used: how come you can skip the colon after top_beers and the code still runs? With the trailing colon, the call would read:
reviews.loc[pd.IndexSlice[:, top_beers, :], ['beer_name', 'beer_style']]
There are three index levels: profile_name, beer_id and time. Why does pd.IndexSlice[:, top_beers] work without specifying what to do with the time level?
Answer by normanius
To complement the previous answer, let me explain how pd.IndexSlice works and why it is useful.
Well, there is not much to say about its implementation. As you can read in the source, it just does the following:
class IndexSlice(object):
    def __getitem__(self, arg):
        return arg
From this we see that pd.IndexSlice only forwards the arguments that __getitem__ has received. Looks pretty stupid, doesn't it? However, it actually does something.
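To make the forwarding behaviour concrete, here is a minimal pure-Python sketch. The names MyIndexSlice and my_idx are made up for illustration and are not part of pandas. The sketch assumes, as pandas does with pd.IndexSlice, that an instance of the class is what gets indexed.

```python
class MyIndexSlice:
    """Hypothetical stand-in for pd.IndexSlice: returns whatever it is indexed with."""

    def __getitem__(self, arg):
        return arg


# The brackets are applied to an instance, not to the class itself.
my_idx = MyIndexSlice()

single = my_idx[0]          # a plain value is passed through unchanged
one_slice = my_idx[1:4]     # the colon notation arrives as a slice object
mixed = my_idx[:, "a":"c"]  # several arguments arrive as one tuple

print(single)
print(one_slice)
print(mixed)
```

The object adds no logic of its own. All the work is done by Python's bracket syntax before __getitem__ is even called.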
As you certainly know already, obj.__getitem__(arg) is called when you access an object obj through its bracket operator obj[arg]. For sequence-type objects, arg can be either an integer or a slice object. We rarely construct slices ourselves. Rather, we use the colon-based slice notation for this purpose, e.g. obj[0:5]. (The colon is not the same thing as the Ellipsis object, which is written as three dots.)
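As a small illustration of the point above, the following sketch indexes an ordinary list once with the colon notation and once with an explicitly constructed slice object. The variable names are arbitrary.

```python
letters = list("abcdefgh")

# Colon notation and an explicit slice object select the same elements.
with_colon = letters[0:5]
with_slice_object = letters[slice(0, 5)]

# The same holds when a step is given and start/stop are left out.
every_other_colon = letters[::2]
every_other_object = letters[slice(None, None, 2)]

print(with_colon == with_slice_object)
print(every_other_colon == every_other_object)
```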
And here comes the point. The Python interpreter converts the slice notation : into slice objects before calling the object's __getitem__(arg) method. Therefore, the return value of IndexSlice.__getitem__() will actually be a slice, an integer (if no : was used), or a tuple of these (if multiple arguments are passed). In summary, the only purpose of IndexSlice is that we don't have to construct the slices on our own. This behavior is particularly useful for pd.DataFrame.loc.
Let's first have a look at the following examples:
import pandas as pd
idx = pd.IndexSlice
print(idx[0]) # 0
print(idx[0,'a']) # (0, 'a')
print(idx[:]) # slice(None, None, None)
print(idx[0:3]) # slice(0, 3, None)
print(idx[0.1:2.3]) # slice(0.1, 2.3, None)
print(idx[0:3,'a':'c']) # (slice(0, 3, None), slice('a', 'c', None))
We observe that all usages of the colon : are converted into slice objects. If multiple arguments are passed to the index operator, the arguments are turned into n-tuples.
To demonstrate how this can be useful for a pandas data-frame df with a multi-level index, let's have a look at the following.
# A sample table with a three-level row index
# and a single-level column index.
import numpy as np
level0 = range(0,10)
level1 = list('abcdef')
level2 = ['I', 'II', 'III', 'IV']
mi = pd.MultiIndex.from_product([level0, level1, level2])
df = pd.DataFrame(np.random.random([len(mi),2]),
                  index=mi, columns=['col1', 'col2'])
# Select column 'col1' for all rows.
df.loc[:,'col1'] # pd.Series
# Note: in the above example, the returned value has type
# pd.Series, because only one column is returned. One can
# enforce the returned object to be a data-frame:
df.loc[:,['col1']] # pd.DataFrame, or
df.loc[:,'col1'].to_frame() #
# Select all rows with top-level values 0:3.
df.loc[0:3, 'col1']
# If we want to create a slice for multiple index levels
# we somehow need to pass a list of slices. The following,
# however, leads to a SyntaxError because the slice
# notation ':' cannot be placed inside a list declaration
# (left commented out so that the rest of the code runs).
# df.loc[[0:3, 'a':'c'], 'col1']
# The following is valid python code, but looks clumsy:
df.loc[(slice(0, 3, None), slice('a', 'c', None)), 'col1']
# Here is why pd.IndexSlice is useful. It helps
# to create a slice that makes use of two index-levels.
df.loc[idx[0:3, 'a':'c'], 'col1']
# We can expand the slice specification by a third level.
df.loc[idx[0:3, 'a':'c', 'I':'III'], 'col1']
# A solitary slicing operator ':' means: take them all.
# It is equivalent to slice(None).
df.loc[idx[0:3, 'a':'c', :], 'col1'] # pd.Series
# Semantically, this is equivalent to the following,
# because the last ':' in the previous example does
# not add any information about the slice specification.
df.loc[idx[0:3, 'a':'c'], 'col1'] # pd.Series
# The following lines are also equivalent, but
# both expressions evaluate to a result with multiple columns.
df.loc[idx[0:3, 'a':'c', :], :] # pd.DataFrame
df.loc[idx[0:3, 'a':'c'], :] # pd.DataFrame
In summary, pd.IndexSlice helps to improve readability when specifying slices for row and column indices.
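The following sketch checks this on a small, deterministic table. The table layout and variable names are chosen only for this illustration. It compares a selection written with pd.IndexSlice against the same selection written with explicit slice() objects, and also uses the helper on the column axis.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Small three-level row index with sorted levels (illustrative data only).
mi = pd.MultiIndex.from_product([range(4), list("abc"), ["I", "II"]])
df = pd.DataFrame(
    {"col1": np.arange(len(mi)), "col2": np.arange(len(mi)) * 10},
    index=mi,
)

# Same row selection, written in two ways.
with_idx = df.loc[idx[0:2, "a":"b"], "col1"]
with_slices = df.loc[(slice(0, 2), slice("a", "b")), "col1"]
same = with_idx.equals(with_slices)

# IndexSlice can also build the column slice.
both_axes = df.loc[idx[0:2, "a":"b"], idx["col1":"col2"]]

print(same)
print(both_axes.shape)
```

Note that label-based slices in .loc include both end points, so 0:2 covers the labels 0, 1 and 2.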
What pandas then does with these slices is a different story. It essentially selects rows/columns, starting from the topmost index level, and narrows the selection when going further down the levels, depending on how many levels have been specified. pd.DataFrame.loc is an object with its own __getitem__() function that does all this.
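To get a feeling for this level-by-level narrowing, here is a conceptual sketch that rebuilds one such selection by hand with boolean masks. This is not how pandas implements .loc internally. It only illustrates that the result matches a filter applied to one level after the other. The sample table is made up for this illustration.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

mi = pd.MultiIndex.from_product([range(4), list("abc"), ["I", "II"]])
df = pd.DataFrame({"col1": np.arange(len(mi))}, index=mi)

level0 = df.index.get_level_values(0)
level1 = df.index.get_level_values(1)

# Start with the topmost level ...
mask = (level0 >= 0) & (level0 <= 2)
# ... then narrow the selection with the second level.
mask = mask & (level1 >= "a") & (level1 <= "b")
# The third level is not restricted, so nothing more is filtered.

manual = df[mask]
sliced = df.loc[idx[0:2, "a":"b"], :]

print(manual.equals(sliced))
```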
As you pointed out already in one of your comments, pandas seemingly behaves weirdly in some special cases. The two examples you mentioned will actually evaluate to the same result. However, they are treated differently by pandas internally.
# This will work.
reviews.loc[idx[top_reviewers, 99, :], ['beer_name', 'brewer_id']]
# This will fail with TypeError "unhashable type: 'Index'".
reviews.loc[idx[top_reviewers, 99], ['beer_name', 'brewer_id']]
# This fixes the problem. (pd.Index is not hashable, a tuple is.
# However, the problem matters only with the second expression.)
reviews.loc[idx[tuple(top_reviewers), 99], ['beer_name', 'brewer_id']]
Admittedly, the difference is subtle.
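The hashability remark in the comments above can be checked in isolation. The sketch below only tests whether a pd.Index and a tuple built from it can be hashed. It makes no claim about how a particular pandas version treats the .loc calls shown above. The label values are invented for the example.

```python
import pandas as pd

labels = pd.Index(["ann", "bob"])  # illustrative labels, not from the tutorial

# A pd.Index refuses to be hashed ...
try:
    hash(labels)
    index_hashable = True
except TypeError:
    index_hashable = False

# ... whereas a tuple of the same labels can be hashed.
as_tuple = tuple(labels)
tuple_hash = hash(as_tuple)

print(index_hashable)
print(as_tuple)
```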
Answer by TomAugspurger
Pandas only requires you to specify enough levels of the MultiIndex to remove any ambiguity. Since you're slicing on the 2nd level, you need the first : to say "I'm not filtering on this level".
Any additional levels not specified are returned in their entirety, which is equivalent to a : on each of those levels.
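A small sketch of this rule, using a made-up stand-in for the reviews table with the same three index levels. The data, the beer names and the use of a plain list for top_beers are assumptions for the sake of a self-contained example.

```python
import pandas as pd

idx = pd.IndexSlice

# Hypothetical miniature of the reviews table: three index levels.
index = pd.MultiIndex.from_product(
    [["ann", "bob"], [10, 20, 30], [1, 2]],
    names=["profile_name", "beer_id", "time"],
)
reviews = pd.DataFrame(
    {"beer_name": [f"beer-{beer_id}" for _, beer_id, _ in index]},
    index=index,
)

top_beers = [10, 30]  # a plain list stands in for the Index used in the tutorial

# Trailing level omitted vs. spelled out with ':'.
short = reviews.loc[idx[:, top_beers], ["beer_name"]]
full = reviews.loc[idx[:, top_beers, :], ["beer_name"]]

print(short.equals(full))
print(len(short))
```

The leading : cannot be dropped in the same way, because without it pandas would read top_beers as a selection on the first level.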