Python 在 Pandas MultiIndex DataFrame 中选择行

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/53927460/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 20:24:15  来源:igfitidea点击:

Select rows in pandas MultiIndex DataFrame

pythonpandasdataframeslicemulti-index

提问by cs95

What are the most common pandas ways to select/filter rows of a dataframe whose index is a MultiIndex?

选择/过滤索引为 MultiIndex数据帧的行的最常见的 Pandas 方法是什么?

  • Slicing based on a single value/label
  • Slicing based on multiple labels from one or more levels
  • Filtering on boolean conditions and expressions
  • Which methods are applicable in what circumstances
  • 基于单个值/标签的切片
  • 基于来自一个或多个级别的多个标签进行切片
  • 过滤布尔条件和表达式
  • 哪些方法适用于哪些情况

Assumptions for simplicity:

为简单起见的假设:

  1. input dataframe does not have duplicate index keys
  2. input dataframe below only has two levels. (Most solutions shown here generalize to N levels)
  1. 输入数据框没有重复的索引键
  2. 下面的输入数据框只有两个级别。(此处显示的大多数解决方案都适用于 N 级)


Example input:

示例输入:

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15
mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 1: Selecting a Single Item

问题 1:选择单个项目

How do I select rows having "a" in level "one"?

如何选择级别“一”中具有“a”的行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

Additionally, how would I be able to drop level "one" in the output?

此外,我如何才能在输出中降低“一”级?

     col
two     
t      0
u      1
v      2
w      3

Question 1b
How do I slice all rows with value "t" on level "two"?

问题 1b
如何在级别“二”上对所有值为“t”的行进行切片?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Question 2: Selecting Multiple Values in a Level

问题 2:在一个级别中选择多个值

How can I select rows corresponding to items "b" and "d" in level "one"?

如何在级别“一”中选择与项目“b”和“d”相对应的行?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 2b
How would I get all values corresponding to "t" and "w" in level "two"?

问题 2b
如何获得“二”级中“t”和“w”对应的所有值?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

Question 3: Slicing a Single Cross Section (x, y)

问题 3:切片单个横截面 (x, y)

How do I retrieve a cross section, i.e., a single row having a specific values for the index from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

如何检索横截面,即具有特定索引值的单行来自df?具体来说,我如何检索 的横截面('c', 'u'),由

         col
one two     
c   u      9

Question 4: Slicing Multiple Cross Sections [(a, b), (c, d), ...]

问题 4:切片多个横截面 [(a, b), (c, d), ...]

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

如何选择对应于('c', 'u'), 和的两行('a', 'w')

         col
one two     
c   u      9
a   w      3

Question 5: One Item Sliced per Level

问题 5:每层切一个项目

How can I retrieve all rows corresponding to "a" in level "one" or "t" in level "two"?

如何检索与级别“一”中的“a”或级别“二”中的“t”对应的所有行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

Question 6: Arbitrary Slicing

问题 6:任意切片

How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and for "d", I would like to select rows with sub-level "w".

如何切片特定的横截面?对于“a”和“b”,我想选择子级别为“u”和“v”的所有行,对于“d”,我想选择子级别为“w”的行。

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

Question 7 will use a unique setup consisting of a numeric level:

np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])

df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

         col
one two     
a   5      0
    0      1
    3      2
    3      3
b   7      4
    9      5
    3      6
    5      7
    2      8
c   4      9
    7     10
d   6     11
    8     12
    8     13
    1     14
    6     15

问题 7 将使用由数字级别组成的独特设置:

np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])

df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

         col
one two     
a   5      0
    0      1
    3      2
    3      3
b   7      4
    9      5
    3      6
    5      7
    2      8
c   4      9
    7     10
d   6     11
    8     12
    8     13
    1     14
    6     15

Question 7: Filtering by numeric inequality on individual levels of the multiindex

问题 7:按多重指数的各个级别上的数字不等式进行过滤

How do I get all rows where values in level "two" are greater than 5?

如何获取级别“二”中的值大于 5 的所有行?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15


Note: This post will notgo through how to create MultiIndexes, how to perform assignment operations on them, or any performance related discussions (these are separate topics for another time).

注意:这篇文章不会介绍如何创建多索引、如何对它们执行赋值操作,或任何与性能相关的讨论(这些是另一次单独的主题)。

回答by cs95

MultiIndex / Advanced Indexing

多索引/高级索引

Note
This post will be structured in the following manner:

  1. The questions put forth in the OP will be addressed, one by one
  2. For each question, one or more methods applicable to solving this problem and getting the expected result will be demonstrated.

Notes (much like this one) will be included for readers interested in learning about additional functionality, implementation details, and other info cursory to the topic at hand. These notes have been compiled through scouring the docs and uncovering various obscure features, and from my own (admittedly limited) experience.

All code samples have created and tested on pandas v0.23.4, python3.7. If something is not clear, or factually incorrect, or if you did not find a solution applicable to your use case, please feel free to suggest an edit, request clarification in the comments, or open a new question, ....as applicable.

注意
这篇文章的结构如下:

  1. OP中提出的问题将一一解决
  2. 对于每个问题,将演示一种或多种适用于解决该问题并获得预期结果的方法。

将包含注释(很像这个),供有兴趣了解附加功能、实现细节和其他与手头主题相关的信息的读者使用。这些笔记是通过搜索文档和发现各种晦涩的功能以及我自己(公认的有限)经验编写的。

所有代码示例都在pandas v0.23.4, python3.7上创建和测试。如果有什么不清楚,或事实上不正确,或者如果您没有找到适用于您的用例的解决方案,请随时提出编辑建议,在评论中要求澄清,或提出新问题,....如适用.

Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) we will be frequently re-visiting

这里介绍一些我们会经常重温的常见习语(以下简称“四大习语”)

  1. DataFrame.loc- A general solution for selection by label (+ pd.IndexSlicefor more complex applications involving slices)

  2. DataFrame.xs- Extract a particular cross section from a Series/DataFrame.

  3. DataFrame.query- Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically. Is more applicable to some scenarios than others. Also see this section of the docsfor querying on MultiIndexes.

  4. Boolean indexing with a mask generated using MultiIndex.get_level_values(often in conjunction with Index.isin, especially when filtering with multiple values). This is also quite useful in some circumstances.

  1. DataFrame.loc- 按标签选择的通用解决方案(+pd.IndexSlice用于涉及切片的更复杂的应用程序)

  2. DataFrame.xs- 从系列/数据帧中提取特定横截面。

  3. DataFrame.query- 动态指定切片和/或过滤操作(即,作为动态评估的表达式。比其他场景更适用于某些场景。另请参阅文档的这一部分以查询多索引。

  4. 使用生成的掩码进行布尔索引MultiIndex.get_level_values(通常与 结合使用Index.isin,尤其是在使用多个值进行过滤时)。这在某些情况下也非常有用。

It will be beneficial to look at the various slicing and filtering problems in terms of the Four Idioms to gain a better understanding what can be applied to a given situation. It is very important to understand that not all of the idioms will work equally well (if at all) in every circumstance. If an idiom has not been listed as a potential solution to a problem below, that means that idiom cannot be applied to that problem effectively.

根据四种习语来查看各种切片和过滤问题,以更好地了解可以应用于给定情况的内容,这将是有益的。了解并非所有习语在每种情况下都同样有效(如果有的话),了解这一点非常重要。如果某个习语没有被列为以下问题的潜在解决方案,则意味着该习语不能有效地应用于该问题。



Question 1

How do I select rows having "a" in level "one"?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

问题 1

如何选择级别“一”中具有“a”的行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

You can use loc, as a general purpose solution applicable to most situations:

您可以使用loc, 作为适用于大多数情况的通用解决方案:

df.loc[['a']]

At this point, if you get

此时,如果你得到

TypeError: Expected tuple, got str

That means you're using an older version of pandas. Consider upgrading! Otherwise, use df.loc[('a', slice(None)), :].

这意味着您使用的是旧版本的熊猫。考虑升级!否则,使用df.loc[('a', slice(None)), :].

Alternatively, you can use xshere, since we are extracting a single cross section. Note the levelsand axisarguments (reasonable defaults can be assumed here).

或者,您可以xs在此处使用,因为我们正在提取单个横截面。注意levelsaxis参数(这里可以假设合理的默认值)。

df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)

Here, the drop_level=Falseargument is needed to prevent xsfrom dropping level "one" in the result (the level we sliced on).

在这里,drop_level=False需要参数来防止xs在结果中删除级别“一”(我们切片的级别)。

Yet another option here is using query:

这里的另一个选择是使用query

df.query("one == 'a'")

If the index did not have a name, you would need to change your query string to be "ilevel_0 == 'a'".

如果索引没有名称,则需要将查询字符串更改为"ilevel_0 == 'a'".

Finally, using get_level_values:

最后,使用get_level_values

df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']

Additionally, how would I be able to drop level "one" in the output?

     col
two     
t      0
u      1
v      2
w      3

此外,我如何才能在输出中降低“一”级?

     col
two     
t      0
u      1
v      2
w      3

This can be easilydone using either

这可以很容易地使用

df.loc['a'] # Notice the single string argument instead the list.

Or,

或者,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')

Notice that we can omit the drop_levelargument (it is assumed to be Trueby default).

请注意,我们可以省略drop_level参数(True默认情况下假定为)。

Note
You may notice that a filtered DataFrame may still have all the levels, even if they do not show when printing the DataFrame out. For example,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

You can get rid of these levels using MultiIndex.remove_unused_levels:

v.index = v.index.remove_unused_levels()

print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

注意
您可能会注意到过滤后的 DataFrame 可能仍然具有所有级别,即使它们在打印 DataFrame 时没有显示。例如,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

您可以使用以下方法摆脱这些级别MultiIndex.remove_unused_levels

v.index = v.index.remove_unused_levels()

print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])


Question 1b

How do I slice all rows with value "t" on level "two"?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

问题 1b

如何在级别“二”上切片所有值为“t”的行?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Intuitively, you would want something involving slice():

直觉上,你会想要一些涉及slice()

df.loc[(slice(None), 't'), :]

It Just Works!? But it is clunky. We can facilitate a more natural slicing syntax using the pd.IndexSliceAPI here.

它只是有效!?但它很笨重。我们可以使用pd.IndexSlice此处的API促进更自然的切片语法。

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]

This is much, much cleaner.

这要干净得多。

Note
Why is the trailing slice :across the columns required? This is because, loccan be used to select and slice along both axes (axis=0or axis=1). Without explicitly making it clear which axis the slicing is to be done on, the operation becomes ambiguous. See the big red box in the documentation on slicing.

If you want to remove any shade of ambiguity, locaccepts an axisparameter:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

Without the axisparameter (i.e., just by doing df.loc[pd.IndexSlice[:, 't']]), slicing is assumed to be on the columns, and a KeyErrorwill be raised in this circumstance.

This is documented in slicers. For the purpose of this post, however, we will explicitly specify all axes.

注意
为什么:需要跨列的尾随切片?这是因为,loc可用于沿两个轴(axis=0axis=1)进行选择和切片。如果没有明确说明要在哪个轴上进行切片,操作就会变得模棱两可。请参阅有关切片文档中的大红框。

如果要消除任何歧义,请loc接受一个axis参数:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

如果没有axis参数(即,只是通过执行df.loc[pd.IndexSlice[:, 't']]),则假定切片在列上,并且KeyError在这种情况下将引发a 。

这在slicers 中有记录。但是,出于本文的目的,我们将明确指定所有轴。

With xs, it is

有了xs,它是

df.xs('t', axis=0, level=1, drop_level=False)

With query, it is

有了query,它是

df.query("two == 't'")
# Or, if the first level has no name, 
# df.query("ilevel_1 == 't'") 

And finally, with get_level_values, you may do

最后,用get_level_values,你可以做

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']

All to the same effect.

都是一样的效果。



Question 2

How can I select rows corresponding to items "b" and "d" in level "one"?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

问题2

如何在级别“一”中选择与项目“b”和“d”相对应的行?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Using loc, this is done in a similar fashion by specifying a list.

使用 loc,这是通过指定一个列表以类似的方式完成的。

df.loc[['b', 'd']]

To solve the above problem of selecting "b" and "d", you can also use query:

要解决上述选择“b”和“d”的问题,还可以使用query

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')

Note
Yes, the default parser is 'pandas', but it is important to highlight this syntax isn't conventionally python. The Pandas parser generates a slightly different parse tree from the expression. This is done to make some operations more intuitive to specify. For more information, please read my post on Dynamic Expression Evaluation in pandas using pd.eval().

注意
是的,默认解析器是'pandas',但重要的是要强调这个语法不是传统的 python。Pandas 解析器生成的解析树与表达式略有不同。这样做是为了让一些操作更直观地指定。有关更多信息,请阅读我关于使用 pd.eval() 在 Pandas 中进行 动态表达式评估的文章

And, with get_level_values+ Index.isin:

并且,使用get_level_values+ Index.isin

df[df.index.get_level_values("one").isin(['b', 'd'])]


Question 2b

How would I get all values corresponding to "t" and "w" in level "two"?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

问题 2b

我如何获得“二”级中“t”和“w”对应的所有值?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

With loc, this is possible onlyin conjuction with pd.IndexSlice.

有了loc,这是可能与一起选择pd.IndexSlice

df.loc[pd.IndexSlice[:, ['t', 'w']], :] 

The first colon :in pd.IndexSlice[:, ['t', 'w']]means to slice across the first level. As the depth of the level being queried increases, you will need to specify more slices, one per level being sliced across. You will not need to specify more levels beyondthe one being sliced, however.

第一结肠:pd.IndexSlice[:, ['t', 'w']]手段切片横跨第一电平。随着被查询级别的深度增加,您将需要指定更多切片,每个切片一个切片。但是,除了被切片的级别之外,您不需要指定更多级别。

With query, this is

有了query,这是

items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas') 
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')

With get_level_valuesand Index.isin(similar to above):

使用get_level_valuesIndex.isin(与上面类似):

df[df.index.get_level_values('two').isin(['t', 'w'])]


Question 3

How do I retrieve a cross section, i.e., a single row having a specific values for the index from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

问题 3

如何检索横截面,即具有特定索引值的单行来自df?具体来说,我如何检索 的横截面('c', 'u'),由

         col
one two     
c   u      9

Use locby specifying a tuple of keys:

loc通过指定一个键元组来使用:

df.loc[('c', 'u'), :]

Or,

或者,

df.loc[pd.IndexSlice[('c', 'u')]]

Note
At this point, you may run into a PerformanceWarningthat looks like this:

PerformanceWarning: indexing past lexsort depth may impact performance.

This just means that your index is not sorted. pandas depends on the index being sorted (in this case, lexicographically, since we are dealing with string values) for optimal search and retrieval. A quick fix would be to sort your DataFrame in advance using DataFrame.sort_index. This is especially desirable from a performance standpoint if you plan on doing multiple such queries in tandem:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

You can also use MultiIndex.is_lexsorted()to check whether the index is sorted or not. This function returns Trueor Falseaccordingly. You can call this function to determine whether an additional sorting step is required or not.

注意
此时,您可能会遇到PerformanceWarning如下所示:

PerformanceWarning: indexing past lexsort depth may impact performance.

这只是意味着您的索引未排序。pandas 依赖于被排序的索引(在这种情况下,按字典顺序排列,因为我们正在处理字符串值)以获得最佳搜索和检索。一个快速的解决方法是使用DataFrame.sort_index. 如果您计划同时执行多个此类查询,那么从性能的角度来看,这是特别需要的:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

您还可以使用MultiIndex.is_lexsorted()来检查索引是否已排序。此函数返回TrueFalse相应地。您可以调用此函数来确定是否需要额外的排序步骤。

With xs, this is again simply passing a single tuple as the first argument, with all other arguments set to their appropriate defaults:

使用xs,这再次简单地将单个元组作为第一个参数传递,所有其他参数设置为其适当的默认值:

df.xs(('c', 'u'))

With query, things become a bit clunky:

使用query,事情变得有点笨拙:

df.query("one == 'c' and two == 'u'")

You can see now that this is going to be relatively difficult to generalize. But is still OK for this particular problem.

现在您可以看到,这将相对难以概括。但是对于这个特定问题仍然可以。

With accesses spanning multiple levels, get_level_valuescan still be used, but is not recommended:

对于跨越多个级别的访问,get_level_values仍然可以使用,但不推荐:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]


Question 4

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

问题 4

如何选择对应于('c', 'u'), 和的两行('a', 'w')

         col
one two     
c   u      9
a   w      3

With loc, this is still as simple as:

使用loc,这仍然很简单:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

With query, you will need to dynamically generate a query string by iterating over your cross sections and levels:

使用query,您需要通过遍历横截面和级别来动态生成查询字符串:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses) 

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) 
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)

100% DO NOT RECOMMEND! But it is possible.

100% 不推荐!但这是可能的。



Question 5

How can I retrieve all rows corresponding to "a" in level "one" or "t" in level "two"?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

问题 5

如何检索与级别“一”中的“a”或级别“二”中的“t”对应的所有行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

This is actually very difficult to do with locwhile ensuring correctness andstill maintaining code clarity. df.loc[pd.IndexSlice['a', 't']]is incorrect, it is interpreted as df.loc[pd.IndexSlice[('a', 't')]](i.e., selecting a cross section). You may think of a solution with pd.concatto handle each label separately:

loc在确保正确性仍然保持代码清晰度的同时实际上很难做到。df.loc[pd.IndexSlice['a', 't']]不正确,它被解释为df.loc[pd.IndexSlice[('a', 't')]](即,选择一个横截面)。您可能会想到一个pd.concat单独处理每个标签的解决方案:

pd.concat([
    df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])

         col
one two     
a   t      0
    u      1
    v      2
    w      3
    t      0   # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12

But you'll notice one of the rows is duplicated. This is because that row satisfied both slicing conditions, and so appeared twice. You will instead need to do

但是您会注意到其中一行是重复的。这是因为该行满足两个切片条件,因此出现了两次。你将需要做

v = pd.concat([
        df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]

But if your DataFrame inherently contains duplicate indices (that you want), then this will not retain them. Use with extreme caution.

但是,如果您的 DataFrame 固有地包含重复的索引(您想要的),那么这将不会保留它们。 谨慎使用

With query, this is stupidly simple:

使用query,这非常简单:

df.query("one == 'a' or two == 't'")

With get_level_values, this is still simple, but not as elegant:

使用get_level_values,这仍然很简单,但没有那么优雅:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2] 


Question 6

How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and for "d", I would like to select rows with sub-level "w".

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

问题 6

如何切片特定的横截面?对于“a”和“b”,我想选择子级别为“u”和“v”的所有行,对于“d”,我想选择子级别为“w”的行。

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

This is a special case that I've added to help understand the applicability of the Four Idioms—this is one case where none of them will work effectively, since the slicing is veryspecific, and does not follow any real pattern.

这是我添加的一个特殊情况,以帮助理解四种习语的适用性——这是一种它们都不会有效工作的情况,因为切片非常具体,并且不遵循任何实际模式。

Usually, slicing problems like this will require explicitly passing a list of keys to loc. One way of doing this is with:

通常,像这样的切片问题需要显式地将键列表传递给loc. 这样做的一种方法是:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

If you want to save some typing, you will recognise that there is a pattern to slicing "a", "b" and its sublevels, so we can separate the slicing task into two portions and concatthe result:

如果你想节省一些输入,你会发现有一个模式来切片“a”、“b”及其子级别,因此我们可以将切片任务分为两部分,concat结果如下:

pd.concat([
     df.loc[(('a', 'b'), ('u', 'v')), :], 
     df.loc[('d', 'w'), :]
   ], axis=0)

Slicing specification for "a" and "b" is slightly cleaner (('a', 'b'), ('u', 'v'))because the same sub-levels being indexed are the same for each level.

“a”和“b”的切片规范稍微清晰一些,(('a', 'b'), ('u', 'v'))因为被索引的相同子级别对于每个级别都是相同的。



Question 7

How do I get all rows where values in level "two" are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

问题 7

如何获取级别“二”中的值大于 5 的所有行?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

This can be done using query,

这可以使用query

df2.query("two > 5")

And get_level_values.

并且get_level_values

df2[df2.index.get_level_values('two') > 5]

Note
Similar to this example, we can filter based on any arbitrary condition using these constructs. In general, it is useful to remember that locand xsare specifically for label-based indexing, while queryand get_level_valuesare helpful for building general conditional masks for filtering.

注意
与此示例类似,我们可以使用这些构造基于任意条件进行过滤。一般来说,记住locxs专门用于基于标签的索引很有用,而queryget_level_values有助于构建用于过滤的通用条件掩码。



Bonus Question

What if I need to slice a MultiIndexcolumn?

奖金问题

如果我需要对MultiIndex进行切片怎么办?

Actually, most solutions here are applicable to columns as well, with minor changes. Consider:

实际上,这里的大多数解决方案也适用于列,只需稍作改动。考虑:

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
        list('ABCD'), list('efgh')
], names=['one','two'])

df3 = pd.DataFrame(np.random.choice(10, (3, len(mux))), columns=mux3)
print(df3)

one  A           B           C           D         
two  e  f  g  h  e  f  g  h  e  f  g  h  e  f  g  h
0    5  0  3  3  7  9  3  5  2  4  7  6  8  8  1  6
1    7  7  8  1  5  9  8  9  4  3  0  3  5  0  2  3
2    8  1  3  3  3  7  0  1  9  9  0  4  7  3  2  7

These are the following changes you will need to make to the Four Idioms to have them working with columns.

这些是您需要对四个习语进行的以下更改,以使它们与列一起使用。

  1. To slice with loc, use

    df3.loc[:, ....] # Notice how we slice across the index with `:`. 
    

    or,

    df3.loc[:, pd.IndexSlice[...]]
    
  2. To use xsas appropriate, just pass an argument axis=1.

  3. You can access the column level values directly using df.columns.get_level_values. You will then need to do something like

    df.loc[:, {condition}] 
    

    Where {condition}represents some condition built using columns.get_level_values.

  4. To use query, your only option is to transpose, query on the index, and transpose again:

    df3.T.query(...).T
    

    Not recommended, use one of the other 3 options.

  1. 要切片loc,请使用

    df3.loc[:, ....] # Notice how we slice across the index with `:`. 
    

    或者,

    df3.loc[:, pd.IndexSlice[...]]
    
  2. xs适当使用,只需传递一个参数axis=1

  3. 您可以使用 直接访问列级值df.columns.get_level_values。然后你需要做类似的事情

    df.loc[:, {condition}] 
    

    Where{condition}表示使用columns.get_level_values.

  4. 要使用query,您唯一的选择是转置,查询索引,然后再次转置:

    df3.T.query(...).T
    

    不推荐,使用其他 3 个选项之一。

回答by r a

Recently I came across a use case where I had a 3+ level multi-index dataframe in which I couldn't make any of the solutions above produce the results I was looking for. It's quite possible that the above solutions do of course work for my use case, and I tried several, however I was unable to get them to work with the time I had available.

最近我遇到了一个用例,其中我有一个 3+ 级多索引数据框,其中我无法使上述任何解决方案产生我正在寻找的结果。很可能上述解决方案当然适用于我的用例,我尝试了几种,但是我无法在我可用的时间内让它们工作。

I am far from expert, but I stumbled across a solution that was not listed in the comprehensive answers above. I offer no guarantee that the solutions are in any way optimal.

我远非专家,但我偶然发现了上述综合答案中未列出的解决方案。我不保证这些解决方案在任何方面都是最佳的。

This is a different way to get a slightly different result to Question #6 above. (and likely other questions as well)

这是获得与上述问题 #6 略有不同的结果的不同方法。(可能还有其他问题)

Specifically I was looking for:

具体来说,我正在寻找:

  1. A way to choose two+ values from one level of the index and a single value from another level of the index, and
  2. A way to leave the index values from the previous operation in the dataframe output.
  1. 一种从索引的一个级别中选择两个+值和从另一个索引级别中选择一个值的方法,以及
  2. 一种在数据帧输出中保留上一操作的索引值的方法。

As a monkey wrench in the gears (however totally fixable):

作为齿轮中的活动扳手(但完全可以修复):

  1. The indexes were unnamed.
  1. 索引未命名。

On the toy dataframe below:

在下面的玩具数据框中:

    index = pd.MultiIndex.from_product([['a','b'],
                               ['stock1','stock2','stock3'],
                               ['price','volume','velocity']])

    df = pd.DataFrame([1,2,3,4,5,6,7,8,9,
                      10,11,12,13,14,15,16,17,18], 
                       index)

                        0
    a stock1 price      1
             volume     2
             velocity   3
      stock2 price      4
             volume     5
             velocity   6
      stock3 price      7
             volume     8
             velocity   9
    b stock1 price     10
             volume    11
             velocity  12
      stock2 price     13
             volume    14
             velocity  15
      stock3 price     16
             volume    17
             velocity  18

Using the below works, of course:

当然,使用以下作品:

    df.xs(('stock1', 'velocity'), level=(1,2))

        0
    a   3
    b  12

But I wanted a different result, so my method to get that result was:

但我想要一个不同的结果,所以我得到这个结果的方法是:

   df.iloc[df.index.isin(['stock1'], level=1) & 
           df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
    b stock1 velocity  12

And if I wanted two+ values from one level and a single (or 2+) value from another level:

如果我想要一个级别的两个+值和另一个级别的单个(或 2+)值:

    df.iloc[df.index.isin(['stock1','stock3'], level=1) & 
            df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
      stock3 velocity   9
    b stock1 velocity  12
      stock3 velocity  18

The above method is probably a bit clunky, however I found it filled my needs and as a bonus was easier for me to understand and read.

上面的方法可能有点笨拙,但是我发现它满足了我的需求,而且对我来说更容易理解和阅读。