Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/18470323/

Time: 2020-09-13 21:06:45  Source: igfitidea

Selecting columns from pandas MultiIndex

Tags: python, pandas, multi-index, hierarchical

Asked by metakermit

I have a DataFrame with MultiIndex columns that looks like this:

# sample data
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data

[Image: sample data]

What is the proper, simple way of selecting only specific columns (e.g. ['a', 'c'], not a range) from the second level?

Currently I am doing it like this:

import itertools
tuples = list(itertools.product(['one', 'two'], ['a', 'c']))
new_index = pd.MultiIndex.from_tuples(tuples)
print(new_index)
# reindex_axis was removed in pandas 1.0; reindex(columns=...) does the same job
data.reindex(columns=new_index)

[Image: expected result]

It doesn't feel like a good solution, however, because I have to bust out itertools, build another MultiIndex by hand and then reindex (and my actual code is even messier, since the column lists aren't so simple to fetch). I am pretty sure there has to be some ix or xs way of doing this, but everything I tried resulted in errors.

Accepted answer by DSM

It's not great, but maybe:

>>> data
        one                           two                    
          a         b         c         a         b         c
0 -0.927134 -1.204302  0.711426  0.854065 -0.608661  1.140052
1 -0.690745  0.517359 -0.631856  0.178464 -0.312543 -0.418541
2  1.086432  0.194193  0.808235 -0.418109  1.055057  1.886883
3 -0.373822 -0.012812  1.329105  1.774723 -2.229428 -0.617690
>>> data.loc[:,data.columns.get_level_values(1).isin({"a", "c"})]
        one                 two          
          a         c         a         c
0 -0.927134  0.711426  0.854065  1.140052
1 -0.690745 -0.631856  0.178464 -0.418541
2  1.086432  0.808235 -0.418109  1.886883
3 -0.373822  1.329105  1.774723 -0.617690

would work?
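
To make the mechanics of that one-liner visible, here is a minimal sketch that splits it into steps. The variable names `second_level`, `mask` and `subset` are only illustrative and do not appear in the answer.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# Labels of the second column level, one entry per column
second_level = data.columns.get_level_values(1)

# Boolean array: True where the second-level label is 'a' or 'c'
mask = second_level.isin({"a", "c"})

# A boolean mask on the column axis keeps only the True positions
subset = data.loc[:, mask]
print(subset.columns.tolist())
```

Because the mask is just a boolean array, it should be possible to build it from any condition on the level values, not only membership tests.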

Answer by Viktor Kerkez

You can use either loc or ix. I'll show an example with loc:

data.loc[:, [('one', 'a'), ('one', 'c'), ('two', 'a'), ('two', 'c')]]

When you have a MultiIndexed DataFrame and you want to select only some of the columns, you have to pass a list of tuples that match those columns. So the itertools approach was pretty much OK, but you don't have to create a new MultiIndex:

data.loc[:, list(itertools.product(['one', 'two'], ['a', 'c']))]
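
If the first-level labels are not known in advance, one possible variation is to read them from the frame itself. This is only a sketch under that assumption; `first_level`, `wanted` and `subset` are illustrative names, not part of the answer.

```python
import itertools

import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# Distinct labels that actually occur on the first column level
first_level = data.columns.get_level_values(0).unique()

# Every (first, second) pair we want, as a plain list of tuples
wanted = list(itertools.product(first_level, ['a', 'c']))

subset = data.loc[:, wanted]
print(wanted)
```

Note that this assumes every first-level label has both an 'a' and a 'c' column; a pair that does not exist in the columns would be expected to raise a KeyError.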

Answer by FooBar

I think there is a much better way (now), which is why I bother pulling this question (which was the top google result) out of the shadows:

data.select(lambda x: x[1] in ['a', 'b'], axis=1)

gives your expected output in a quick and clean one-liner:

        one                 two          
          a         b         a         b
0 -0.341326  0.374504  0.534559  0.429019
1  0.272518  0.116542 -0.085850 -0.330562
2  1.982431 -0.420668 -0.444052  1.049747
3  0.162984 -0.898307  1.762208 -0.101360

It is mostly self-explanatory; the [1] refers to the level.
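
Other answers on this page note that select has since been deprecated, and as far as I know it is no longer available in pandas 1.0 and later. A rough equivalent of the same per-column test, written with loc and a boolean list, might look like this (the names `mask` and `subset` are illustrative):

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# Iterating over MultiIndex columns yields one tuple per column,
# so label[1] is the second-level label, as in the lambda above
mask = [label[1] in ('a', 'b') for label in data.columns]

subset = data.loc[:, mask]
print(subset.columns.tolist())
```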

Answer by Marc P.

To select all columns named 'a' and 'c' at the second level of your column indexer, you can use slicers:

>>> data.loc[:, (slice(None), ('a', 'c'))]

        one                 two          
          a         c         a         c
0 -0.983172 -2.495022 -0.967064  0.124740
1  0.282661 -0.729463 -0.864767  1.716009
2  0.942445  1.276769 -0.595756 -0.973924
3  2.182908 -0.267660  0.281916 -0.587835

Here you can read more about slicers.

Answer by cs95

ix and select are deprecated!

Using pd.IndexSlice makes loc a preferable option to ix and select.



DataFrame.loc with pd.IndexSlice

# Setup
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame('x', index=range(4), columns=col)
data

  one       two      
    a  b  c   a  b  c
0   x  x  x   x  x  x
1   x  x  x   x  x  x
2   x  x  x   x  x  x
3   x  x  x   x  x  x

data.loc[:, pd.IndexSlice[:, ['a', 'c']]]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

Alternatively, you can pass an axis parameter to loc to make it explicit which axis you're indexing:

data.loc(axis=1)[pd.IndexSlice[:, ['a', 'c']]]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x


MultiIndex.get_level_values

Calling data.columns.get_level_values to filter with loc is another option:

data.loc[:, data.columns.get_level_values(1).isin(['a', 'c'])]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

This can naturally allow for filtering on any conditional expression on a single level. Here's a random example with lexicographical filtering:

data.loc[:, data.columns.get_level_values(1) > 'b']

  one two
    c   c
0   x   x
1   x   x
2   x   x
3   x   x
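
The same idea seems to extend to conditions on more than one level, by building one mask per level and combining them. A small sketch using the setup above; `level0`, `level1`, `mask` and `subset` are illustrative names:

```python
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame('x', index=range(4), columns=col)

level0 = data.columns.get_level_values(0)
level1 = data.columns.get_level_values(1)

# Element-wise AND of two boolean arrays, one per level
mask = (level0 == 'two') & level1.isin(['a', 'c'])

subset = data.loc[:, mask]
print(subset.columns.tolist())
```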


More information on slicing and filtering MultiIndexes can be found at Select rows in pandas MultiIndex DataFrame.

Answer by Guilherme Salomé

The most straightforward way is with .loc:

>>> data.loc[:, (['one', 'two'], ['a', 'b'])]


   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

Remember that [] and () have special meaning when dealing with a MultiIndex object:

(...) a tuple is interpreted as one multi-level key

(...) a list is used to specify several keys [on the same level]

(...) a tuple of lists refer to several values within a level

When we write (['one', 'two'], ['a', 'b']), the first list inside the tuple specifies all the values we want from the 1st level of the MultiIndex. The second list inside the tuple specifies all the values we want from the 2nd level of the MultiIndex.
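
To illustrate the difference between the two spellings, the following sketch contrasts a tuple of lists with a list of tuples. The data is a deterministic stand-in for the random frame used elsewhere on this page, and `cross` and `pairs` are illustrative names.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# Tuple of lists: one list of wanted labels per level,
# so every existing combination of them is selected
cross = data.loc[:, (['one', 'two'], ['a', 'b'])]

# List of tuples: each tuple is one complete column key,
# so only those exact columns are selected
pairs = data.loc[:, [('one', 'a'), ('two', 'b')]]

print(cross.columns.tolist())
print(pairs.columns.tolist())
```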

Edit 1: Another possibility is to use slice(None) to specify that we want anything from the first level (works similarly to slicing with : in lists). And then specify which columns from the second level we want.

>>> data.loc[:, (slice(None), ["a", "b"])]

   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

If the syntax slice(None) does appeal to you, then another possibility is to use pd.IndexSlice, which helps slicing frames with more elaborate indices.

>>> data.loc[:, pd.IndexSlice[:, ["a", "b"]]]

   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

When using pd.IndexSlice, we can use : as usual to slice the frame.
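
One way to see why the two spellings give the same result is to inspect what pd.IndexSlice produces: indexing it simply hands back the key, with : turned into slice(None). A short sketch, with deterministic stand-in data and the illustrative names `key`, `via_index_slice` and `via_slice_none`:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# pd.IndexSlice[...] returns the key it was indexed with
key = pd.IndexSlice[:, ['a', 'b']]
print(key)

via_index_slice = data.loc[:, key]
via_slice_none = data.loc[:, (slice(None), ['a', 'b'])]
print(via_index_slice.equals(via_slice_none))
```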

Source: MultiIndex / Advanced Indexing, How to use slice(None)

Answer by Nick P

A slightly easier, to my mind, riff on Marc P.'s answer using slice:

import numpy as np
import pandas as pd
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'], ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

data.loc[:, pd.IndexSlice[:, ['a', 'c']]]

        one                 two          
          a         c         a         c
0 -1.731008  0.718260 -1.088025 -1.489936
1 -0.681189  1.055909  1.825839  0.149438
2 -1.674623  0.769062  1.857317  0.756074
3  0.408313  1.291998  0.833145 -0.471879

As of pandas 0.21 or so, .select is deprecated in favour of .loc.
