Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/13888468/

Get unique values from index column in MultiIndex

python pandas

Asked by seth

I know that I can get the unique values of an index level of a DataFrame by resetting the index, but is there a way to avoid this step and get the unique values directly?

Given I have:

        C
 A B     
 0 one  3
 1 one  2
 2 two  1

I can do:

df = df.reset_index()
uniq_b = df.B.unique()
df = df.set_index(['A','B'])
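
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame shown above and applies the same reset-index idea. The construction via `MultiIndex.from_tuples` is an assumption, since the question only shows the printed frame. Because `reset_index()` returns a new frame, `df` keeps its MultiIndex and does not need to be re-indexed afterwards.

```python
import pandas as pd

# Rebuild the frame from the question (how it was originally built is assumed)
idx = pd.MultiIndex.from_tuples([(0, 'one'), (1, 'one'), (2, 'two')],
                                names=['A', 'B'])
df = pd.DataFrame({'C': [3, 2, 1]}, index=idx)

# reset_index() works on a copy, so df itself is left untouched
uniq_b = df.reset_index()['B'].unique()
print(list(uniq_b))
```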

Is there a built-in way in pandas to do this?

Accepted answer by Andy Hayden

One way is to use index.levels:

In [11]: df
Out[11]: 
       C
A B     
0 one  3
1 one  2
2 two  1

In [12]: df.index.levels[1]
Out[12]: Index([one, two], dtype=object)
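
As a small variation on the same idea, the level can be looked up by name rather than by the hard-coded position `1`. This is only a sketch: the frame is rebuilt under the assumption that it matches the one in the question, and `pos` and `levels_b` are illustrative names.

```python
import pandas as pd

# Same frame as in the question (construction is assumed)
idx = pd.MultiIndex.from_tuples([(0, 'one'), (1, 'one'), (2, 'two')],
                                names=['A', 'B'])
df = pd.DataFrame({'C': [3, 2, 1]}, index=idx)

# Find the position of level 'B' from the index names, then read its levels
pos = df.index.names.index('B')
levels_b = df.index.levels[pos]
print(list(levels_b))
```

Note that `levels` holds the labels defined for the level, which are not necessarily the labels in order of appearance.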

Answer by 8one6

Andy Hayden's answer (index.levels[blah]) is great for some scenarios, but can lead to odd behavior in others. My understanding is that Pandas goes to great lengths to "reuse" indices when possible to avoid having the indices of lots of similarly-indexed DataFrames taking up space in memory. As a result, I've found the following annoying behavior:

import pandas as pd
import numpy as np

np.random.seed(0)

idx = pd.MultiIndex.from_product([['John', 'Josh', 'Alex'], list('abcde')], 
                                 names=['Person', 'Letter'])
large = pd.DataFrame(data=np.random.randn(15, 2), 
                     index=idx, 
                     columns=['one', 'two'])
# Keep only the rows whose 'Person' label starts with 'Jo'
small = large.loc[[d.startswith('Jo') for d in large.index.get_level_values('Person')]]

print(small.index.levels[0])
print(large.index.levels[0])

This outputs:

Index([u'Alex', u'John', u'Josh'], dtype='object')
Index([u'Alex', u'John', u'Josh'], dtype='object')

rather than the expected

Index([u'John', u'Josh'], dtype='object')
Index([u'Alex', u'John', u'Josh'], dtype='object')

As one person pointed out on the other thread, one idiom that seems very natural and works properly would be:

small.index.get_level_values('Person').unique()
large.index.get_level_values('Person').unique()
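
The difference between the two approaches can be checked side by side with a sketch like the one below. The last two lines use `MultiIndex.unique(level=...)` and `MultiIndex.remove_unused_levels()`, which are not mentioned in the answers; they are my additions and assume a reasonably recent pandas version. The variable names (`stale`, `seen`, `via_unique`, `pruned`) and the simplified data are illustrative only.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['John', 'Josh', 'Alex'], list('abcde')],
                                 names=['Person', 'Letter'])
large = pd.DataFrame({'one': range(15)}, index=idx)

# Boolean mask: keep only the rows whose 'Person' label starts with 'Jo'
mask = [p.startswith('Jo') for p in large.index.get_level_values('Person')]
small = large.loc[mask]

# levels still carries the label that was filtered out
stale = list(small.index.levels[0])

# get_level_values(...).unique() only reports labels actually present
seen = list(small.index.get_level_values('Person').unique())

# Assumed alternatives, not taken from the answers above
via_unique = list(small.index.unique(level='Person'))
pruned = list(small.index.remove_unused_levels().levels[0])

print(stale, seen, via_unique, pruned)
```

If the stale entries in `levels` are the only concern, pruning the unused levels once after filtering may be enough; otherwise `get_level_values(...).unique()` avoids the issue without touching the index.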

I hope this helps someone else dodge the super-unexpected behavior that I ran into.
