Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/21650809/

Date: 2020-09-13 21:40:58  Source: igfitidea

Most efficient way to exclude indexed rows in pandas dataframe

python pandas

Asked by dkapitan

I'm relatively new to Python & pandas and am struggling with (hierarchical) indexes. I've got the basics covered, but am lost with more advanced slicing and cross-sectioning.

For example, with the following dataframe

import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(9).reshape((3, 3)),
    index=pd.Index(['Ohio', 'Colorado', 'New York'], name='state'), columns=pd.Index(['one', 'two', 'three'], name='number'))

I want to select everything except the row with index 'Colorado'. For a small dataset I could do:

data.ix[['Ohio','New York']]

But if the number of unique index values is large, that's impractical. Naively, I would expect a syntax like

data.ix[['state' != 'Colorado']]

However, this only returns the first record 'Ohio' and doesn't return 'New York'. The following works, but is cumbersome:

# Build the list of index labels to keep, then select those rows by label.
keep = list(set(data.index.get_level_values(0).unique()) - set(['Colorado']))
data.loc[keep]

Surely there's a more Pythonic, less verbose way of doing this?

Answered by DSM

This is a Python issue, not a pandas one: 'state' != 'Colorado' is True, so what pandas gets is data.ix[[True]].
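
To make that concrete, here is a minimal sketch in plain Python (no pandas needed) of what the interpreter computes before any indexer sees the expression. The variable names are only for illustration.

```python
# The comparison is between two string literals, so it never touches the DataFrame.
comparison = 'state' != 'Colorado'
print(comparison)   # True, because the two strings differ

# The list handed to the indexer is therefore just [True].
indexer = ['state' != 'Colorado']
print(indexer)      # [True]
```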

You could do

>>> data.loc[data.index != "Colorado"]
number    one  two  three
state                    
Ohio        0    1      2
New York    6    7      8

[2 rows x 3 columns]

or use DataFrame.query:

>>> data.query("state != 'New York'")
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

[2 rows x 3 columns]

if you don't like the duplication of data. (Quoting the expression passed to the .query() method is one of the few ways around the fact that otherwise Python would evaluate the comparison before pandas ever saw it.)
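
Both suggestions above exclude a single label. As a rough sketch of how the same idea might extend to several labels, the block below uses Index.isin for a boolean mask and DataFrame.drop for label-based removal. The `excluded` list is a hypothetical example, not something from the question.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=pd.Index(['Ohio', 'Colorado', 'New York'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

excluded = ['Colorado', 'New York']  # hypothetical labels to leave out

# Boolean mask: keep rows whose index label is not in `excluded`.
by_mask = data.loc[~data.index.isin(excluded)]

# drop() removes rows by label and returns a new DataFrame.
by_drop = data.drop(index=excluded)

print(by_mask)
```

Note that drop() raises a KeyError if a label is missing from the index, whereas the isin mask silently ignores unknown labels, so the choice depends on which behaviour you prefer.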

Answered by Alexander McFarlane

This is a robust solution that also works with MultiIndex objects.

Single Index

excluded = ['Ohio']
indices = data.index.get_level_values('state').difference(excluded)
indx = pd.IndexSlice[indices.values]

The output

In [77]: data.loc[indx]
Out[77]:
number    one  two  three
state
Colorado    3    4      5
New York    6    7      8
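
One thing worth checking with this approach: Index.difference returns the remaining labels sorted, so the rows may come back in a different order from the original frame. The sketch below excludes 'New York' instead of 'Ohio' (my choice, purely to make the reordering visible) and compares against a boolean mask, which keeps the original order.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=pd.Index(['Ohio', 'Colorado', 'New York'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

excluded = ['New York']
indices = data.index.get_level_values('state').difference(excluded)
print(list(indices))           # ['Colorado', 'Ohio'] (sorted, not the original order)

# .loc with a list of labels returns rows in the order of that list.
by_difference = data.loc[indices.values]

# A boolean mask keeps the original row order instead.
by_mask = data.loc[~data.index.isin(excluded)]
print(list(by_mask.index))     # ['Ohio', 'Colorado']
```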

MultiIndex Extension

Here I extend to a MultiIndex example...

data = pd.DataFrame(
    np.arange(18).reshape(6, 3),
    index=pd.MultiIndex(
        levels=[['AU', 'UK'], ['Derby', 'Kensington', 'Newcastle', 'Sydney']],
        codes=[[0, 0, 0, 1, 1, 1], [0, 2, 3, 0, 1, 2]],  # this argument was named `labels` before pandas 0.24
        names=['country', 'town']),
    columns=pd.Index(['one', 'two', 'three'], name='number'))

Assume we want to exclude 'Newcastle' from both countries in this new MultiIndex:

excluded = ['Newcastle']
indices = data.index.get_level_values('town').difference(excluded)
indx = pd.IndexSlice[:, indices.values]

Which gives the expected result

In [115]: data.loc[indx, :]
Out[115]:
number              one  two  three
country town
AU      Derby         0    1      2
        Sydney        6    7      8
UK      Derby         9   10     11
        Kensington   12   13     14
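
For comparison, here is a sketch of an alternative that avoids IndexSlice altogether: build a boolean mask from one level of the MultiIndex, or pass `level` to drop(). The frame is rebuilt with MultiIndex.from_tuples purely for readability; that construction is my assumption about an equivalent index, not part of the answer above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('AU', 'Derby'), ('AU', 'Newcastle'), ('AU', 'Sydney'),
     ('UK', 'Derby'), ('UK', 'Kensington'), ('UK', 'Newcastle')],
    names=['country', 'town'],
)
data = pd.DataFrame(
    np.arange(18).reshape(6, 3),
    index=index,
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

excluded = ['Newcastle']

# True for every row whose 'town' level is one of the excluded labels.
is_excluded = data.index.get_level_values('town').isin(excluded)
by_mask = data.loc[~is_excluded]

# drop() can also target a single level of a MultiIndex.
by_drop = data.drop(index=excluded, level='town')

print(by_mask)
```

The mask version does not depend on the index being sorted, which may make it the simpler option when only one level is involved.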

Common Pitfalls

  1. Make sure that all levels of your index are sorted; you may need data.sort_index(inplace=True).
  2. Make sure you include the null slice for the columns: data.loc[indx, :]
  3. Sometimes indx = pd.IndexSlice[:, indices] is enough, but I found that I often needed indx = pd.IndexSlice[:, indices.values]
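
The first two pitfalls can be sketched as follows. The small unsorted frame is hypothetical data made up for illustration; the point is only the order of operations: sort first, then slice with an explicit column slice.

```python
import numpy as np
import pandas as pd

# Deliberately unsorted MultiIndex (hypothetical data for illustration).
index = pd.MultiIndex.from_tuples(
    [('UK', 'Newcastle'), ('AU', 'Sydney'), ('UK', 'Derby'), ('AU', 'Derby')],
    names=['country', 'town'],
)
data = pd.DataFrame({'one': np.arange(4)}, index=index)
print(data.index.is_monotonic_increasing)   # False

# Pitfall 1: sort all index levels before slicing.
data = data.sort_index()
print(data.index.is_monotonic_increasing)   # True

# Pitfall 2: pass the null slice for the columns explicitly.
towns = data.index.get_level_values('town').difference(['Newcastle'])
indx = pd.IndexSlice[:, towns.values]
result = data.loc[indx, :]
print(result)
```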