Pandas .loc without KeyError

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/46305796/

Time: 2020-09-14 04:29:05  Source: igfitidea

pandas

Asked by Alex Lenail

>>> pd.DataFrame([1], index=['1']).loc['2']  # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']]  # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['1','2']]  # Succeeds, as in the answer below. 

I'd like something that doesn't fail in either of these:

>>> pd.DataFrame([1], index=['1']).loc['2']  # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']]  # KeyError

Is there a function like loc which gracefully handles this, or some other way of expressing this query?

Accepted answer by Josh

Update for @AlexLenail's comment
It's a fair point that the list comprehension will be slow for large lists. I did a bit more digging and found that the intersection method is available for both indexes and columns. I'm not sure about its algorithmic complexity, but empirically it is much faster.

You can do something like this:

good_keys = df.index.intersection(all_keys)
df.loc[good_keys]

Or, using your example:

df = pd.DataFrame([1], index=['1'])
df.loc[df.index.intersection(['2'])]
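
As a minimal sketch, the intersection idea could be wrapped in a small helper so that both the scalar lookup and the list lookup from the question return an empty result instead of raising. The name safe_loc is hypothetical; it is not a pandas function.

```python
import pandas as pd


def safe_loc(df, keys):
    """Hypothetical helper: like df.loc[keys], but skips labels that are missing."""
    # Treat a single label (e.g. '2') as a one-element list.
    if isinstance(keys, str) or not hasattr(keys, "__iter__"):
        keys = [keys]
    # Keep only the labels that actually exist in the index.
    good_keys = df.index.intersection(keys)
    return df.loc[good_keys]


df = pd.DataFrame([1], index=["1"])

print(safe_loc(df, "2"))         # empty DataFrame, no KeyError
print(safe_loc(df, ["2"]))       # empty DataFrame, no KeyError
print(safe_loc(df, ["1", "2"]))  # only the row labelled '1'
```

Note that this sketch always returns a DataFrame, even for a single label, whereas df.loc['1'] would return a Series.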

Here is a small experiment:

import numpy as np
import pandas as pd

n = 100000

# Create random values and string index labels.
# bad_idx deliberately contains extra labels that are not in the DataFrame index.
rand_val = np.random.rand(n)
rand_idx = [str(x) for x in range(n)]
bad_idx = [str(x) for x in range(n * 2)]

df = pd.DataFrame(rand_val, index=rand_idx)
df.head()

def get_valid_keys_list_comp():
    # Return filtered DataFrame using list comprehension to filter keys
    vkeys = [key for key in bad_idx if key in df.index.values]
    return df.loc[vkeys]

def get_valid_keys_intersection():
    # Return filtered DataFrame using Index.intersection() to filter keys
    vkeys = df.index.intersection(bad_idx)
    return df.loc[vkeys]

%%timeit 
get_valid_keys_intersection()
# 64.5 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit 
get_valid_keys_list_comp()
# 6.14 s ± 457 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
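
One likely reason for the gap: `key in df.index.values` scans the underlying NumPy array from the start for every key, while a membership test against the Index itself uses a hash-based lookup. The sketch below compares the two list-comprehension variants with the standard-library timeit module (so it also runs outside IPython). The function names and the smaller n are choices made for this illustration, and no timings are claimed.

```python
import timeit

import numpy as np
import pandas as pd

n = 2000  # kept small so the slow variant finishes quickly
df = pd.DataFrame(np.random.rand(n), index=[str(x) for x in range(n)])
keys = [str(x) for x in range(n * 2)]  # half of these labels are missing


def filter_with_values():
    # Linear scan of the underlying array for every key.
    return [key for key in keys if key in df.index.values]


def filter_with_index():
    # Membership test against the Index itself (hash-based lookup).
    return [key for key in keys if key in df.index]


# Both variants select the same labels, in the order given by `keys`.
assert filter_with_values() == filter_with_index()

print("index.values:", timeit.timeit(filter_with_values, number=3))
print("index:       ", timeit.timeit(filter_with_index, number=3))
```

Unlike intersection, the list comprehension keeps the labels in the order in which they were requested, which may matter for some callers.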

Original answer

I'm not sure whether pandas has a built-in function for this, but you can use a Python list comprehension to keep only the valid index labels, like this.

Given a DataFrame df2:

         A          B    C  D    F
test   1.0 2013-01-02  1.0  3  foo
train  1.0 2013-01-02  1.0  3  foo
test   1.0 2013-01-02  1.0  3  foo
train  1.0 2013-01-02  1.0  3  foo

You can filter your index query like this:

keys = ['test', 'train', 'try', 'fake', 'broken']
valid_keys = [key for key in keys if key in df2.index.values]
df2.loc[valid_keys]

This will also work for columns if you use df2.columns instead of df2.index.values.
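
A short sketch of the column variant, using a small illustrative frame (its values and the wanted list are made up for this example):

```python
import pandas as pd

df2 = pd.DataFrame(
    {"A": [1.0, 1.0], "C": [1.0, 1.0], "F": ["foo", "foo"]},
    index=["test", "train"],
)

wanted = ["A", "F", "Z"]  # 'Z' is not a column of df2

# List-comprehension variant from the answer, applied to columns.
valid_cols = [col for col in wanted if col in df2.columns]
subset = df2[valid_cols]

# The same selection using Index.intersection on the columns.
subset_via_intersection = df2[df2.columns.intersection(wanted)]

print(subset)
```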

Answer by jsa

I found an alternative (provided a check for df.empty is made beforehand). You could do something like this:

df[df.index=='2']  # returns either a DataFrame with the matching rows or an empty DataFrame
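
The boolean-mask idea extends to several keys through Index.isin. A minimal sketch, with made-up data that deliberately contains a duplicate label:

```python
import pandas as pd

# Illustrative frame; note the duplicate label '1'.
df = pd.DataFrame({"value": [1, 2, 3]}, index=["1", "1", "3"])

single = df[df.index == "2"]             # no match: empty DataFrame
several = df[df.index.isin(["1", "2"])]  # both rows labelled '1'; '2' is ignored

print(single)
print(several)
```

Because the mask is evaluated row by row, this form does not require the index to be unique.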

Answer by stevepastelan

Using the sample dataframe from @binjip's answer:

import numpy as np
import pandas as pd

# Create dataframe
data = {'distance': [0, 300, 600, 1000],
        'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])

keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']

Get matching records from the dataframe. NB: The dataframe index must be unique for this to work!

df.reindex(keys)
          distance  population
Alabama        0.0         4.8
Alaska       300.0         0.7
Arizona      600.0         6.4
Virginia       NaN         NaN

If you want to omit missing keys:

df.reindex(df.index.intersection(keys))
         distance  population
Alabama         0         4.8
Alaska        300         0.7
Arizona       600         6.4
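
To illustrate the uniqueness caveat, the sketch below tries reindex on a frame with a duplicated label and falls back to a boolean mask. The frame dup and the flag reindex_failed are names invented for this example.

```python
import pandas as pd

# Illustrative frame with a duplicated index label.
dup = pd.DataFrame(
    {"distance": [0, 300, 600]},
    index=["Alabama", "Alabama", "Alaska"],
)
keys = ["Alabama", "Virginia"]

try:
    dup.reindex(keys)
    reindex_failed = False
except ValueError:
    # pandas refuses to reindex an axis that has duplicate labels.
    reindex_failed = True

# A boolean mask does not need a unique index.
matches = dup[dup.index.isin(keys)]

print(reindex_failed)
print(matches)
```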

Answer by binjip

It seems to work fine for me. I'm running Python 3.5 with pandas version 0.20.3.

import numpy as np
import pandas as pd

# Create dataframe
data = {'distance': [0, 300, 600, 1000],
        'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])

keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']

# Create a subset of the dataframe.
df.loc[keys]
          distance  population
Alabama        0.0         4.8
Alaska       300.0         0.7
Arizona      600.0         6.4
Virginia       NaN         NaN

Or if you want to exclude the NaN row:

df.loc[keys].dropna()
          distance  population
Alabama        0.0         4.8
Alaska       300.0         0.7
Arizona      600.0         6.4
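
One caveat with the dropna approach: it also removes rows that exist but legitimately contain NaN. The sketch below shows the difference on made-up data. It uses reindex rather than df.loc[keys] because, as far as I know, pandas versions from 1.0 onward raise KeyError when a list passed to .loc contains missing labels.

```python
import numpy as np
import pandas as pd

# Illustrative data: Alaska exists but has a missing population value.
df = pd.DataFrame(
    {"distance": [0, 300, 600], "population": [4.8, np.nan, 6.4]},
    index=["Alabama", "Alaska", "Arizona"],
)
keys = ["Alabama", "Alaska", "Virginia"]

# dropna removes the placeholder row for Virginia, but also the real Alaska row.
with_dropna = df.reindex(keys).dropna()

# Filtering the keys first keeps every row that exists, NaN or not.
with_intersection = df.loc[df.index.intersection(keys)]

print(with_dropna)
print(with_intersection)
```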

Answer by aganatra

This page https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike has the solution:

In [8]: pd.DataFrame([1], index=['1']).reindex(['2'])
Out[8]:
     0
2  NaN