Pandas .loc without KeyError

Question

Asked by Alex Lenail

>>> pd.DataFrame([1], index=['1']).loc['2']  # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']]  # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['1','2']]  # Succeeds, as in the answer below.

I'd like something that doesn't fail in either of

>>> pd.DataFrame([1], index=['1']).loc['2']  # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']]  # KeyError

Is there a function like loc which gracefully handles this, or some other way of expressing this query?

Answer 1

Accepted answer by Josh

Update for @AlexLenail's comment
It's a fair point that this will be slow for large lists. I did a little more digging and found that the intersection method is available on both the index and the columns. I'm not sure about the algorithmic complexity, but empirically it's much faster.

You can do something like this.

good_keys = df.index.intersection(all_keys)
df.loc[good_keys]

Or, using your example:

df = pd.DataFrame([1], index=['1'])
df.loc[df.index.intersection(['2'])]
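To make the behaviour concrete, here is a minimal, self-contained sketch of the same idea. The `wanted` list is a made-up request that mixes one label that exists with one that does not; it is not from the original question.

```python
import pandas as pd

df = pd.DataFrame([1], index=['1'])

# Hypothetical request: '1' is in the index, '2' is not.
wanted = ['1', '2']

# Index.intersection keeps only the labels that appear in both.
present = df.index.intersection(wanted)
subset = df.loc[present]

# With no overlap at all, the result is an empty DataFrame rather than a KeyError.
empty = df.loc[df.index.intersection(['2'])]

print(subset)
print(empty.empty)
```

Note that this only covers the list form of the query. The scalar form, `.loc['2']`, still needs a list wrapped around the key to go through `intersection`.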

Here is a little experiment:

import numpy as np
import pandas as pd

n = 100000

# Create random values with string labels as the index.
# bad_idx holds twice as many labels, so half of them are not in the DataFrame index.
rand_val = np.random.rand(n)
rand_idx = [str(x) for x in range(n)]
bad_idx = [str(x) for x in range(n * 2)]

df = pd.DataFrame(rand_val, index=rand_idx)
df.head()

def get_valid_keys_list_comp():
    # Return filtered DataFrame using list comprehension to filter keys
    vkeys = [key for key in bad_idx if key in df.index.values]
    return df.loc[vkeys]

def get_valid_keys_intersection():
    # Return filtered DataFrame using Index.intersection() to filter keys
    vkeys = df.index.intersection(bad_idx)
    return df.loc[vkeys]

%%timeit 
get_valid_keys_intersection()
# 64.5 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit 
get_valid_keys_list_comp()
# 6.14 s ± 457 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
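One likely reason the list comprehension is so slow is that `key in df.index.values` scans a NumPy array from the start for every key. As a sketch of a middle ground (not part of the original experiment, and not timed here), the comprehension can test membership against a plain Python `set` built once from the index. The names `requested` and `index_set` are illustrative, and `n` is kept small so the example runs quickly.

```python
import numpy as np
import pandas as pd

n = 1000
df = pd.DataFrame(np.random.rand(n), index=[str(x) for x in range(n)])

# Twice as many labels as the DataFrame has, so half of them are missing.
requested = [str(x) for x in range(n * 2)]

# Build the set once; each membership test is then a hash lookup.
index_set = set(df.index)
vkeys = [key for key in requested if key in index_set]
result = df.loc[vkeys]

print(len(result))
```

Unlike `intersection`, this keeps the keys in the order they were requested.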

Original answer

I'm not sure if pandas has a built-in function to handle this, but you can use a Python list comprehension to filter the keys down to valid index labels, like this.

Given a DataFrame df2

         A          B    C  D    F
test   1.0 2013-01-02  1.0  3  foo
train  1.0 2013-01-02  1.0  3  foo
test   1.0 2013-01-02  1.0  3  foo
train  1.0 2013-01-02  1.0  3  foo

You can filter your index query with this

keys = ['test', 'train', 'try', 'fake', 'broken']
valid_keys = [key for key in keys if key in df2.index.values]
df2.loc[valid_keys]

This will also work for columns if you use df2.columns instead of df2.index.values
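A short sketch of the column variant, assuming a cut-down version of df2 with only three of its columns and a made-up `cols` list that contains one name the frame does not have:

```python
import pandas as pd

# Cut-down stand-in for df2 (illustrative values only).
df2 = pd.DataFrame({'A': [1.0, 1.0], 'D': [3, 3], 'F': ['foo', 'foo']},
                   index=['test', 'train'])

cols = ['A', 'F', 'missing']

# List comprehension against the column labels.
valid_cols = [c for c in cols if c in df2.columns]
by_comprehension = df2[valid_cols]

# The same selection using Index.intersection on the columns.
by_intersection = df2[df2.columns.intersection(cols)]

print(by_comprehension)
```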

Answer 2

Answered by jsa

I found an alternative (provided a check for df.empty is made beforehand). You could do something like this

df[df.index == '2']  # returns either a DataFrame with the matching rows or an empty DataFrame
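The boolean mask above handles a single key. For a list of keys, one possible extension (an assumption on my part, not something this answer proposes) is `Index.isin`, which builds the same kind of mask for several labels at once. The two-row frame here is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([1, 2], index=['1', '3'])
keys = ['1', '2']

# Boolean mask: True where the index label is one of the requested keys.
mask = df.index.isin(keys)
matched = df[mask]

# No matching labels gives an empty DataFrame, not a KeyError.
none_matched = df[df.index.isin(['2'])]

print(matched)
print(none_matched.empty)
```

Rows come back in the DataFrame's own order rather than in the order of `keys`, and the mask also works when the index contains duplicate labels.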

Answer 3

Answered by stevepastelan

Using the sample dataframe from @binjip's answer:

import numpy as np
import pandas as pd

# Create dataframe
data = {'distance': [0, 300, 600, 1000],
        'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])

keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']

Get matching records from the dataframe. NB: The dataframe index must be unique for this to work!

df.reindex(keys)

          distance  population
Alabama        0.0         4.8
Alaska       300.0         0.7
Arizona      600.0         6.4
Virginia       NaN         NaN

If you want to omit missing keys:

df.reindex(df.index.intersection(keys))

         distance  population
Alabama         0         4.8
Alaska        300         0.7
Arizona       600         6.4
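The uniqueness caveat can be seen with a small sketch. The frame `dup` below is made up, with 'Alabama' deliberately repeated in the index. The `isin` fallback is one possible workaround and is my assumption, not part of this answer.

```python
import pandas as pd

# Hypothetical frame whose index repeats a label.
dup = pd.DataFrame({'population': [4.8, 0.7, 6.4]},
                   index=['Alabama', 'Alabama', 'Alaska'])
keys = ['Alabama', 'Virginia']

# reindex refuses to work on an index with duplicate labels.
try:
    dup.reindex(keys)
    reindex_error = None
except ValueError as exc:
    reindex_error = exc

# A boolean mask does not need a unique index.
fallback = dup[dup.index.isin(keys)]

print(type(reindex_error).__name__)
print(fallback)
```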

Answer 4

Answered by binjip

It seems to work fine for me. I'm running Python 3.5 with pandas version 0.20.3.

import numpy as np
import pandas as pd

# Create dataframe
data = {'distance': [0, 300, 600, 1000],
        'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])

keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']

# Create a subset of the dataframe.
df.loc[keys]
          distance  population
Alabama        0.0         4.8
Alaska       300.0         0.7
Arizona      600.0         6.4
Virginia       NaN         NaN

Or if you want to exclude the NaN row:

df.loc[keys].dropna()
          distance  population
Alabama        0.0         4.8
Alaska       300.0         0.7
Arizona      600.0         6.4
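Two caveats are worth sketching here. First, this behaviour depends on the pandas version: from pandas 1.0 onwards, passing `.loc` a list that contains any missing label raises a KeyError, so the snippet below falls back to `reindex` in that case. Second, `dropna()` cannot tell a row that was missing from the index apart from a row that simply has a NaN in its data. To show that, Alaska's population is replaced with NaN below; that value is made up for illustration.

```python
import numpy as np
import pandas as pd

data = {'distance': [0, 300, 600, 1000],
        'population': [4.8, np.nan, 6.4, 2.9]}  # NaN for Alaska is made up
df = pd.DataFrame(data, index=['Alabama', 'Alaska', 'Arizona', 'Arkansas'])

keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']

# Older pandas returns NaN rows for missing labels; newer pandas raises KeyError.
try:
    subset = df.loc[keys]
except KeyError:
    subset = df.reindex(keys)

# dropna() removes Virginia, but also Alaska, which is a real row.
dropped = subset.dropna()

# Filtering the keys first keeps every row that actually exists.
kept = df.loc[df.index.intersection(keys)]

print(dropped)
print(kept)
```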

Answer 5

Answered by aganatra

This page, https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike, has the solution:

In [8]: pd.DataFrame([1], index=['1']).reindex(['2'])
Out[8]:
     0
2  NaN
