Pandas .loc without KeyError
Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and authors, and attribute the content to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/46305796/
Asked by Alex Lenail
>>> pd.DataFrame([1], index=['1']).loc['2'] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['1','2']] # Succeeds, as in the answer below.
I'd like something that doesn't fail in either of these:
>>> pd.DataFrame([1], index=['1']).loc['2'] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']] # KeyError
Is there a function like loc which gracefully handles this, or some other way of expressing this query?
Accepted answer by Josh
Update for @AlexLenail's comment
It's a fair point that this will be slow for large lists. I did a little more digging and found that the intersection method is available for Indexes and columns. I'm not sure about the algorithmic complexity, but it's much faster empirically.
You can do something like this:
good_keys = df.index.intersection(all_keys)
df.loc[good_keys]
Or, as in your example:
df = pd.DataFrame([1], index=['1'])
df.loc[df.index.intersection(['2'])]
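The intersection approach can be wrapped in a small helper. This is a minimal sketch, not a pandas API: the name loc_existing is hypothetical, and it assumes that missing labels should be silently dropped and that a scalar key should behave like a one-element list.

```python
import pandas as pd


def loc_existing(df, keys):
    """Hypothetical helper: select only the labels in `keys` that exist in df.index."""
    # Treat a single scalar label like a one-element list
    if isinstance(keys, (str, bytes)) or not hasattr(keys, "__iter__"):
        keys = [keys]
    # Index.intersection keeps only the labels present in both
    good_keys = df.index.intersection(keys)
    return df.loc[good_keys]


df = pd.DataFrame([1], index=["1"])
print(loc_existing(df, "2"))         # empty DataFrame, no KeyError
print(loc_existing(df, ["1", "2"]))  # only the row labelled "1"
```

Both calls from the question return a DataFrame here, which may be empty, instead of raising.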
Here is a small experiment:
import numpy as np
import pandas as pd

n = 100000
# Create random values and random string indexes;
# the "bad" index list contains extra values not in the DataFrame index
rand_val = np.random.rand(n)
rand_idx = []
for x in range(n):
    rand_idx.append(str(x))
bad_idx = []
for x in range(n*2):
    bad_idx.append(str(x))
df = pd.DataFrame(rand_val, index=rand_idx)
df.head()

def get_valid_keys_list_comp():
    # Return filtered DataFrame using a list comprehension to filter keys
    vkeys = [key for key in bad_idx if key in df.index.values]
    return df.loc[vkeys]

def get_valid_keys_intersection():
    # Return filtered DataFrame using Index.intersection() to filter keys
    vkeys = df.index.intersection(bad_idx)
    return df.loc[vkeys]
%%timeit
get_valid_keys_intersection()
# 64.5 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
get_valid_keys_list_comp()
# 6.14 s ± 457 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
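The %%timeit cells above only run in IPython or Jupyter. As a rough sketch, the same comparison can be run in a plain script with the standard-library timeit module. The smaller n and the single repetition are assumptions chosen to keep the run short; absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 2000  # assumption: much smaller than the 100000 used above, to keep the run short
df = pd.DataFrame(np.random.rand(n), index=[str(x) for x in range(n)])
bad_idx = [str(x) for x in range(n * 2)]  # half of these labels are missing from df


def with_list_comp():
    # Membership test against a NumPy array scans the array for every key
    return df.loc[[key for key in bad_idx if key in df.index.values]]


def with_intersection():
    return df.loc[df.index.intersection(bad_idx)]


# Both approaches should select the same set of labels
assert set(with_list_comp().index) == set(with_intersection().index)

t_comp = timeit.timeit(with_list_comp, number=1)
t_inter = timeit.timeit(with_intersection, number=1)
print(f"list comprehension: {t_comp:.4f} s, intersection: {t_inter:.4f} s")
```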
Original answer
I'm not sure if pandas has a built-in function to handle this, but you can use a Python list comprehension to filter down to the valid index labels, like this.
Given a DataFrame df2:
         A           B    C  D    F
test   1.0  2013-01-02  1.0  3  foo
train  1.0  2013-01-02  1.0  3  foo
test   1.0  2013-01-02  1.0  3  foo
train  1.0  2013-01-02  1.0  3  foo
You can filter your index query with this:
keys = ['test', 'train', 'try', 'fake', 'broken']
valid_keys = [key for key in keys if key in df2.index.values]
df2.loc[valid_keys]
This will also work for columns if you use df2.columns instead of df2.index.values.
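For columns, the same filtering might look like the sketch below. The DataFrame contents and the requested column names are made up for illustration; note that column labels go in the second position of .loc.

```python
import pandas as pd

# Illustrative data only; not the df2 from the answer above
df2 = pd.DataFrame(
    {"A": [1.0, 1.0], "B": ["x", "y"], "C": [3, 3]},
    index=["test", "train"],
)

wanted_cols = ["A", "C", "Z"]  # "Z" does not exist in df2
valid_cols = [col for col in wanted_cols if col in df2.columns]

# Select all rows and only the columns that actually exist
print(df2.loc[:, valid_cols])
```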
Answer by jsa
I found an alternative (provided a check for df.empty is made beforehand). You could do something like this:
df[df.index == '2']  # returns either a DataFrame with the matched rows or an empty DataFrame
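The same boolean-mask idea extends to several keys with Index.isin, which is an existing pandas method; the sample data below is an assumption for illustration. A mask never raises for missing labels, and it also works when the index contains duplicates.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]}, index=["1", "1", "3"])  # note the duplicate label "1"
keys = ["1", "2"]  # "2" is not in the index

# Boolean mask: True where the row label is one of the requested keys
mask = df.index.isin(keys)
result = df[mask]
print(result)

# A key that matches nothing gives an empty DataFrame rather than a KeyError
print(df[df.index == "2"].empty)
```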
Answer by stevepastelan
Using the sample dataframe from @binjip's answer:
import numpy as np
import pandas as pd
# Create dataframe
data = {'distance': [0, 300, 600, 1000],
        'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])
keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']
Get matching records from the dataframe. NB: The dataframe index must be unique for this to work!
df.reindex(keys)
          distance  population
Alabama        0.0         4.8
Alaska       300.0         0.7
Arizona      600.0         6.4
Virginia       NaN         NaN
If you want to omit missing keys:
df.reindex(df.index.intersection(keys))
         distance  population
Alabama         0         4.8
Alaska        300         0.7
Arizona       600         6.4
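The uniqueness caveat can be seen directly: in the pandas versions I am aware of, reindex raises a ValueError when the index has duplicate labels. The sketch below uses made-up data and falls back to a boolean mask in that case; the fallback is a substitute technique, not part of the answer above.

```python
import pandas as pd

dup = pd.DataFrame({"distance": [0, 300, 600]},
                   index=["Alabama", "Alabama", "Alaska"])  # duplicate label
keys = ["Alabama", "Virginia"]

try:
    subset = dup.reindex(keys)
except ValueError as err:
    # reindex cannot align against an index with duplicate labels
    print(f"reindex failed: {err}")
    subset = dup[dup.index.isin(keys)]

print(subset)
```

Unlike reindex, the mask keeps every row whose label matches and simply ignores the missing key.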
Answer by binjip
It seems to work fine for me. I'm running Python 3.5 with pandas version 0.20.3.
import numpy as np
import pandas as pd
# Create dataframe
data = {'distance': [0, 300, 600, 1000],
        'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])
keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']
# Create a subset of the dataframe.
df.loc[keys]
          distance  population
Alabama        0.0         4.8
Alaska       300.0         0.7
Arizona      600.0         6.4
Virginia       NaN         NaN
Or if you want to exclude the NaN row:
df.loc[keys].dropna()
         distance  population
Alabama       0.0         4.8
Alaska      300.0         0.7
Arizona     600.0         6.4
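Two caveats are worth sketching here. First, passing a list with missing labels to .loc was deprecated in pandas 0.21 and raises a KeyError from pandas 1.0 onward, so the sketch below falls back to reindex. Second, dropna() also drops rows that exist but contain NaN; the NaN placed in the Alaska row is an assumption added to show this.

```python
import numpy as np
import pandas as pd

data = {"distance": [0, 300, 600, 1000],
        "population": [4.8, np.nan, 6.4, 2.9]}  # assumed: Alaska has a genuine NaN
df = pd.DataFrame(data, index=["Alabama", "Alaska", "Arizona", "Arkansas"])
keys = ["Alabama", "Alaska", "Arizona", "Virginia"]

try:
    subset = df.loc[keys]      # returned a NaN row in old pandas such as 0.20.3
except KeyError:
    subset = df.reindex(keys)  # newer pandas: reindex gives the same NaN-filled row

# dropna() removes Virginia, but also Alaska, whose NaN is real data
print(subset.dropna())

# Keeping only labels that exist in the original index preserves Alaska
print(subset[subset.index.isin(df.index)])
```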
Answer by aganatra
This page, https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike , has the solution:
In [8]: pd.DataFrame([1], index=['1']).reindex(['2'])
Out[8]:
    0
2 NaN