Original question: http://stackoverflow.com/questions/21738882/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Fast pandas filtering
Asked by redrubia
I want to filter a pandas DataFrame, keeping only the rows whose name column entry appears in a given list.
Here we have a DataFrame:
import pandas as pd

x = pd.DataFrame(
    [['sam', 328], ['ruby', 3213], ['jon', 121]],
    columns=['name', 'score'])
Now let's say we have a list, ['sam', 'ruby'], and we want to find all rows where the name is in the list, then sum the score.
The solution I have is as follows:
total = 0
names = ['sam', 'ruby']
for name in names:
    identified = x[x['name'] == name]
    total = total + sum(identified['score'])
However, when the DataFrame gets extremely large and the list of names gets very large too, this becomes very slow.
Is there any faster alternative?
Thanks
Answered by unutbu
Try using isin (thanks to DSM for suggesting loc over ix here):
In [78]: x = pd.DataFrame([['sam',328],['ruby',3213],['jon',121]], columns = ['name', 'score'])
In [79]: names = ['sam', 'ruby']
In [80]: x['name'].isin(names)
Out[80]: 
0     True
1     True
2    False
Name: name, dtype: bool
In [81]: x.loc[x['name'].isin(names), 'score'].sum()
Out[81]: 3541
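For readers who want to run this outside IPython, the session above can be sketched as a standalone script. The variable names mask and total are chosen here for illustration and do not appear in the original answer.

```python
import pandas as pd

# Same sample data as in the question.
x = pd.DataFrame(
    [['sam', 328], ['ruby', 3213], ['jon', 121]],
    columns=['name', 'score'],
)
names = ['sam', 'ruby']

# Boolean mask: True where the name is one of the wanted names.
mask = x['name'].isin(names)

# Select the 'score' column for the matching rows and sum it in one step.
total = x.loc[mask, 'score'].sum()
print(total)
```

The mask is built in a single vectorized pass over the column, which replaces the per-name Python loop from the question.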
CT Zhu suggests a faster alternative using np.in1d:
In [105]: y = pd.concat([x]*1000)
In [109]: %timeit y.loc[y['name'].isin(names), 'score'].sum()
1000 loops, best of 3: 413 μs per loop
In [110]: %timeit y.loc[np.in1d(y['name'], names), 'score'].sum()
1000 loops, best of 3: 335 μs per loop
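Note that np.in1d is deprecated as of NumPy 2.0 in favor of np.isin. The following is a minimal sketch of the same comparison using np.isin, assuming a reasonably recent NumPy and pandas. It only checks that both approaches select the same rows. It does not reproduce the timings above, which will vary with library versions and data.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(
    [['sam', 328], ['ruby', 3213], ['jon', 121]],
    columns=['name', 'score'],
)
names = ['sam', 'ruby']

# Repeat the three rows 1000 times, as in the timing session.
y = pd.concat([x] * 1000)

# pandas membership test, converted to a plain NumPy boolean array.
mask_pandas = y['name'].isin(names).to_numpy()

# NumPy membership test on the underlying array.
mask_numpy = np.isin(y['name'].to_numpy(), names)

total_pandas = y.loc[mask_pandas, 'score'].sum()
total_numpy = y.loc[mask_numpy, 'score'].sum()
print(total_pandas, total_numpy)
```

To compare speed on your own data, wrap each expression in %timeit (in IPython) or the timeit module rather than relying on the figures quoted here.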
Answered by Dhwani Katagade
If I need to search on a field, I have noticed that it helps immensely to change the index of the DataFrame to the search field. For one of my search and lookup requirements I got a performance improvement of around 500%.
So in your case the following could be used to search and filter by name.
import pandas as pd

df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                  columns=['name', 'score'])
names = ['sam', 'ruby']
# Use the search field as the index, then filter on the index.
df_searchable = df.set_index('name')
df_searchable[df_searchable.index.isin(names)]
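The snippet above stops at the filtered frame. As a sketch of how it might be carried through to the sum asked for in the question, the following adds the final step. The names matches, total, and total_by_label are illustrative, and the label-based variant is an addition not found in the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    [['sam', 328], ['ruby', 3213], ['jon', 121]],
    columns=['name', 'score'],
)
names = ['sam', 'ruby']

# Make 'name' the index so lookups go through the index.
df_searchable = df.set_index('name')

# Filter on the index, then sum the scores of the matching rows.
matches = df_searchable[df_searchable.index.isin(names)]
total = matches['score'].sum()

# Alternative: select by label directly. This assumes every name is
# present in the index; a missing name raises KeyError.
total_by_label = df_searchable.loc[names, 'score'].sum()
print(total, total_by_label)
```

The isin form silently ignores names that are absent from the index, while the label-based form fails loudly, so the choice depends on whether missing names should be treated as an error. Whether either variant gives a speedup like the one reported above depends on the data and should be measured.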

