pandas: Fast pandas filtering

Question

Asked by redrubia

I want to filter a pandas DataFrame, keeping only the rows whose name column entry appears in a given list.

Here we have a DataFrame:

import pandas as pd

x = pd.DataFrame(
    [['sam', 328], ['ruby', 3213], ['jon', 121]],
    columns=['name', 'score'])

Now let's say we have a list, ['sam', 'ruby'], and we want to find all rows where the name is in the list, then sum the score.

The solution I have is as follows:

total = 0
names = ['sam', 'ruby']
for name in names:
    identified = x[x['name'] == name]
    total = total + sum(identified['score'])

However, when the DataFrame and the list of names both get very large, this becomes very slow.

Is there any faster alternative?

Thanks

Answer 1

Answered by unutbu

Try using isin (thanks to DSM for suggesting loc over ix here):

In [78]: x = pd.DataFrame([['sam',328],['ruby',3213],['jon',121]], columns = ['name', 'score'])

In [79]: names = ['sam', 'ruby']

In [80]: x['name'].isin(names)
Out[80]: 
0     True
1     True
2    False
Name: name, dtype: bool

In [81]: x.loc[x['name'].isin(names), 'score'].sum()
Out[81]: 3541
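
For reference, the session above can be packaged as a standalone script. This is only a sketch: the helper name sum_scores_for_names is hypothetical (it does not appear in the answer), and the loop from the question is repeated purely to check that both approaches agree on the sample data.

```python
import pandas as pd


def sum_scores_for_names(df, names):
    """Sum 'score' over rows whose 'name' is in names (hypothetical helper)."""
    mask = df['name'].isin(names)       # boolean Series, one entry per row
    return df.loc[mask, 'score'].sum()  # select matching rows, then sum


x = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                 columns=['name', 'score'])
names = ['sam', 'ruby']

# Vectorized version: a single pass over the 'name' column.
total = sum_scores_for_names(x, names)

# Loop from the question: one full scan of the frame per name.
loop_total = 0
for name in names:
    loop_total += x.loc[x['name'] == name, 'score'].sum()

print(total, loop_total)
```

The loop scans the whole frame once per name, while isin builds a single mask for all names at once, which is presumably where the difference comes from as both inputs grow.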

CT Zhu suggests a faster alternative using np.in1d:

In [105]: y = pd.concat([x]*1000)
In [109]: %timeit y.loc[y['name'].isin(names), 'score'].sum()
1000 loops, best of 3: 413 μs per loop

In [110]: %timeit y.loc[np.in1d(y['name'], names), 'score'].sum()
1000 loops, best of 3: 335 μs per loop
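
Outside IPython, a similar comparison could be run with the standard timeit module. The sketch below makes two assumptions that are not part of the answer: it uses np.isin, which newer NumPy releases recommend in place of np.in1d, and the function names are illustrative. Timings vary by machine and library version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

x = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                 columns=['name', 'score'])
names = ['sam', 'ruby']
y = pd.concat([x] * 1000)  # 3000 rows, as in the session above


def with_series_isin():
    # pandas membership test on the column
    return y.loc[y['name'].isin(names), 'score'].sum()


def with_numpy_isin():
    # NumPy membership test on the underlying array (assumed stand-in for np.in1d)
    mask = np.isin(y['name'].to_numpy(), names)
    return y.loc[mask, 'score'].sum()


# Both approaches should select the same rows.
same_result = with_series_isin() == with_numpy_isin()

for fn in (with_series_isin, with_numpy_isin):
    seconds = timeit.timeit(fn, number=100) / 100
    print(f"{fn.__name__}: {seconds * 1e6:.0f} us per call")
```

The gap in the original measurement is modest (413 μs versus 335 μs), so it is worth re-measuring on your own data before choosing one over the other.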

Answer 2

Answered by Dhwani Katagade

If I need to search on a field, I have noticed that it helps immensely if I change the index of the DataFrame to the search field. For one of my search and lookup requirements I got a performance improvement of around 500%.

So in your case, the following could be used to search and filter by name:

df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]], 
                 columns=['name', 'score'])
names = ['sam', 'ruby']

df_searchable = df.set_index('name')

df_searchable[df_searchable.index.isin(names)]
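
To get the sum the question asks for, the index-based filter can be followed by a column sum. This is a minimal sketch of that last step; the variable names matches and total are illustrative, and whether the indexed version is actually faster depends on the data, so it is worth timing both.

```python
import pandas as pd

df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                  columns=['name', 'score'])
names = ['sam', 'ruby']

# Build the name-indexed frame once and reuse it across lookups.
df_searchable = df.set_index('name')

# Keep only rows whose index label is in names, then sum their scores.
matches = df_searchable[df_searchable.index.isin(names)]
total = matches['score'].sum()

print(matches)
print(total)
```

Setting the index has its own cost, so this approach probably pays off mainly when the same frame is searched many times.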
