Efficiently select rows that match one of several values in a Pandas DataFrame

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/22485375/


Tags: python, pandas

Asked by MRocklin

Problem

Given data in a Pandas DataFrame like the following:

Name     Amount
---------------
Alice       100
Bob          50
Charlie     200
Alice        30
Charlie      10

I want to select all rows where the Name is one of several values in a collection {Alice, Bob}:

Name     Amount
---------------
Alice       100
Bob          50
Alice        30

Question

What is an efficient way to do this in Pandas?

Options as I see them

  1. Loop through rows, handling the logic with Python
  2. Select and merge many statements like the following

    merge(df[df.name == specific_name] for specific_name in names) # something like this
    
  3. Perform some sort of join (a rough sketch follows the list)

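As a rough sketch of option 3 (my own illustration, not code from the original question): the filter can be expressed as an inner join against a one-column DataFrame holding the wanted names.

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Charlie'],
                   'Amount': [100, 50, 200, 30, 10]})
wanted = pd.DataFrame({'Name': ['Alice', 'Bob']})

# merge defaults to an inner join, so only rows whose Name appears
# in `wanted` survive; note the row order may differ from df's.
result = df.merge(wanted, on='Name')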

What are the performance trade-offs here? When is one solution better than the others? What solutions am I missing?

While the example above uses strings, my actual job matches on 10-100 integers over millions of rows, so fast NumPy operations may be relevant.

Accepted answer by Andy Hayden

You can use the isin Series method:

In [11]: df['Name'].isin(['Alice', 'Bob'])
Out[11]: 
0     True
1     True
2    False
3     True
4    False
Name: Name, dtype: bool

In [12]: df[df.Name.isin(['Alice', 'Bob'])]
Out[12]: 
    Name  Amount
0  Alice     100
1    Bob      50
3  Alice      30
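
As a usage note (my own sketch, not part of the original answer): isin accepts any list-like of values, including a NumPy integer array, so the same one-liner covers the integer use case described in the question.

import numpy as np
import pandas as pd

# Hypothetical integer-valued data mirroring the question's scale.
df = pd.DataFrame({'Name': np.random.randint(1000, size=10**6),
                   'Amount': np.random.randint(1000, size=10**6)})
names = np.array([1, 7, 42, 99])        # the 10-100 values to match
subset = df[df['Name'].isin(names)]     # same pattern as with strings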

Answer by unutbu

Since, in your actual use case, the values in df['Name'] are ints, you might be able to generate the boolean mask faster using NumPy indexing instead of Series.isin:

idx = np.zeros(N, dtype='bool')   # lookup table over all possible values
idx[names] = True                 # mark the values to match
df[idx[df['Name'].values]]        # index the table with the column to build a row mask


For example, given this setup:

import pandas as pd
import numpy as np

N = 100000
df = pd.DataFrame(np.random.randint(N, size=(10**6, 2)), columns=['Name', 'Amount'])
names = np.random.choice(np.arange(N), size=100, replace=False)


In [81]: %timeit idx = np.zeros(N, dtype='bool'); idx[names] = True; df[idx[df['Name'].values]]
100 loops, best of 3: 9.88 ms per loop

In [82]: %timeit df[df.Name.isin(names)]
10 loops, best of 3: 107 ms per loop

In [83]: 107/9.88
Out[83]: 10.82995951417004


N is (essentially) the maximum value that df['Name'] can attain. If N is smaller, the speed benefit is not as large. With N = 200:

In [93]: %timeit idx = np.zeros(N, dtype='bool'); idx[names] = True; df[idx[df['Name'].values]]
10 loops, best of 3: 62.6 ms per loop

In [94]: %timeit df[df.Name.isin(names)]
10 loops, best of 3: 178 ms per loop

In [95]: 178/62.6
Out[95]: 2.8434504792332267


Caution: As shown above, there seems to be a speed benefit, particularly as N gets large. However, if N is too large, then forming idx = np.zeros(N, dtype='bool') may not be feasible.

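To make the caution concrete, a back-of-the-envelope estimate (my own illustration, not from the original answer): a NumPy boolean array uses one byte per element, so idx costs roughly N bytes.

# One byte per element for dtype='bool', so idx needs ~N bytes.
for N in (10**5, 10**8, 10**11):
    print(f"N = {N:.0e}: idx needs ~{N / 2**30:.2f} GiB")

# N = 1e+05: idx needs ~0.00 GiB
# N = 1e+08: idx needs ~0.09 GiB
# N = 1e+11: idx needs ~93.13 GiB  (likely infeasible on typical machines)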



Sanity check:

expected = df[df.Name.isin(names)]
idx = np.zeros(N, dtype='bool')
idx[names] = True
result = df[idx[df['Name'].values]]
assert expected.equals(result)