Python 有效地选择与 Pandas DataFrame 中多个值之一匹配的行
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/22485375/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Efficiently select rows that match one of several values in Pandas DataFrame
提问by MRocklin
Problem
问题
Given data in a Pandas DataFrame like the following:
给定 Pandas DataFrame 中的数据,如下所示:
Name Amount
---------------
Alice 100
Bob 50
Charlie 200
Alice 30
Charlie 10
I want to select all rows where the Nameis one of several values in a collection {Alice, Bob}
我想选择Name集合中多个值之一的所有行{Alice, Bob}
Name Amount
---------------
Alice 100
Bob 50
Alice 30
Question
题
What is an efficient way to do this in Pandas?
在 Pandas 中执行此操作的有效方法是什么?
Options as I see them
我看到的选项
- Loop through rows, handling the logic with Python
Select and merge many statements like the following
merge(df[df.name = specific_name] for specific_name in names) # something like thisPerform some sort of join
- 循环遍历行,使用 Python 处理逻辑
选择并合并许多语句,如下所示
merge(df[df.name = specific_name] for specific_name in names) # something like this执行某种连接
What are the performance trade-offs here? When is one solution better than the others? What solutions am I missing?
这里的性能权衡是什么?什么时候一种解决方案比其他解决方案更好?我缺少什么解决方案?
While the example above uses strings my actual job uses matches on 10-100 integers over millions of rows and so fast NumPy operations may be relevant.
虽然上面的示例使用字符串,但我的实际工作使用了数百万行中 10-100 个整数的匹配,因此快速的 NumPy 操作可能是相关的。
采纳答案by Andy Hayden
回答by unutbu
Since, in your actual use case, the values in df['Name']are ints, you might be able to generate the boolean mask faster using NumPy indexing instead of Series.isin.
由于在您的实际用例中, 中的值df['Name']是ints,因此您可以使用 NumPy 索引而不是Series.isin.
idx = np.zeros(N, dtype='bool')
idx[names] = True
df[idx[df['Name'].values]]
For example, given this setup:
例如,鉴于此设置:
import pandas as pd
import numpy as np
N = 100000
df = pd.DataFrame(np.random.randint(N, size=(10**6, 2)), columns=['Name', 'Amount'])
names = np.random.choice(np.arange(N), size=100, replace=False)
In [81]: %timeit idx = np.zeros(N, dtype='bool'); idx[names] = True; df[idx[df['Name'].values]]
100 loops, best of 3: 9.88 ms per loop
In [82]: %timeit df[df.Name.isin(names)]
10 loops, best of 3: 107 ms per loop
In [83]: 107/9.88
Out[83]: 10.82995951417004
Nis (essentially) the maximum value that df['Names']can attain.
If Nis smaller, the speed benefit is not as large. With N = 200,
N是(本质上)df['Names']可以达到的最大值。如果N较小,则速度优势就没有那么大。与N = 200,
In [93]: %timeit idx = np.zeros(N, dtype='bool'); idx[names] = True; df[idx[df['Name'].values]]
10 loops, best of 3: 62.6 ms per loop
In [94]: %timeit df[df.Name.isin(names)]
10 loops, best of 3: 178 ms per loop
In [95]: 178/62.6
Out[95]: 2.8434504792332267
Caution: As shown above, there seems to be a speed benefit, particularly as Ngets large. However, if Nis too large, then forming idx = np.zeros(N, dtype='bool')may not be feasible.
注意:如上所示,速度似乎有好处,尤其是在N变大时。但是,如果N太大,则成型idx = np.zeros(N, dtype='bool')可能不可行。
Sanity check:
完整性检查:
expected = df[df.Name.isin(names)]
idx = np.zeros(N, dtype='bool')
idx[names] = True
result = df[idx[df['Name'].values]]
assert expected.equals(result)

