Find (only) the first row satisfying a given condition in pandas DataFrame
Original source: http://stackoverflow.com/questions/47601118/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by peter
I have a dataframe df with a very long column of random positive integers:
df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})
I want to determine the index of the first even number in the column. One way to do this is:
df[df.n % 2 == 0].iloc[0]
but this involves a lot of operations (generate the boolean mask df.n % 2 == 0, evaluate df on that mask and finally take the first item) and is very slow. A loop like this is much quicker:
for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
partly because the first match will probably be in the first few rows. Is there any pandas method for doing this with similar performance? Thank you.
NOTE: This condition (being an even number) is just an example. I'm looking for a solution that works for any kind of condition on the values, i.e., for a fast one-line alternative to:
df[ conditions on df.n ].iloc[0]
Answered by Anton vBR
I ran some timings and, yes, using a generator will normally give you quicker results:
df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})
%timeit df[df.n % 2 == 0].iloc[0]
%timeit df.iloc[next(k for k,v in df.iterrows() if v.n % 2 == 0)]
%timeit df.iloc[next(t[0] for t in df.itertuples() if t.n % 2 == 0)]
I get:
1000 loops, best of 3: 1.09 ms per loop
1000 loops, best of 3: 619 μs per loop # <-- iterrows generator
1000 loops, best of 3: 1.1 ms per loop
10000 loops, best of 3: 25 μs per loop # <--- your solution
However, when you scale it up:
df = pd.DataFrame({'n': np.random.randint(1, 10, size = 1000000)})
The difference disappears:
10 loops, best of 3: 40.5 ms per loop
10 loops, best of 3: 40.7 ms per loop # <--- iterrows
10 loops, best of 3: 56.9 ms per loop
Your solution is quickest, so why not use it?
for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
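As a minimal sketch, the loop above can be wrapped in a reusable function that accepts any scalar condition and reports when nothing matches. The name `first_position` and the small example frame are illustrative assumptions, not part of the original answer.

```python
import pandas as pd


def first_position(predicate, series):
    """Return the position of the first value satisfying `predicate`, or None.

    `first_position` is a hypothetical helper name; `predicate` takes one scalar.
    """
    for j in range(len(series)):
        if predicate(series.iloc[j]):
            return j  # stop at the first match instead of scanning everything
    return None  # no value satisfied the predicate


df = pd.DataFrame({'n': [1, 3, 4, 6, 7]})
even_pos = first_position(lambda v: v % 2 == 0, df.n)
missing_pos = first_position(lambda v: v > 100, df.n)
print(even_pos, missing_pos)
```

Returning None explicitly avoids the ambiguity of the bare loop, where j simply ends at the last position when no row matches.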
Answered by peter
For fun, I decided to play with a few possibilities. I take a dataframe:
MAX = 10**7
df = pd.DataFrame({'n': range(MAX)})
(Not random this time.) I want to find the first row for which n >= N for some value of N. I have timed the following four versions:
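def getfirst_pandas(condition, df):
    # Boolean-mask the whole frame, then take the first matching row.
    return df[condition(df)].iloc[0]

def getfirst_iterrows_loop(condition, df):
    # Iterate row by row and stop at the first match.
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None

def getfirst_for_loop(condition, df):
    # Positional loop; returns the position of the first match, or None.
    for j in range(len(df)):
        if condition(df.iloc[j]):
            return j
    return None

def getfirst_numpy_argmax(condition, df):
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    array = df.to_numpy()
    # argmax of a boolean array is the first True (or 0 if none is True).
    imax = np.argmax(condition(array))
    return df.index[imax]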
with N = powers of ten. Of course the numpy (optimized C) code is expected to be faster than for loops in python, but I wanted to see for which values of N python loops are still okay.
I timed the lines:
getfirst_pandas(lambda x: x.n >= N, df)
getfirst_iterrows_loop(lambda x: x.n >= N, df)
getfirst_for_loop(lambda x: x.n >= N, df)
getfirst_numpy_argmax(lambda x: x >= N, df.n)
for N = 1, 10, 100, 1000, .... This is the log-log graph of the performance:
[Figure: log-log plot of run time versus N for the four functions]
The simple for loop is ok as long as the "first True position" is expected to be near the beginning, but it becomes slow when the match is further in. np.argmax is the safest solution.
As you can see from the graph, the times for pandas and argmax remain (almost) constant, because they always scan the whole array. It would be perfect to have a np or pandas method which doesn't.
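One way to approximate such a method is to apply the vectorized condition to successive chunks and stop at the first chunk containing a match. This is only a sketch under assumptions: `first_match_chunked`, its `chunk_size` parameter, and the doubling strategy are hypothetical choices, not an existing numpy or pandas API.

```python
import numpy as np
import pandas as pd


def first_match_chunked(condition, series, chunk_size=1024):
    """Return the index label of the first value satisfying `condition`, or None.

    Hypothetical helper: `condition` must be vectorized (array in, boolean
    array out). Only the chunks up to the first match are evaluated.
    """
    values = series.to_numpy()
    start = 0
    while start < len(values):
        mask = np.asarray(condition(values[start:start + chunk_size]))
        if mask.any():
            # argmax on a boolean array gives the position of the first True.
            return series.index[start + int(np.argmax(mask))]
        start += chunk_size
        chunk_size *= 2  # grow the chunk so late matches need few iterations
    return None  # unlike a bare argmax, a missing match is reported explicitly


df = pd.DataFrame({'n': range(10**6)})
found = first_match_chunked(lambda x: x >= 5000, df.n)
not_found = first_match_chunked(lambda x: x < 0, df.n)
print(found, not_found)
```

Early matches cost about as much as one small vectorized call, while a match near the end costs roughly one full scan plus a handful of Python-level iterations.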
Answered by ajsp
zip the index and the column together, then loop over the pairs. Looping over zip is faster than iterrows() or itertuples().
for idx, value in zip(df.index, df.n):
    if value % 2 == 0:
        index_position = idx
        break
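The same zip loop can also be written as a single expression with next, which accepts a default that is returned when nothing matches. The example frame and the name `first_label` below are illustrative assumptions.

```python
import pandas as pd

# Small illustrative frame with a non-numeric index.
df = pd.DataFrame({'n': [3, 7, 5, 8, 2]}, index=list('abcde'))

# The generator is lazy, so iteration stops at the first even value;
# the second argument to next() is returned if no value matches.
first_label = next(
    (idx for idx, value in zip(df.index, df.n) if value % 2 == 0),
    None,
)
print(first_label)
```

Because zip yields the index label alongside each value, the result is a label rather than a position, so it can be passed directly to df.loc.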
Answered by Thomas Fauskanger
One option that lets you iterate over rows and stop as soon as you're satisfied is DataFrame.iterrows, which is pandas' row iterator.
In this case you could implement it something like this:
def get_first_row_with(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None # Condition not met on any row in entire DataFrame
Then, given a DataFrame, e.g.:
df = pd.DataFrame({
    'cats': [1,2,3,4],
    'dogs': [2,4,6,8]
},
index=['Alice', 'Bob', 'Charlie', 'Eve'])
You can use it as follows:
def some_condition(row):
    return row.cats + row.dogs >= 7
index, row = get_first_row_with(some_condition, df)
# Use results however you like, e.g.:
print('{} is the first person to have at least 7 pets.'.format(index))
print('They have {} cats and {} dogs!'.format(row.cats, row.dogs))
Which would output:
Charlie is the first person to have at least 7 pets.
They have 3 cats and 6 dogs!
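Note that unpacking the result directly fails with a TypeError when no row matches, because the function then returns None. A hedged sketch of guarding against that case follows; the threshold of 100 pets and the `message` variable are made up for illustration.

```python
import pandas as pd


def get_first_row_with(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None  # Condition not met on any row in entire DataFrame


df = pd.DataFrame({'cats': [1, 2, 3, 4], 'dogs': [2, 4, 6, 8]},
                  index=['Alice', 'Bob', 'Charlie', 'Eve'])

# Nobody has 100 pets, so the function returns None here.
result = get_first_row_with(lambda row: row.cats + row.dogs >= 100, df)
if result is None:
    message = 'Nobody has that many pets.'
else:
    index, row = result  # safe to unpack only after the None check
    message = '{} is the first match.'.format(index)
print(message)
```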
Answered by EdG
TLDR: You can use next(j for j in range(len(df)) if df.at[j, "n"] % 2 == 0)
I think it is perfectly possible to write your code as a one-liner. Let's define a DataFrame to demonstrate this:
df = pd.DataFrame({'n': np.random.randint(1, 10, size = 100000)})
First, your code gives:
for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
# 22.1 μs ± 1.5 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Converting that to a one-liner gives:
next(j for j in range(len(df)) if df["n"].iloc[j] % 2 == 0)
# 20.6 μs ± 1.26 μs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
To further speed up the calculation, we can use at instead of iloc, as it is faster when accessing single values:
next(j for j in range(len(df)) if df.at[j, "n"] % 2 == 0)
# 8.88 μs ± 617 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
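One caveat: at is label-based, so the expression above relies on the default integer index, where labels and positions coincide. A sketch of a positional variant using iat, which should work with any index, is shown below; the example frame and the names `col` and `first_pos` are assumptions for illustration, and no timing is claimed for it.

```python
import pandas as pd

# Index labels deliberately differ from positions 0..4.
df = pd.DataFrame({'n': [3, 7, 5, 8, 2]}, index=[10, 20, 30, 40, 50])

# iat takes integer positions for both row and column.
col = df.columns.get_loc('n')

# The default None is returned by next() if no value is even.
first_pos = next((j for j in range(len(df)) if df.iat[j, col] % 2 == 0), None)
print(first_pos, df.index[first_pos])
```

Here df.at[j, "n"] with j in range(len(df)) would raise a KeyError, because 0 is not a label in this index.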