Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/47601118/

Time: 2020-09-14 04:51:18  Source: igfitidea

Find (only) the first row satisfying a given condition in pandas DataFrame

python, pandas

Asked by peter

I have a dataframe df with a very long column of random positive integers:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

I want to determine the index of the first even number in the column. One way to do this is:

df[df.n % 2 == 0].iloc[0]

but this involves a lot of operations (generate the indices df.n % 2 == 0, evaluate df on those indices, and finally take the first item) and is very slow. A loop like this is much quicker:

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break

also because the first result will probably be in the first few rows. Is there any pandas method for doing this with similar performance? Thank you.

NOTE: This condition (being an even number) is just an example. I'm looking for a solution that works for any kind of condition on the values, i.e., a fast one-line alternative to:

df[ conditions on df.n ].iloc[0]

Answered by Anton vBR

I did some timings and yes, using a generator will normally give you quicker results:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

%timeit df[df.n % 2 == 0].iloc[0]
%timeit df.iloc[next(k for k,v in df.iterrows() if v.n % 2 == 0)]
%timeit df.iloc[next(t[0] for t in df.itertuples() if t.n % 2 == 0)]

I get:

1000 loops, best of 3: 1.09 ms per loop
1000 loops, best of 3: 619 μs per loop # <-- iterrows generator
1000 loops, best of 3: 1.1 ms per loop
10000 loops, best of 3: 25 μs per loop # <--- your solution

However, when you scale it up:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 1000000)})

The difference disappears:

10 loops, best of 3: 40.5 ms per loop 
10 loops, best of 3: 40.7 ms per loop # <--- iterrows
10 loops, best of 3: 56.9 ms per loop

Your solution is quickest, so why not use it?

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
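
The timings above use IPython's %timeit. Outside IPython, a comparable measurement can be sketched with the standard timeit module. This is only a rough sketch: the function names, the seed, and the repeat count are illustrative assumptions, and the numbers it prints will differ from the ones quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Seeded so repeated runs scan the same data (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
df = pd.DataFrame({"n": rng.integers(1, 10, size=10_000)})


def first_even_mask(frame):
    # Boolean mask over the whole column, then take the first hit.
    return frame[frame.n % 2 == 0].index[0]


def first_even_itertuples(frame):
    # The generator stops at the first match; t[0] is the index label.
    return next(t[0] for t in frame.itertuples() if t.n % 2 == 0)


def first_even_loop(frame):
    # The loop from the question, wrapped in a function.
    for j in range(len(frame)):
        if frame.n.iloc[j] % 2 == 0:
            return j
    return None


for func in (first_even_mask, first_even_itertuples, first_even_loop):
    seconds = timeit.timeit(lambda: func(df), number=20) / 20
    print(f"{func.__name__}: {seconds * 1e6:.1f} µs per call")
```

With the default integer index used here, all three functions return the same value, so only the running time differs.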

Answered by peter

I decided for fun to play with a few possibilities. I take a dataframe:

MAX = 10**7
df = pd.DataFrame({'n': range(MAX)})

(Not random this time.) I want to find the first row for which n >= N for some value of N. I have timed the following four versions:

def getfirst_pandas(condition, df):
    return df[condition(df)].iloc[0]

def getfirst_iterrows_loop(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None

def getfirst_for_loop(condition, df):
    for j in range(len(df)):
        if condition(df.iloc[j]):
            return j
    return None  # condition not met on any row

def getfirst_numpy_argmax(condition, df):
    array = df.to_numpy()  # as_matrix() was removed in pandas 1.0
    imax  = np.argmax(condition(array))  # 0 if the condition is never met
    return df.index[imax]

with N = powers of ten. Of course the numpy (optimized C) code is expected to be faster than for loops in python, but I wanted to see for which values of N python loops are still okay.

I timed the lines:

getfirst_pandas(lambda x: x.n >= N, df)
getfirst_iterrows_loop(lambda x: x.n >= N, df)
getfirst_for_loop(lambda x: x.n >= N, df)
getfirst_numpy_argmax(lambda x: x >= N, df.n)

for N = 1, 10, 100, 1000, .... This is the log-log graph of the performance:

[Figure: log-log graph of the performance]

The simple for loop is OK as long as the "first True position" is expected to be near the beginning, but beyond that it becomes slow. np.argmax is the safest solution.
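
One caveat with np.argmax is that it returns 0 for an all-False mask, which is indistinguishable from a match in the first row. A minimal sketch of a guard follows; the helper name first_true_label is hypothetical, not part of numpy or pandas.

```python
import numpy as np
import pandas as pd


def first_true_label(mask, index):
    """Return the index label of the first True in mask, or None if there is none."""
    mask = np.asarray(mask, dtype=bool)
    if mask.size == 0:
        return None
    pos = int(np.argmax(mask))  # position of the first True, or 0 if all False
    if not mask[pos]:
        return None
    return index[pos]


df = pd.DataFrame({"n": range(10)})
print(first_true_label(df.n >= 7, df.index))    # 7
print(first_true_label(df.n >= 100, df.index))  # None
```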

As you can see from the graph, the times for pandas and argmax remain (almost) constant, because they always scan the whole array. It would be perfect to have a np or pandas method which doesn't.
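
One possible workaround, sketched here under assumptions rather than taken from numpy or pandas, is to apply the vectorized condition block by block and stop at the first block that contains a match. The helper name first_true_chunked and the default chunk_size are illustrative choices, not tuned values.

```python
import numpy as np
import pandas as pd


def first_true_chunked(condition, values, chunk_size=65_536):
    """Position of the first element satisfying condition, or None.

    chunk_size is an arbitrary illustrative default.
    """
    values = np.asarray(values)
    for start in range(0, len(values), chunk_size):
        # Evaluate the condition on one block only.
        hits = np.flatnonzero(condition(values[start:start + chunk_size]))
        if hits.size:
            return start + int(hits[0])
    return None


df = pd.DataFrame({"n": range(10**6)})
pos = first_true_chunked(lambda x: x >= 1000, df.n)
print(df.index[pos])  # 1000
```

When the first match is early, only the first block is evaluated; when there is no match, the whole array is still scanned, so the worst case is no better than argmax.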

Answered by ajsp

Zip both the index and the column, then loop over that for faster loop speed. zip provides the fastest looping performance, faster than iterrows() or itertuples().

for j in zip(df.index, df.n):
    if j[1] % 2 == 0:
        index_position = j[0]
        break
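
The same zip loop can also be written as a single expression with next(), which additionally lets you choose what to get back when nothing matches. This is a small sketch on made-up data; the string index is only there to show that the result is an index label.

```python
import pandas as pd

df = pd.DataFrame({"n": [3, 5, 8, 1, 4]}, index=list("abcde"))

# next() stops the generator at the first match; the second argument
# (None) is returned when no row satisfies the condition.
index_position = next(
    (label for label, value in zip(df.index, df.n) if value % 2 == 0),
    None,
)
print(index_position)  # c
```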

Answered by Thomas Fauskanger

One option that lets you iterate over rows and stop when you're satisfied is to use DataFrame.iterrows, which is pandas' row iterator.

In this case you could implement it something like this:

def get_first_row_with(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None # Condition not met on any row in entire DataFrame

Then, given a DataFrame, e.g.:

df = pd.DataFrame({
                    'cats': [1,2,3,4], 
                    'dogs': [2,4,6,8]
                  }, 
                  index=['Alice', 'Bob', 'Charlie', 'Eve'])

you can use it like this:

def some_condition(row):
    return row.cats + row.dogs >= 7

index, row = get_first_row_with(some_condition, df)

# Use results however you like, e.g.:
print('{} is the first person to have at least 7 pets.'.format(index))
print('They have {} cats and {} dogs!'.format(row.cats, row.dogs))

Which would output:

Charlie is the first person to have at least 7 pets.
They have 3 cats and 6 dogs!
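
Note that get_first_row_with returns None when no row matches, so unpacking the result directly into index, row would raise a TypeError in that case. A minimal sketch of handling both outcomes, with the helper repeated so the snippet runs on its own and an illustrative threshold of 100 pets:

```python
import pandas as pd


def get_first_row_with(condition, df):
    # Same helper as above, repeated so this sketch is self-contained.
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None


df = pd.DataFrame({"cats": [1, 2, 3, 4], "dogs": [2, 4, 6, 8]},
                  index=["Alice", "Bob", "Charlie", "Eve"])

result = get_first_row_with(lambda row: row.cats + row.dogs >= 100, df)
if result is None:
    print("Nobody has that many pets.")
else:
    index, row = result
    print("{} is the first match.".format(index))
```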

Answered by EdG

TLDR: You can use next(j for j in range(len(df)) if df.at[j, "n"] % 2 == 0)

I think it is perfectly possible to write your code as a one-liner. Let's define a DataFrame to prove this:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 100000)})

First, your code gives:

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
# 22.1 μs ± 1.5 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Converting that to a one-liner gives:

next(j for j in range(len(df)) if df["n"].iloc[j] % 2 == 0)
# 20.6 μs ± 1.26 μs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

To further speed up the calculation, we can make use of at instead of iloc, as this is faster when accessing single values:

next(j for j in range(len(df)) if df.at[j, "n"] % 2 == 0)
# 8.88 μs ± 617 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
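
One thing to keep in mind: at looks rows up by label, so df.at[j, "n"] with j from range(len(df)) only works when the index is the default 0..len-1. For other indexes, the positional counterpart iat can be used instead. A small sketch on made-up data, which also passes a default to next() so a missing match gives None rather than StopIteration:

```python
import pandas as pd

# A frame whose index labels are not 0..len-1, so labels and positions differ.
df = pd.DataFrame({"n": [3, 5, 8, 1, 4]}, index=[10, 20, 30, 40, 50])

col = df.columns.get_loc("n")  # integer position of column "n"
pos = next((j for j in range(len(df)) if df.iat[j, col] % 2 == 0), None)

print(pos)            # 2
print(df.index[pos])  # 30
```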