Find (only) the first row satisfying a given condition in a Pandas DataFrame

Question

Asked by peter

I have a dataframe df with a very long column of random positive integers:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

I want to determine the index of the first even number in the column. One way to do this is:

df[df.n % 2 == 0].iloc[0]

but this involves a lot of operations (generate the boolean mask df.n % 2 == 0, evaluate df on that mask, and finally take the first item) and is very slow. A loop like this is much quicker:

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break

partly because the first match will probably be in the first few rows. Is there a pandas method that does this with similar performance? Thank you.

NOTE: This condition (being an even number) is just an example. I'm looking for a solution that works for any kind of condition on the values, i.e., a fast one-line alternative to:

df[ conditions on df.n ].iloc[0]

Answer 1

Answered by Anton vBR

I did some timings, and yes, using a generator will normally give you quicker results:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

%timeit df[df.n % 2 == 0].iloc[0]
%timeit df.iloc[next(k for k,v in df.iterrows() if v.n % 2 == 0)]
%timeit df.iloc[next(t[0] for t in df.itertuples() if t.n % 2 == 0)]

I get:

1000 loops, best of 3: 1.09 ms per loop
1000 loops, best of 3: 619 μs per loop # <-- iterrows generator
1000 loops, best of 3: 1.1 ms per loop
10000 loops, best of 3: 25 μs per loop # <--- your solution

However, when you scale it up:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 1000000)})

The difference disappears:

10 loops, best of 3: 40.5 ms per loop 
10 loops, best of 3: 40.7 ms per loop # <--- iterrows
10 loops, best of 3: 56.9 ms per loop

Your solution is quickest, so why not use it?

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break

Answer 2

Answered by peter

For fun, I decided to try a few possibilities. I take a dataframe:

MAX = 10**7
df = pd.DataFrame({'n': range(MAX)})

(not random this time.) I want to find the first row for which n >= N for some value of N. I have timed the following four versions:

def getfirst_pandas(condition, df):
    return df[condition(df)].iloc[0]

def getfirst_iterrows_loop(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None

def getfirst_for_loop(condition, df):
    for j in range(len(df)):
        if condition(df.iloc[j]):
            break
    return j

def getfirst_numpy_argmax(condition, df):
    array = df.to_numpy()  # as_matrix() was removed in pandas 1.0
    imax  = np.argmax(condition(array))
    return df.index[imax]

with N set to powers of ten. Of course, the numpy (optimized C) code is expected to be faster than for loops in Python, but I wanted to see for which values of N the Python loops are still okay.

I timed the lines:

getfirst_pandas(lambda x: x.n >= N, df)
getfirst_iterrows_loop(lambda x: x.n >= N, df)
getfirst_for_loop(lambda x: x.n >= N, df)
getfirst_numpy_argmax(lambda x: x >= N, df.n)

for N = 1, 10, 100, 1000, .... This is the log-log graph of the performance:

[Figure not reproduced: log-log plot of running time versus N for the four versions]

The simple for loop is fine as long as the first True position is expected to be near the beginning, but it degrades badly after that. np.argmax is the safest solution.
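One caveat with the np.argmax approach is that it returns 0 when no element satisfies the condition, which is indistinguishable from a match in the first row. The following is a minimal sketch of a guard for that case; the helper name first_index_argmax is hypothetical and not part of numpy or pandas.

```python
import numpy as np
import pandas as pd


def first_index_argmax(mask, index):
    """Return the index label of the first True in mask, or None if there is none.

    Hypothetical helper: mask is a boolean array, index is the matching pandas Index.
    """
    mask = np.asarray(mask, dtype=bool)
    if mask.size == 0:
        return None
    pos = int(np.argmax(mask))  # position of the first True, or 0 if all False
    if not mask[pos]:           # all False: argmax fell back to position 0
        return None
    return index[pos]


df = pd.DataFrame({'n': [1, 3, 4, 7, 8]})
print(first_index_argmax(df.n.to_numpy() % 2 == 0, df.index))  # 2
print(first_index_argmax(df.n.to_numpy() > 100, df.index))     # None
```

This still scans the whole array, so it keeps the constant running time described above.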

As you can see from the graph, the times for pandas and argmax remain (almost) constant, because they always scan the whole array. It would be perfect to have a numpy or pandas method that doesn't.
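One possible middle ground, sketched here as an assumption rather than a built-in numpy or pandas feature, is to evaluate the vectorized condition on fixed-size chunks and stop at the first chunk containing a match. The function name first_position_chunked and the default chunk size are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd


def first_position_chunked(values, condition, chunk_size=1024):
    """Return the position of the first element satisfying condition, or None.

    condition must be vectorized: it takes an array and returns a boolean array.
    """
    values = np.asarray(values)
    for start in range(0, len(values), chunk_size):
        # Evaluate the condition on one chunk only.
        mask = np.asarray(condition(values[start:start + chunk_size]), dtype=bool)
        hits = np.flatnonzero(mask)
        if hits.size:
            # Stop early: later chunks are never evaluated.
            return start + int(hits[0])
    return None


df = pd.DataFrame({'n': range(10**6)})
pos = first_position_chunked(df.n.to_numpy(), lambda x: x >= 1000)
print(pos)  # 1000
print(df.iloc[pos])
```

The work done is proportional to the position of the first match (rounded up to a whole chunk), while each chunk is still processed by vectorized code. A suitable chunk size depends on the data and would need to be measured.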

Answer 3

Answered by ajsp

Zip the index and the column together, then loop over the pairs. zip provides the fastest looping performance, faster than iterrows() or itertuples().

for j in zip(df.index, df.n):
    if j[1] % 2 == 0:
        index_position = j[0]
        break
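The same zip loop can be written as a single expression with next(), which also makes the no-match case explicit. This is a sketch only; first_label_zip is a hypothetical helper name, and it returns the index label (not the integer position) of the first match.

```python
import pandas as pd


def first_label_zip(df, column, condition):
    """Return the index label of the first value in df[column] satisfying
    condition, or None if no value does. condition takes a single scalar."""
    pairs = zip(df.index, df[column])
    return next((label for label, value in pairs if condition(value)), None)


df = pd.DataFrame({'n': [1, 3, 4, 6]}, index=[10, 20, 30, 40])
print(first_label_zip(df, 'n', lambda v: v % 2 == 0))  # 30
print(first_label_zip(df, 'n', lambda v: v > 100))     # None
```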

Answer 4

Answered by Thomas Fauskanger

One option that lets you iterate over rows and stop as soon as you're satisfied is DataFrame.iterrows, which is pandas' row iterator.

In this case you could implement it something like this:

def get_first_row_with(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None # Condition not met on any row in entire DataFrame

Then, given a DataFrame, e.g.:

df = pd.DataFrame({
                    'cats': [1,2,3,4], 
                    'dogs': [2,4,6,8]
                  }, 
                  index=['Alice', 'Bob', 'Charlie', 'Eve'])

you can use it as:

def some_condition(row):
    return row.cats + row.dogs >= 7

index, row = get_first_row_with(some_condition, df)

# Use results however you like, e.g.:
print('{} is the first person to have at least 7 pets.'.format(index))
print('They have {} cats and {} dogs!'.format(row.cats, row.dogs))

Which would output:

Charlie is the first person to have at least 7 pets.
They have 3 cats and 6 dogs!
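Because get_first_row_with returns None when no row matches, unpacking its result directly into index, row would raise a TypeError in that case. A minimal sketch of handling the no-match case, reusing the function and example data from this answer (the threshold of 100 is chosen only so that nothing matches):

```python
import pandas as pd


def get_first_row_with(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None  # Condition not met on any row in entire DataFrame


df = pd.DataFrame({'cats': [1, 2, 3, 4], 'dogs': [2, 4, 6, 8]},
                  index=['Alice', 'Bob', 'Charlie', 'Eve'])

# No one has 100 pets, so the function returns None.
result = get_first_row_with(lambda row: row.cats + row.dogs >= 100, df)
if result is None:
    message = 'Nobody has that many pets.'
else:
    index, row = result
    message = '{} is the first match.'.format(index)
print(message)  # Nobody has that many pets.
```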

Answer 5

Answered by EdG

TLDR: You can use next(j for j in range(len(df)) if df.at[j, "n"] % 2 == 0)

I think it is perfectly possible to write your code as a one-liner. Let's define a DataFrame to demonstrate this:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 100000)})

First, your code gives:

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
% 22.1 μs ± 1.5 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Converting that to a one-liner gives:

next(j for j in range(len(df)) if df["n"].iloc[j] % 2 == 0)
% 20.6 μs ± 1.26 μs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

To further speed up the calculation, we can use at instead of iloc, as it is faster when accessing single values:

next(j for j in range(len(df)) if df.at[j, "n"] % 2 == 0)
% 8.88 μs ± 617 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
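Note that df.at looks values up by label, so the one-liner above relies on the default index 0, 1, 2, ... and raises StopIteration when nothing matches. As a sketch under those observations, the positional accessor iat together with a default for next() handles an arbitrary index and the no-match case; the small example DataFrame here is made up for illustration.

```python
import pandas as pd

# Example data with a non-default (string) index.
df = pd.DataFrame({'n': [1, 3, 5, 6, 8]}, index=list('abcde'))

col = df.columns.get_loc('n')  # integer position of column 'n'

# iat is purely positional, so this works for any index.
# The second argument to next() is returned when no row matches.
pos = next((j for j in range(len(df)) if df.iat[j, col] % 2 == 0), None)

print(pos)            # 3
print(df.index[pos])  # d
```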
