Python: Fastest way to compare row and previous row in pandas dataframe with millions of rows

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow post: http://stackoverflow.com/questions/29446844/

Fastest way to compare row and previous row in pandas dataframe with millions of rows

Tags: python, performance, pandas, bigdata, cython

Asked by AdO

I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row.

As an example, this is a simplified version of my problem:

   User  Time                 Col1  newcol1  newcol2  newcol3  newcol4
0     1     6     [cat, dog, goat]        0        0        0        0
1     1     6         [cat, sheep]        0        0        0        0
2     1    12        [sheep, goat]        0        0        0        0
3     2     3          [cat, lion]        0        0        0        0
4     2     5  [fish, goat, lemur]        0        0        0        0
5     3     9           [cat, dog]        0        0        0        0
6     4     4          [dog, goat]        0        0        0        0
7     4    11                [cat]        0        0        0        0

At the moment I have a function which loops through and calculates values for 'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1. It also looks at the first value in the arrays stored in 'Col1' and 'Col2' and updates 'newcol3' and 'newcol4' if these values have changed since the previous row.

Here's the pseudo-code for what I'm doing currently (since I've simplified the problem I haven't tested this but it's pretty similar to what I'm actually doing in ipython notebook):

def myJFunc(df):
    # initialize jnum counter
    jnum = 0
    # loop through each row of dataframe (not including the first/zeroth)
    for i in range(1, len(df)):
        # has user changed?
        if df.User.loc[i] == df.User.loc[i-1]:
            # has time increased by more than 1 (hour)?
            if abs(df.Time.loc[i] - df.Time.loc[i-1]) > 1:
                # update new columns
                df['newcol2'].loc[i-1] = 1
                df['newcol1'].loc[i] = 1
                # increase jnum
                jnum += 1
            # has content changed?
            if df.Col1.loc[i][0] != df.Col1.loc[i-1][0]:
                # record this change
                df['newcol4'].loc[i-1] = [df.Col1.loc[i-1][0], df.Col2.loc[i][0]]
        # different user?
        elif df.User.loc[i] != df.User.loc[i-1]:
            # update new columns
            df['newcol1'].loc[i] = 1
            df['newcol2'].loc[i-1] = 1
            # store jnum elsewhere (code not included here) and reset jnum
            jnum = 1

I now need to apply this function to several million rows and it's impossibly slow so I'm trying to figure out the best way to speed it up. I've heard that Cython can increase the speed of functions but I have no experience with it (and I'm new to both pandas and python). Is it possible to pass two rows of a dataframe as arguments to the function and then use Cython to speed it up or would it be necessary to create new columns with "diff" values in them so that the function only reads from and writes to one row of the dataframe at a time, in order to benefit from using Cython? Any other speed tricks would be greatly appreciated!

(As regards using .loc, I compared .loc, .iloc and .ix and this one was marginally faster so that's the only reason I'm using that currently)

(Also, my User column in reality is unicode not int, which could be problematic for speedy comparisons)

Accepted answer by JohnE

I was thinking along the same lines as Andy, just with groupby added, and I think this is complementary to Andy's answer. Adding groupby just has the effect of putting a NaN in the first row of each group whenever you do a diff or shift. (Note that this is not an attempt at an exact answer, just to sketch out some basic techniques.)

df['time_diff'] = df.groupby('User')['Time'].diff()

df['Col1_0'] = df['Col1'].apply( lambda x: x[0] )

df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()

   User  Time                 Col1  time_diff Col1_0 Col1_0_prev
0     1     6     [cat, dog, goat]        NaN    cat         NaN
1     1     6         [cat, sheep]          0    cat         cat
2     1    12        [sheep, goat]          6  sheep         cat
3     2     3          [cat, lion]        NaN    cat         NaN
4     2     5  [fish, goat, lemur]          2   fish         cat
5     3     9           [cat, dog]        NaN    cat         NaN
6     4     4          [dog, goat]        NaN    dog         NaN
7     4    11                [cat]          7    cat         dog

As a followup to Andy's point about storing objects, note that what I did here was to extract the first element of the list column (and also add a shifted version). Doing it like this, you only have to do the expensive extraction once, and after that you can stick to standard pandas methods.

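To take this one step further (this is my own sketch, not part of the original answer), once the helper columns above exist, the flag columns from the question can be filled with vectorized comparisons instead of a row loop:

# assumes the time_diff, Col1_0 and Col1_0_prev columns created above
gap = df['time_diff'] > 1                                     # NaN > 1 is False, so each user's first row is skipped
df['newcol1'] = gap.astype(int)                               # flag the row after a gap of more than 1 hour
df['newcol2'] = gap.shift(-1, fill_value=False).astype(int)   # flag the row before the gap
df['newcol3'] = ((df['Col1_0'] != df['Col1_0_prev']) & df['Col1_0_prev'].notna()).astype(int)  # first animal changed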

Answered by Kirell

In your problem, it seems like you want to iterate through rows pairwise. The first thing you could do is something like this:

from itertools import tee

def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)  # on Python 2, use itertools.izip here instead of zip

for (idx1, row1), (idx2, row2) in pairwise(df.iterrows()):
    pass  # your stuff goes here

However, you cannot modify row1 and row2 directly; you will still need to use .loc or .iloc with the indexes.

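For example, a rough sketch using the column names from the question (hypothetical, not from the original answer), writing back through the dataframe and the index rather than through the row copies:

for (idx1, row1), (idx2, row2) in pairwise(df.iterrows()):
    if row1['User'] == row2['User'] and abs(row2['Time'] - row1['Time']) > 1:
        df.loc[idx1, 'newcol2'] = 1   # write via df.loc and the index...
        df.loc[idx2, 'newcol1'] = 1   # ...because row1/row2 are only copies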

If iterrows is still too slow, I suggest doing something like this:

  • Create a user_id column from your unicode names using pd.unique(User) and a dictionary mapping each name to an integer id.

  • Create a delta dataframe: from a shifted dataframe with the user_id and time columns, subtract the original dataframe.

    df[[col1, ..]].shift() - df[[col1, ..]]
    
If the user_id delta is non-zero, it means that the user changed between two consecutive rows. The time column can be filtered directly with delta[delta['Time'] > 1]. With this delta dataframe you record the changes row-wise, and you can use it as a mask to update the columns you need in your original dataframe (see the sketch below).

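A minimal sketch of that idea, assuming the column names from the question (this is my reading of the suggestion, not code from the original answer):

import pandas as pd

# map the unicode user names to integer ids
user_ids = {name: i for i, name in enumerate(pd.unique(df['User']))}
df['user_id'] = df['User'].map(user_ids)

# delta dataframe: shifted rows minus the original rows
delta = df[['user_id', 'Time']].shift() - df[['user_id', 'Time']]

same_user = delta['user_id'] == 0     # zero delta means the user did not change
big_gap = delta['Time'].abs() > 1     # time difference greater than 1 (hour)

# use the combined mask to update the original dataframe
df.loc[same_user & big_gap, 'newcol1'] = 1
df.loc[(same_user & big_gap).shift(-1, fill_value=False), 'newcol2'] = 1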

Answered by Andy Hayden

Use pandas (constructs) and vectorize your code, i.e. don't use for loops; instead use pandas/numpy functions.

'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1.

Calculate these separately:

df['newcol1'] = df['User'].shift() == df['User']
df.loc[0, 'newcol1'] = True  # possibly tweak the first row?? (.ix is deprecated, use .loc)

df['newcol2'] = (df['Time'].shift() - df['Time']).abs() > 1


It's unclear to me what the purpose of Col1 is, but general Python objects in columns don't scale well (you can't use the fast path and the contents are scattered in memory). Most of the time you can get away with using something else...

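For example (my illustration of the point, not code from the answer), the first element of each list in Col1 can be pulled out once into an ordinary column, after which comparisons stay vectorized:

# .str[0] indexes into each element of an object column, including lists
df['first_animal'] = df['Col1'].str[0]
# later comparisons are cheap, e.g. "did the first animal change since the previous row?"
animal_changed = df['first_animal'] != df['first_animal'].shift()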



Cython is the very last option, and not needed in 99% of use-cases, but see the enhancing performance section of the docs for tips.
