Pandas dataframe apply refer to previous row to calculate difference

Notice: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/33461135/

Time: 2020-09-14 00:08:12  Source: igfitidea

Tags: python, pandas, dataframe, apply

Asked by beta

I have the following pandas dataframe containing 2 columns (simplified). The first column contains player names and the second column contains dates (datetime objects):

  player    date
  A         2010-01-01
  A         2010-01-09
  A         2010-01-11
  A         2010-01-15
  B         2010-02-01
  B         2010-02-10
  B         2010-02-21
  B         2010-02-23
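
To make the example reproducible, the table above can be rebuilt as a small frame. This is only a sketch: the variable name df matches the question, and parsing the dates with pd.to_datetime is an assumption about how the original datetime column was created.

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-09", "2010-01-11", "2010-01-15",
        "2010-02-01", "2010-02-10", "2010-02-21", "2010-02-23",
    ]),
})

# The date column should have a datetime64 dtype.
print(df.dtypes)
```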

I want to add a column diff which represents, for each player, the time difference in days between consecutive dates. The result should look like this:

  player    date            diff
  A         2010-01-01      0
  A         2010-01-09      8
  A         2010-01-11      2
  A         2010-01-15      4
  B         2010-02-01      0
  B         2010-02-10      9
  B         2010-02-21      11
  B         2010-02-23      2

The first row has 0 for diff, because there is no earlier date. The second row shows 8, because the difference between 2010-01-01 and 2010-01-09 is eight days.

The problem is not calculating the day difference between two datetime objects. I am just not sure how to add the new column. I know that I have to do a groupby first (df.groupby('player')) and then use apply (or maybe transform?). However, I am stuck because, to calculate the difference, I need to refer to the previous row inside the apply function, and I don't know how to do that, if it is possible at all.

Thank you very much.

UPDATE: After trying both proposed solutions below, I found that they did not work with my code. After much headache, I discovered that my data had duplicate indices. A simple df.reset_index() solved my issue, and the proposed solutions then worked. Since both solutions work but I can only mark one as correct, I will choose the more concise one. Thanks to both of you, though!
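
The duplicate-index problem from the update can be illustrated as follows. This is a minimal sketch on a hypothetical frame, not the asker's data. Passing drop=True is an added choice here (the update only mentions df.reset_index()); it discards the old index instead of keeping it as a column.

```python
import pandas as pd

# Hypothetical frame whose index contains duplicates,
# as can happen after concatenating two frames.
df = pd.DataFrame(
    {
        "player": ["A", "A", "B", "B"],
        "date": pd.to_datetime(
            ["2010-01-01", "2010-01-09", "2010-02-01", "2010-02-10"]
        ),
    },
    index=[0, 1, 0, 1],
)

print(df.index.is_unique)  # False: labels 0 and 1 each appear twice

# Replace the index with a fresh 0..n-1 range; drop=True discards the old labels.
df = df.reset_index(drop=True)
print(df.index.is_unique)  # True
```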

Accepted answer by Alex Riley

You can simply write:

import pandas as pd
df['difference'] = df.groupby('player')['date'].diff().fillna(pd.Timedelta(0))  # fill each player's first row (NaT) with a zero timedelta

This gives the new timedelta column with the correct values:

  player       date  difference
0      A 2010-01-01      0 days
1      A 2010-01-09      8 days
2      A 2010-01-11      2 days
3      A 2010-01-15      4 days
4      B 2010-02-01      0 days
5      B 2010-02-10      9 days
6      B 2010-02-21     11 days
7      B 2010-02-23      2 days

(I've used the name "difference" instead of "diff" to distinguish the name from the method diff.)
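
The question asks for plain integers rather than timedelta values. One possible way to get them, sketched here on a shortened copy of the data, is the .dt.days accessor. The column name diff_days is a hypothetical choice for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2010-01-01", "2010-01-09", "2010-01-11", "2010-02-01", "2010-02-10"]
    ),
})

# diff() leaves NaT on each player's first row; .dt.days turns that into NaN,
# which is filled with 0 before casting to int.
df["diff_days"] = (
    df.groupby("player")["date"].diff().dt.days.fillna(0).astype(int)
)
print(df["diff_days"].tolist())  # [0, 8, 2, 0, 9]
```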

Answer by Nader Hisham

Another way, if you want to implement it manually, is the following:

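import pandas as pd

def date_diff(df):
    # Work on a copy so the group passed in by groupby is not modified in place.
    df = df.copy()
    # Subtract the previous row's date; the first row of each group has none, so fill with zero.
    df['difference'] = (df['date'] - df['date'].shift()).fillna(pd.Timedelta(0))
    return df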

In [30]:
df_final = df.groupby(df['player'], group_keys=False).apply(date_diff)
df_final
Out[30]:
player  date    difference
A   2010-01-01  0 days
A   2010-01-09  8 days
A   2010-01-11  2 days
A   2010-01-15  4 days
B   2010-02-01  0 days
B   2010-02-10  9 days
B   2010-02-21  11 days
B   2010-02-23  2 days
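
The same shift-based idea can also be written without apply, by shifting inside each group directly. This is a sketch on a shortened copy of the data. The intermediate name prev_date is introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2010-01-01", "2010-01-09", "2010-02-01", "2010-02-10"]
    ),
})

# shift() within each group yields the previous date for the same player
# (NaT on that player's first row).
prev_date = df.groupby("player")["date"].shift()

# Subtract and replace the missing first-row values with a zero timedelta.
df["difference"] = (df["date"] - prev_date).fillna(pd.Timedelta(0))
print(df)
```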

Answer by MiaeKim

shift() is a good function; however, if you need to avoid data duplication, I would suggest the following method.

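import numpy as np

def date_diff(row):
    # Positional index of this row in the global df (requires a unique index).
    index = df.index.get_loc(row.name)
    if index == 0:
        return np.nan  # the first row has no previous row
    prev_row = df.iloc[index - 1]
    return row['date'] - prev_row['date']

# Note: this compares each row with the one before it, regardless of player.
df['difference'] = df.apply(date_diff, axis=1)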
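
The row-wise function above looks only at the immediately preceding row, so the first row of player B would be compared with the last row of player A. A possible variant that respects player boundaries is sketched below. The helper name date_diff_same_player and its frame parameter are hypothetical, and the sketch assumes a unique index and that each player's rows are contiguous and sorted by date.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2010-01-01", "2010-01-09", "2010-02-01", "2010-02-10"]
    ),
})

def date_diff_same_player(row, frame):
    # Hypothetical helper: find this row's position (needs a unique index).
    pos = frame.index.get_loc(row.name)
    # No previous row, or the previous row belongs to another player.
    if pos == 0 or frame.iloc[pos - 1]["player"] != row["player"]:
        return pd.NaT
    return row["date"] - frame.iloc[pos - 1]["date"]

# Extra keyword arguments to DataFrame.apply are forwarded to the function.
df["difference"] = df.apply(date_diff_same_player, axis=1, frame=df)
print(df)
```

Row-wise apply calls a Python function once per row, so on large frames the groupby-based versions are likely to be considerably faster.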