Pandas dataframe apply refer to previous row to calculate difference
Source: http://stackoverflow.com/questions/33461135/
License: content from StackOverflow, provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors.
Asked by beta
I have the following pandas dataframe containing 2 columns (simplified). The first column contains player names and the second column contains dates (datetime objects):
player date
A 2010-01-01
A 2010-01-09
A 2010-01-11
A 2010-01-15
B 2010-02-01
B 2010-02-10
B 2010-02-21
B 2010-02-23
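For anyone who wants to reproduce the example, the frame above could be built roughly as follows. This is only a sketch: the variable name df follows the usage later in the thread, and the construction itself is an assumption, since the question does not show how the data was loaded.

```python
import pandas as pd

# Rebuild the sample data from the question (construction is assumed,
# the question only shows the printed frame)
df = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-09", "2010-01-11", "2010-01-15",
        "2010-02-01", "2010-02-10", "2010-02-21", "2010-02-23",
    ]),
})

# The date column should be a datetime64 dtype, not plain strings
print(df.dtypes)
```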
I want to add a column diff which represents the time difference in days per player. The result should look like this:
player date diff
A 2010-01-01 0
A 2010-01-09 8
A 2010-01-11 2
A 2010-01-15 4
B 2010-02-01 0
B 2010-02-10 9
B 2010-02-21 11
B 2010-02-23 2
The first row has 0 for diff because there is no earlier date. The second row shows 8 because the difference between 2010-01-01 and 2010-01-09 is eight days.
The problem is not calculating the day difference between two datetime objects; I am just not sure how to add the new column. I know that I have to do a groupby first (df.groupby('player')) and then use apply (or maybe transform?). However, I am stuck because, to calculate the difference, I need to refer to the previous row inside the apply function, and I don't know how to do that, if it is possible at all.
Thank you very much.
UPDATE: After trying both proposed solutions below, I found that they did not work with my code. After much headache, I discovered that my data had duplicate indices. Once I knew that, a simple df.reset_index() solved my issue and the proposed solutions worked. Since both solutions work but I can only mark one as correct, I will choose the more concise solution. Thanks to both of you, though!
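The duplicate-index situation described in the update can be checked for and repaired along these lines. This is a hedged sketch on a made-up four-row frame, not the asker's actual data. It passes drop=True, which the update does not mention; without it, reset_index() keeps the old index as an extra column.

```python
import pandas as pd

# Hypothetical frame whose index labels repeat (e.g. after concatenating pieces)
df = pd.DataFrame(
    {
        "player": ["A", "A", "B", "B"],
        "date": pd.to_datetime(
            ["2010-01-01", "2010-01-09", "2010-02-01", "2010-02-10"]
        ),
    },
    index=[0, 1, 0, 1],
)

# False here, because the labels 0 and 1 each appear twice
print(df.index.is_unique)

# reset_index returns a new frame with a fresh 0..n-1 index;
# drop=True discards the old labels instead of keeping them as a column
df = df.reset_index(drop=True)
print(df.index.is_unique)
```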
Accepted answer by Alex Riley
You can simply write:
df['difference'] = df.groupby('player')['date'].diff().fillna(0)
This gives the new timedelta column with the correct values:
player date difference
0 A 2010-01-01 0 days
1 A 2010-01-09 8 days
2 A 2010-01-11 2 days
3 A 2010-01-15 4 days
4 B 2010-02-01 0 days
5 B 2010-02-10 9 days
6 B 2010-02-21 11 days
7 B 2010-02-23 2 days
(I've used the name "difference" instead of "diff" to distinguish the column name from the method diff.)
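The question asked for plain day counts rather than a timedelta column. One possible way to get there is sketched below, assuming the sample data from the question. It fills the missing first-row values with an explicit pd.Timedelta(0) instead of the integer 0, which is a deliberate substitution here: how an integer fill is treated on a timedelta column has varied between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-09", "2010-01-11", "2010-01-15",
        "2010-02-01", "2010-02-10", "2010-02-21", "2010-02-23",
    ]),
})

# Difference to the previous row within each player;
# the first row of every player has no predecessor and becomes NaT
delta = df.groupby("player")["date"].diff()

# Replace NaT with a zero-length timedelta, then keep only the whole days
df["diff"] = delta.fillna(pd.Timedelta(0)).dt.days

print(df)
```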
Answer by Nader Hisham
Another way, if you want to implement it manually, is to do the following:
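import pandas as pd

def date_diff(group):
    # Work on a copy so the frame passed in by groupby is not modified in place
    group = group.copy()
    # Subtract the previous row's date within this player's group
    group['difference'] = group['date'] - group['date'].shift()
    # The first row of each group has no predecessor; use a zero timedelta
    group['difference'] = group['difference'].fillna(pd.Timedelta(0))
    return group

In [30]:
df_final = df.groupby(df['player']).apply(date_diff)
df_final
Out[30]: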
player date difference
A 2010-01-01 0 days
A 2010-01-09 8 days
A 2010-01-11 2 days
A 2010-01-15 4 days
B 2010-02-01 0 days
B 2010-02-10 9 days
B 2010-02-21 11 days
B 2010-02-23 2 days
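The same shift-based idea can also be written without a helper function, because groupby objects expose shift() directly. The following is a sketch on the question's sample data, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-09", "2010-01-11", "2010-01-15",
        "2010-02-01", "2010-02-10", "2010-02-21", "2010-02-23",
    ]),
})

# Previous date within each player's group, aligned to the original index;
# each player's first row gets NaT
prev_date = df.groupby("player")["date"].shift()

# Row-by-row subtraction, with a zero timedelta where there is no previous row
df["difference"] = (df["date"] - prev_date).fillna(pd.Timedelta(0))

print(df)
```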
Answer by MiaeKim
shift() is a good function; however, if you need to avoid data duplication, I would suggest the following method.
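import numpy as np

def date_diff(row):
    # Positional location of this row; get_loc needs a unique index for this
    index = df.index.get_loc(row.name)
    # The very first row of the frame has no previous row
    if index == 0:
        return np.nan
    # Note: this is the physically previous row, regardless of player
    prev_row = df.iloc[index - 1]
    return row['date'] - prev_row['date']

df['difference'] = df.apply(date_diff, axis=1)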
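Because the row-wise function above looks at the physically previous row, on the question's sample data it would compare B's first date with A's last one. A vectorized sketch that leaves those boundary rows empty is shown below. It assumes the rows are already sorted by player and then by date, and the helper name same_player is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-09", "2010-01-11", "2010-01-15",
        "2010-02-01", "2010-02-10", "2010-02-21", "2010-02-23",
    ]),
})

# True where the previous row belongs to the same player
# (the first row compares against a missing value and is therefore False)
same_player = df["player"].eq(df["player"].shift())

# Difference to the previous row, kept only where the player did not change;
# all other rows are left as NaT
df["difference"] = (df["date"] - df["date"].shift()).where(same_player)

print(df)
```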