Using Pandas .diff() on a time series column with a groupby
Original question: http://stackoverflow.com/questions/37033957/
Notice: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and keep the same license.
Asked by user2242044
I have a CSV file of customer purchases, in no particular order, that I read into a Pandas DataFrame. I'd like to add a column that shows, for each purchase, how much time has passed since that customer's previous purchase. I'm not sure where the differences I get are coming from, but they are much too large (even if they are in seconds).
CSV:
Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015
Python:
import pandas as pd
import time
start = time.time()
data = pd.read_csv('data.csv', low_memory=False)
data = data.sort_values(by=['Customer Id', 'Purchase Date'])
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'])
data['Purchase Difference'] = (data.groupby(['Customer Id'])['Purchase Date']
                                   .diff()
                                   .fillna('-')
                              )
print(data)
Output:
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01    2678400000000000
4         2322    2015-03-01    2419200000000000
0         4543    2015-01-01                   -
1         4543    2015-02-05    3024000000000000
2         4543    2015-03-15    3283200000000000
Desired Output:
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01                   -
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
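As a quick sanity check on the numbers in the question: the large integers in the actual output look like the desired gaps expressed in nanoseconds (timedelta64[ns] is the default resolution pandas uses for timedeltas). The sketch below uses only the standard library and the first three values copied from that output; the variable names are illustrative, not part of the original post.

```python
from datetime import timedelta

# First three raw values copied from the "Purchase Difference" column above.
raw_values = [2678400000000000, 2419200000000000, 3024000000000000]

# Assumption: the values are nanoseconds. timedelta has no nanosecond
# argument, so convert to microseconds first (1 microsecond = 1000 ns).
days = [timedelta(microseconds=value // 1000).days for value in raw_values]

print(days)  # [31, 28, 35]
```

If that reading is right, the differences themselves were computed correctly and only their representation changed, most likely when .fillna('-') mixed a string into the timedelta column.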
Answered by Alexander
You can just apply diff to the Purchase Date column once it has been converted to a Timestamp.
df['Purchase Date'] = pd.to_datetime(df['Purchase Date'])
df.sort_values(['Customer Id', 'Purchase Date'], inplace=True)
# Gaps longer than one day become "N days"; NaT (a customer's first purchase)
# and gaps of one day or less become an empty string.
df['Purchase Difference'] = [
    str(n.days) + ' days' if n > pd.Timedelta(days=1) else ''
    for n in df.groupby('Customer Id', sort=False)['Purchase Date'].diff()
]
>>> df
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
6         4543    2015-03-15
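Note that the list comprehension above returns an empty string for any gap of one day or less, not only for a customer's first purchase. If singular and plural labels are wanted, the cases can be spelled out in a small helper. This is only a sketch: format_gap is a hypothetical name, the data is the six-row sample from the question, and the explicit date format is an assumption about how the dates are written.

```python
import pandas as pd


def format_gap(gap):
    """Return '' for a missing gap, otherwise e.g. '1 day' or '31 days'."""
    if pd.isna(gap):
        return ''
    return '{} day{}'.format(gap.days, '' if gap.days == 1 else 's')


# Sample data from the question; month/day/year format is assumed.
df = pd.DataFrame({
    'Customer Id': [4543, 4543, 4543, 2322, 2322, 2322],
    'Purchase Date': pd.to_datetime(
        ['1/1/2015', '2/5/2015', '3/15/2015', '1/1/2015', '3/1/2015', '2/1/2015'],
        format='%m/%d/%Y',
    ),
})

# Sort so that each purchase directly follows the same customer's previous one.
df = df.sort_values(['Customer Id', 'Purchase Date'])
gaps = df.groupby('Customer Id', sort=False)['Purchase Date'].diff()
df['Purchase Difference'] = [format_gap(gap) for gap in gaps]

print(df)
```

With this helper a repeat purchase on the same date is labelled "0 days" rather than left blank, which may or may not be what you want.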
Answered by jezrael
I think you can add the parse_dates parameter to read_csv to parse the datetime column, then use sort_values, and finally groupby with diff:
import pandas as pd
import io
temp=u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""
# after testing, replace io.StringIO(temp) with the filename
data = pd.read_csv(io.StringIO(temp), parse_dates=['Purchase Date'])
data.sort_values(by=['Customer Id', 'Purchase Date'], inplace=True)
data['Purchase Difference'] = data.groupby(['Customer Id'])['Purchase Date'].diff()
print(data)
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                 NaT
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01                 NaT
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
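The result above keeps NaT for each customer's first purchase, while the desired output in the question shows "-". One possible way to get there, sketched under a few assumptions, is to keep a numeric day count via the .dt.days accessor and build the display column separately, so the timedelta column is never filled with a string. The column name Days Since Previous and the explicit month/day/year format are illustrative choices, not part of the original answer.

```python
import io

import pandas as pd

csv_text = u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""

data = pd.read_csv(io.StringIO(csv_text))
# Assumption: dates are month/day/year; stating the format avoids ambiguity.
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'], format='%m/%d/%Y')
data = data.sort_values(by=['Customer Id', 'Purchase Date'])

diff = data.groupby('Customer Id')['Purchase Date'].diff()

# Numeric column: whole days as floats, NaN for each customer's first purchase.
data['Days Since Previous'] = diff.dt.days

# Display column: '-' where there is no previous purchase.
data['Purchase Difference'] = [
    '-' if pd.isna(d) else '{} days'.format(int(d))
    for d in data['Days Since Previous']
]

print(data)
```

Keeping the numeric column alongside the text one means later calculations, such as an average gap per customer, can still use real numbers.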