Using Pandas .diff() on a time series column with groupby

Question

Asked by user2242044

I have a CSV file of customer purchases, in no particular order, that I read into a Pandas DataFrame. I'd like to add a column that shows, for each purchase, how much time has passed since that customer's previous purchase. I'm not sure where it's getting the differences from, but they are much too large (even if they are in seconds).

CSV:

Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015

Python:

import pandas as pd
import time
start = time.time()
data = pd.read_csv('data.csv', low_memory=False)
data = data.sort_values(by=['Customer Id', 'Purchase Date'])
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'])
data['Purchase Difference'] = (data.groupby(['Customer Id'])['Purchase Date']
                         .diff()
                         .fillna('-')
                       )
print(data)

Output:

    Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01    2678400000000000
4         2322    2015-03-01    2419200000000000
0         4543    2015-01-01                   -
1         4543    2015-02-05    3024000000000000
2         4543    2015-03-15    3283200000000000

Desired Output:

   Customer Id Purchase Date  Purchase Difference
3         2322    2015-01-01                  -
5         2322    2015-02-01              31 days
4         2322    2015-03-01              28 days
0         4543    2015-01-01                  -
1         4543    2015-02-05              35 days
2         4543    2015-03-15              38 days
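
A quick sanity check on the numbers above, offered as a sketch rather than a diagnosis: if the large values are read as integer nanoseconds (an assumption, though it is the unit pandas uses internally for timedeltas), they line up with the desired day counts. The variable names here are illustrative only.

```python
import pandas as pd

# Three of the raw values from the question's output,
# interpreted as integer nanoseconds (assumption)
raw_ns = [2678400000000000, 2419200000000000, 3024000000000000]

# Convert the integers to timedeltas and read off the whole-day component
as_timedelta = pd.to_timedelta(raw_ns, unit="ns")
days = [int(d) for d in as_timedelta.days]
print(days)
```

So the differences themselves appear to be right; what is lost is the timedelta representation.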

Answer 1

Answered by Alexander

You can just apply diff to the Purchase Date column once it has been converted to a Timestamp.

# df holds the question's six rows plus one repeat purchase
# (index 6: customer 4543 on 3/15/2015)
df['Purchase Date'] = pd.to_datetime(df['Purchase Date'])
df.sort_values(['Customer Id', 'Purchase Date'], inplace=True)
# NaT (each customer's first purchase) becomes an empty string;
# every other gap is formatted as "N day" or "N days"
df['Purchase Difference'] = [
    (str(n.days) + ' day' + ('' if n.days == 1 else 's')) if pd.notnull(n) else ''
    for n in df.groupby('Customer Id', sort=False)['Purchase Date'].diff()]

>>> df
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                    
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01                    
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
6         4543    2015-03-15              0 days
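
If a numeric column is more useful than formatted strings, one possible variation (a sketch, not part of the answer above) is to take the whole-day component of each difference with the .dt.days accessor. The column name "Days Since Last" and the variable names are made up for this illustration; the data is the sample CSV from the question.

```python
import io
import pandas as pd

# Sample data from the question
csv_text = """Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Purchase Date"])
df = df.sort_values(["Customer Id", "Purchase Date"])

# Per-customer gap between consecutive purchases (NaT for the first purchase)
gap = df.groupby("Customer Id")["Purchase Date"].diff()

# Whole days as numbers; the first purchase of each customer becomes NaN
df["Days Since Last"] = gap.dt.days
print(df)
```

Keeping the gap numeric means it can still be summed, averaged, or filtered, which a string column does not allow.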

Answer 2

Answered by jezrael

I think you can add the parse_dates parameter to read_csv to parse the datetimes, then call sort_values, and finally use groupby with diff:

import pandas as pd
import io

temp=u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""
# after testing, replace io.StringIO(temp) with the filename
data = pd.read_csv(io.StringIO(temp), parse_dates=['Purchase Date'])

data.sort_values(by=['Customer Id', 'Purchase Date'], inplace=True)

data['Purchase Difference'] = data.groupby(['Customer Id'])['Purchase Date'].diff()
print(data)
   Customer Id Purchase Date  Purchase Difference
3         2322    2015-01-01                  NaT
5         2322    2015-02-01              31 days
4         2322    2015-03-01              28 days
0         4543    2015-01-01                  NaT
1         4543    2015-02-05              35 days
2         4543    2015-03-15              38 days
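
To get the '-' placeholder from the question's desired output instead of NaT, one option is to format each difference explicitly. This is a minimal sketch: format_gap is a hypothetical helper introduced here, not a pandas function, and the frame is built by hand from the question's already-sorted sample rows.

```python
import pandas as pd

def format_gap(gap):
    """Render a Timedelta as 'N days', or '-' when the value is missing (NaT)."""
    if pd.isna(gap):
        return "-"
    return f"{gap.days} days"

# The question's sample data, already sorted by customer and date
purchases = pd.DataFrame({
    "Customer Id": [2322, 2322, 2322, 4543, 4543, 4543],
    "Purchase Date": pd.to_datetime([
        "2015-01-01", "2015-02-01", "2015-03-01",
        "2015-01-01", "2015-02-05", "2015-03-15",
    ]),
})

# Per-customer differences, then formatted one value at a time
gaps = purchases.groupby("Customer Id")["Purchase Date"].diff()
purchases["Purchase Difference"] = [format_gap(g) for g in gaps]
print(purchases)
```

The resulting column holds plain strings, so it is best treated as a display step applied after any calculations on the differences.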
