Pandas: Average value for the past n days

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original StackOverflow post: http://stackoverflow.com/questions/36969174/

Pandas: Average value for the past n days

python, pandas, time-series, aggregation

Asked by ahoosh

I have a Pandas data frame like this:

import pandas as pd

test = pd.DataFrame({ 'Date' : ['2016-04-01','2016-04-01','2016-04-02',
                             '2016-04-02','2016-04-03','2016-04-04',
                             '2016-04-05','2016-04-06','2016-04-06'],
                      'User' : ['Mike','John','Mike','John','Mike','Mike',
                             'Mike','Mike','John'],
                      'Value' : [1,2,1,3,4.5,1,2,3,6]
                })

As you can see below, the data set does not necessarily have observations for every day:

         Date  User  Value
0  2016-04-01  Mike    1.0
1  2016-04-01  John    2.0
2  2016-04-02  Mike    1.0
3  2016-04-02  John    3.0
4  2016-04-03  Mike    4.5
5  2016-04-04  Mike    1.0
6  2016-04-05  Mike    2.0
7  2016-04-06  Mike    3.0
8  2016-04-06  John    6.0

I'd like to add a new column which shows the average value for each user over the past n days (in this case n = 2) if at least one of those days is available; otherwise it should be NaN. For example, on 2016-04-06 John gets a NaN because he has no data for 2016-04-05 and 2016-04-04. So the result will be something like this:

         Date  User  Value  Value_Average_Past_2_days
0  2016-04-01  Mike    1.0                        NaN
1  2016-04-01  John    2.0                        NaN
2  2016-04-02  Mike    1.0                       1.00
3  2016-04-02  John    3.0                       2.00
4  2016-04-03  Mike    4.5                       1.00
5  2016-04-04  Mike    1.0                       2.75
6  2016-04-05  Mike    2.0                       2.75
7  2016-04-06  Mike    3.0                       1.50
8  2016-04-06  John    6.0                        NaN

After reading several posts in the forum, it seems that I should use a combination of group_by and a customized rolling_mean, but I couldn't quite figure out how to do it.

Accepted answer by jezrael

I think you can first convert the column Date with to_datetime, then fill in the missing days by groupby with resample, and last apply rolling:

test['Date'] = pd.to_datetime(test['Date'])

df = test.groupby('User').apply(lambda x: x.set_index('Date').resample('1D').first())
print(df)
                 User  Value
User Date                   
John 2016-04-01  John    2.0
     2016-04-02  John    3.0
     2016-04-03   NaN    NaN
     2016-04-04   NaN    NaN
     2016-04-05   NaN    NaN
     2016-04-06  John    6.0
Mike 2016-04-01  Mike    1.0
     2016-04-02  Mike    1.0
     2016-04-03  Mike    4.5
     2016-04-04  Mike    1.0
     2016-04-05  Mike    2.0
     2016-04-06  Mike    3.0

df1 = (df.groupby(level=0)['Value']
         .apply(lambda x: x.shift().rolling(min_periods=1, window=2).mean())
         .reset_index(name='Value_Average_Past_2_days'))
print(df1)
    User       Date  Value_Average_Past_2_days
0   John 2016-04-01                        NaN
1   John 2016-04-02                       2.00
2   John 2016-04-03                       2.50
3   John 2016-04-04                       3.00
4   John 2016-04-05                        NaN
5   John 2016-04-06                        NaN
6   Mike 2016-04-01                        NaN
7   Mike 2016-04-02                       1.00
8   Mike 2016-04-03                       1.00
9   Mike 2016-04-04                       2.75
10  Mike 2016-04-05                       2.75
11  Mike 2016-04-06                       1.50

print(pd.merge(test, df1, on=['Date', 'User'], how='left'))
        Date  User  Value  Value_Average_Past_2_days
0 2016-04-01  Mike    1.0                        NaN
1 2016-04-01  John    2.0                        NaN
2 2016-04-02  Mike    1.0                       1.00
3 2016-04-02  John    3.0                       2.00
4 2016-04-03  Mike    4.5                       1.00
5 2016-04-04  Mike    1.0                       2.75
6 2016-04-05  Mike    2.0                       2.75
7 2016-04-06  Mike    3.0                       1.50
8 2016-04-06  John    6.0                        NaN
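
(The snippets above use pre-0.18 syntax: print statements without parentheses and the old rolling API. A minimal sketch of the same idea in current pandas is shown below; it is not part of the original answer and assumes a recent pandas version where groupby(...).resample(...) and the Rolling object are available.)

import pandas as pd

test['Date'] = pd.to_datetime(test['Date'])

# Fill in the missing calendar days per user (absent days become NaN).
daily = (test.set_index('Date')
             .groupby('User')
             .resample('D')['Value']
             .mean())

# Mean of the previous 2 days, requiring at least one observation in the window.
past_mean = (daily.groupby(level='User')
                  .apply(lambda s: s.droplevel('User')
                                    .shift(1)
                                    .rolling(2, min_periods=1)
                                    .mean())
                  .rename('Value_Average_Past_2_days'))

result = test.merge(past_mean.reset_index(), on=['User', 'Date'], how='left')
print(result)

As in the original answer, min_periods=1 is what makes a single observed day enough, while two missing days in a row produce NaN.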

Answered by Alexander

n = 2

# Cast your dates as timestamps.
test['Date'] = pd.to_datetime(test.Date)

# Create a daily index spanning the range of the original index.
idx = pd.date_range(test.Date.min(), test.Date.max(), freq='D')

# Pivot by Dates and Users.
df = test.pivot(index='Date', values='Value', columns='User').reindex(idx)
>>> df.head(3)
User        John  Mike
2016-04-01     2   1.0
2016-04-02     3   1.0
2016-04-03   NaN   4.5

# Apply a rolling mean on the above dataframe and reset the index.
df2 = (pd.rolling_mean(df.shift(), n, min_periods=1)
       .reset_index()
       .drop_duplicates())

# For Pandas 0.18.0+
df2 = (df.shift().rolling(window=n, min_periods=1).mean()
       .reset_index()
       .drop_duplicates())

# Melt the result back into the original form.
df3 = (pd.melt(df2, id_vars='Date', value_name='Value')
       .sort_values(['Date', 'User'])
       .reset_index(drop=True))
>>> df3.head()
        Date  User  Value
0 2016-04-01  John    NaN
1 2016-04-01  Mike    NaN
2 2016-04-02  John    2.0
3 2016-04-02  Mike    1.0
4 2016-04-03  John    2.5

# Merge the results back into the original dataframe.
>>> test.merge(df3, on=['Date', 'User'], how='left', 
               suffixes=['', '_Average_past_{0}_days'.format(n)])

        Date  User  Value  Value_Average_past_2_days
0 2016-04-01  Mike    1.0                        NaN
1 2016-04-01  John    2.0                        NaN
2 2016-04-02  Mike    1.0                       1.00
3 2016-04-02  John    3.0                       2.00
4 2016-04-03  Mike    4.5                       1.00
5 2016-04-04  Mike    1.0                       2.75
6 2016-04-05  Mike    2.0                       2.75
7 2016-04-06  Mike    3.0                       1.50
8 2016-04-06  John    6.0                        NaN

Summary

n = 2
test['Date'] = pd.to_datetime(test.Date)
idx = pd.date_range(test.Date.min(), test.Date.max(), freq='D')
df = test.pivot(index='Date', values='Value', columns='User').reindex(idx)
df2 = (df.shift().rolling(window=n, min_periods=1).mean()  # pandas 0.18+ rolling syntax
       .reset_index()
       .drop_duplicates())
df3 = (pd.melt(df2, id_vars='Date', value_name='Value')
       .sort_values(['Date', 'User'])
       .reset_index(drop=True))
test.merge(df3, on=['Date', 'User'], how='left', 
           suffixes=['', '_Average_past_{0}_days'.format(n)])
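
For completeness, the same column can also be computed with a time-based window, without reindexing to daily frequency or pivoting: rolling('2D') with closed='left' covers exactly the two days before each row, and an empty window yields NaN. This is a sketch of an alternative not given in either answer, assuming pandas 0.20+ where offset windows accept the closed argument:

import pandas as pd

test['Date'] = pd.to_datetime(test['Date'])

def past_2_day_mean(g):
    g = g.sort_values('Date')
    # Offset window: for a row on day d, '2D' with closed='left' covers [d-2, d),
    # i.e. exactly the two previous days; an empty window yields NaN.
    out = g.set_index('Date')['Value'].rolling('2D', closed='left').mean()
    out.index = g.index  # restore the original row labels so the result aligns back
    return out

test['Value_Average_Past_2_days'] = (
    test.groupby('User', group_keys=False)[['Date', 'Value']].apply(past_2_day_mean)
)
print(test)

Here the function is applied per user and the original row index is restored inside it, so the result assigns straight back onto test without any merge.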