Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original: http://stackoverflow.com/questions/48103845/

Time: 2020-09-14 05:00:57  Source: igfitidea

Python Pandas Sum Values in Columns If date between 2 dates

Tags: python, pandas, dataframe, pandas-groupby, melt

Asked by clg4

I have a dataframe df which can be created with this:

import datetime
import pandas as pd

data={'id':[1,1,1,1,2,2,2,2],
      'date1':[datetime.date(2016,1,1),datetime.date(2016,1,2),datetime.date(2016,1,3),datetime.date(2016,1,4),
               datetime.date(2016,1,2),datetime.date(2016,1,4),datetime.date(2016,1,3),datetime.date(2016,1,1)],
      'date2':[datetime.date(2016,1,5),datetime.date(2016,1,3),datetime.date(2016,1,5),datetime.date(2016,1,5),
               datetime.date(2016,1,4),datetime.date(2016,1,5),datetime.date(2016,1,4),datetime.date(2016,1,1)],
      'score1':[5,7,3,2,9,3,8,3],
      'score2':[1,3,0,5,2,20,7,7]}
df=pd.DataFrame.from_dict(data)

And looks like this:
   id       date1       date2  score1  score2
0   1  2016-01-01  2016-01-05       5       1
1   1  2016-01-02  2016-01-03       7       3
2   1  2016-01-03  2016-01-05       3       0
3   1  2016-01-04  2016-01-05       2       5
4   2  2016-01-02  2016-01-04       9       2
5   2  2016-01-04  2016-01-05       3      20
6   2  2016-01-03  2016-01-04       8       7
7   2  2016-01-01  2016-01-01       3       7

What I need to do is create two new columns, one for each of score1 and score2, which SUM the values of score1 and score2 respectively, based on whether the usedate is between date1 and date2. usedate is created by getting all dates between and including the date1 minimum and the date2 maximum. I used this to create the date range:

drange=pd.date_range(df.date1.min(),df.date2.max())    

The resulting dataframe newdf should look like:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26

For clarification, on usedate 2016-01-01, score1sum is 8, which is calculated by looking at the rows in df where 2016-01-01 is between and including date1 and date2, which sums row0(5) and row7(3). On usedate 2016-01-04, score2sum is 35, which is calculated by looking at the rows in df where 2016-01-04 is between and including date1 and date2, which sums row0(1), row2(0), row3(5), row4(2), row5(20), row6(7).
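The arithmetic for a single usedate can be checked with a boolean mask. This is a minimal sketch that rebuilds the question's data from date strings (an equivalent construction, not the asker's exact code); the names ud and covering are illustrative only.

```python
import pandas as pd

# Same values as the question's df, built from ISO date strings for brevity
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'date1': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                             '2016-01-02', '2016-01-04', '2016-01-03', '2016-01-01']),
    'date2': pd.to_datetime(['2016-01-05', '2016-01-03', '2016-01-05', '2016-01-05',
                             '2016-01-04', '2016-01-05', '2016-01-04', '2016-01-01']),
    'score1': [5, 7, 3, 2, 9, 3, 8, 3],
    'score2': [1, 3, 0, 5, 2, 20, 7, 7],
})

ud = pd.Timestamp('2016-01-04')  # the usedate being checked

# Keep only rows whose [date1, date2] interval contains ud (both ends inclusive)
covering = df[(df['date1'] <= ud) & (ud <= df['date2'])]

print(covering.index.tolist())   # [0, 2, 3, 4, 5, 6]
print(covering['score2'].sum())  # 35
```

Changing ud to pd.Timestamp('2016-01-01') selects rows 0 and 7, whose score1 values sum to 8.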

Maybe some kind of groupby, or melt then groupby?
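One way the groupby idea could work is to expand every row into one row per day it covers and then group by that day. The sketch below is only an illustration of that idea, not one of the posted answers; the names pieces and long_df are made up here, and the data is rebuilt from date strings rather than datetime.date objects.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'date1': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                             '2016-01-02', '2016-01-04', '2016-01-03', '2016-01-01']),
    'date2': pd.to_datetime(['2016-01-05', '2016-01-03', '2016-01-05', '2016-01-05',
                             '2016-01-04', '2016-01-05', '2016-01-04', '2016-01-01']),
    'score1': [5, 7, 3, 2, 9, 3, 8, 3],
    'score2': [1, 3, 0, 5, 2, 20, 7, 7],
})
drange = pd.date_range(df.date1.min(), df.date2.max())

# Expand each row into one row per day in its [date1, date2] interval
pieces = [
    pd.DataFrame({'usedate': pd.date_range(r.date1, r.date2),
                  'score1': r.score1,
                  'score2': r.score2})
    for r in df.itertuples()
]
long_df = pd.concat(pieces, ignore_index=True)

# Sum per day; reindex so days covered by no row still appear, with 0
newdf = (long_df.groupby('usedate')[['score1', 'score2']].sum()
         .reindex(drange, fill_value=0)
         .rename(columns={'score1': 'score1sum', 'score2': 'score2sum'})
         .rename_axis('usedate')
         .reset_index())
print(newdf)
```

The expanded frame has one row per (original row, covered day) pair, so this trades memory for a single groupby call; for long intervals that expansion may become large.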

Accepted answer by Scott Boston

You can use apply with a lambda function:

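# Convert datetime.date values to pandas timestamps so they compare with the date range
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# One row per calendar day between the earliest date1 and the latest date2
df1 = pd.DataFrame(index=pd.date_range(df.date1.min(), df.date2.max()), columns=['score1sum', 'score2sum'])

# x.name is the date of the current row; sum the scores of every df row whose interval contains it
df1[['score1sum','score2sum']] = df1.apply(lambda x: df.loc[(df.date1 <= x.name) & 
                                                            (x.name <= df.date2),
                                                            ['score1','score2']].sum(), axis=1)

df1.rename_axis('usedate').reset_index()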

Output:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26
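The per-date mask used with apply can also be evaluated for all dates at once with NumPy broadcasting. This is a hedged sketch of that variation, not part of the accepted answer; the names days, mask and sums are illustrative, and the data is rebuilt from date strings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'date1': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                             '2016-01-02', '2016-01-04', '2016-01-03', '2016-01-01']),
    'date2': pd.to_datetime(['2016-01-05', '2016-01-03', '2016-01-05', '2016-01-05',
                             '2016-01-04', '2016-01-05', '2016-01-04', '2016-01-01']),
    'score1': [5, 7, 3, 2, 9, 3, 8, 3],
    'score2': [1, 3, 0, 5, 2, 20, 7, 7],
})
drange = pd.date_range(df.date1.min(), df.date2.max())

days = drange.to_numpy()        # shape (n_days,)
d1 = df['date1'].to_numpy()     # shape (n_rows,)
d2 = df['date2'].to_numpy()

# mask[i, j] is True when day i lies inside row j's [date1, date2] interval
mask = (d1[None, :] <= days[:, None]) & (days[:, None] <= d2[None, :])

# Matrix product sums the scores of the covering rows for every day
sums = mask.astype(int) @ df[['score1', 'score2']].to_numpy()

newdf = pd.DataFrame(sums, columns=['score1sum', 'score2sum'])
newdf.insert(0, 'usedate', drange)
print(newdf)
```

The mask holds n_days × n_rows booleans, so this suits small and medium inputs; for very large frames the memory cost may outweigh the speedup.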

Answer by Peter Leimbigler

Method 1: list comprehensions

This is inelegant, but hey, it works! (EDIT: added a second method below.)

# Convert datetime.date to pandas timestamps for easier comparisons
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# solution
newdf = pd.DataFrame(data=drange, columns=['usedate'])
# for each usedate ud, get all df rows whose dates contain ud,
# then sum the scores of these rows
newdf['score1sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score1'].sum() for ud in drange]
newdf['score2sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score2'].sum() for ud in drange]

# output
newdf
     usedate  score1sum  score2sum
  2016-01-01          8          8
  2016-01-02         21          6
  2016-01-03         32         13
  2016-01-04         30         35
  2016-01-05         13         26

Method 2: a helper function with transform (or apply)

newdf = pd.DataFrame(data=drange, columns=['usedate'])

def sum_scores(d):
    return df[(df['date1'] <= d) & (df['date2'] >= d)][['score1', 'score2']].sum()

# apply works here too, and is about equally fast in my testing
newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores)

# newdf is the same as above

Timings are comparable

# Jupyter timeit cell magic
%%timeit 
newdf['score1sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score1'].sum() for d in drange]
newdf['score2sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score2'].sum() for d in drange]

100 loops, best of 3: 10.4 ms per loop

# Jupyter timeit line magic
%timeit newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores) 

100 loops, best of 3: 8.51 ms per loop
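Outside Jupyter, a similar comparison could be run with the standard-library timeit module. The sketch below is an assumption about how one might set that up, not the answerer's benchmark; method1 and method2 are illustrative wrappers (method2 uses apply, which the answer says works like transform here), and the numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'date1': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                             '2016-01-02', '2016-01-04', '2016-01-03', '2016-01-01']),
    'date2': pd.to_datetime(['2016-01-05', '2016-01-03', '2016-01-05', '2016-01-05',
                             '2016-01-04', '2016-01-05', '2016-01-04', '2016-01-01']),
    'score1': [5, 7, 3, 2, 9, 3, 8, 3],
    'score2': [1, 3, 0, 5, 2, 20, 7, 7],
})
drange = pd.date_range(df.date1.min(), df.date2.max())

def sum_scores(d):
    # Sum both score columns over rows whose interval contains d
    return df[(df['date1'] <= d) & (df['date2'] >= d)][['score1', 'score2']].sum()

def method1():
    # List comprehensions, one pass per score column
    s1 = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score1'].sum() for d in drange]
    s2 = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score2'].sum() for d in drange]
    return s1, s2

def method2():
    # Helper function applied to each date; returns a DataFrame with both sums
    return pd.Series(drange).apply(sum_scores)

runs = 20
t1 = timeit.timeit(method1, number=runs)
t2 = timeit.timeit(method2, number=runs)
print(f"method 1: {t1 / runs * 1000:.2f} ms per run")
print(f"method 2: {t2 / runs * 1000:.2f} ms per run")
```

With only eight rows and five dates, differences of a millisecond or two are within noise, so a larger synthetic frame would be needed to draw firm conclusions.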