Python Pandas: sum values in a column if a date is between two dates

Question

Asked by clg4

I have a dataframe df which can be created with this:

import datetime
import pandas as pd

data={'id':[1,1,1,1,2,2,2,2],
      'date1':[datetime.date(2016,1,1),datetime.date(2016,1,2),datetime.date(2016,1,3),datetime.date(2016,1,4),
               datetime.date(2016,1,2),datetime.date(2016,1,4),datetime.date(2016,1,3),datetime.date(2016,1,1)],
      'date2':[datetime.date(2016,1,5),datetime.date(2016,1,3),datetime.date(2016,1,5),datetime.date(2016,1,5),
               datetime.date(2016,1,4),datetime.date(2016,1,5),datetime.date(2016,1,4),datetime.date(2016,1,1)],
      'score1':[5,7,3,2,9,3,8,3],
      'score2':[1,3,0,5,2,20,7,7]}
df=pd.DataFrame.from_dict(data)

And it looks like this:

   id       date1       date2  score1  score2
0   1  2016-01-01  2016-01-05       5       1
1   1  2016-01-02  2016-01-03       7       3
2   1  2016-01-03  2016-01-05       3       0
3   1  2016-01-04  2016-01-05       2       5
4   2  2016-01-02  2016-01-04       9       2
5   2  2016-01-04  2016-01-05       3      20
6   2  2016-01-03  2016-01-04       8       7
7   2  2016-01-01  2016-01-01       3       7

What I need to do is create two columns, score1sum and score2sum, which sum the values of score1 and score2 respectively over the rows where usedate falls between date1 and date2 (inclusive). usedate is created by taking all dates from the minimum date1 through the maximum date2, inclusive. I used this to create the date range:

drange=pd.date_range(df.date1.min(),df.date2.max())

The resulting dataframe newdf should look like:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26

For clarification, on usedate 2016-01-01, score1sum is 8, which is calculated by looking at the rows in df where 2016-01-01 is between date1 and date2 (inclusive) and summing row 0 (5) and row 7 (3). On usedate 2016-01-04, score2sum is 35, which is calculated by looking at the rows in df where 2016-01-04 is between date1 and date2 (inclusive) and summing row 0 (1), row 2 (0), row 3 (5), row 4 (2), row 5 (20), and row 6 (7).
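The rule described above can be checked for a single date with a boolean mask. This is only a minimal sketch: it rebuilds the sample data from date strings instead of datetime.date objects, and the names `target`, `mask`, `matching_rows`, and `sums` are illustrative and do not come from the original post.

```python
import pandas as pd

# Sample data from the question, with dates written as strings for brevity.
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'date1': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                             '2016-01-02', '2016-01-04', '2016-01-03', '2016-01-01']),
    'date2': pd.to_datetime(['2016-01-05', '2016-01-03', '2016-01-05', '2016-01-05',
                             '2016-01-04', '2016-01-05', '2016-01-04', '2016-01-01']),
    'score1': [5, 7, 3, 2, 9, 3, 8, 3],
    'score2': [1, 3, 0, 5, 2, 20, 7, 7],
})

# One usedate to check by hand (illustrative name).
target = pd.Timestamp('2016-01-04')

# True for every row whose [date1, date2] interval contains the target date.
mask = (df['date1'] <= target) & (target <= df['date2'])

matching_rows = df.index[mask].tolist()
sums = df.loc[mask, ['score1', 'score2']].sum()

print(matching_rows)                   # [0, 2, 3, 4, 5, 6]
print(sums['score1'], sums['score2'])  # 30 35
```

Repeating this for every date in drange gives the desired newdf; the answers below differ mainly in how they perform that repetition.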

Maybe some kind of groupby, or melt then groupby?

Answer 1

Accepted answer by Scott Boston

You can use apply with a lambda function:

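# Convert datetime.date values to pandas timestamps so they compare with the index
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# Empty frame indexed by every date from the earliest date1 to the latest date2
df1 = pd.DataFrame(index=pd.date_range(df.date1.min(), df.date2.max()), columns = ['score1sum', 'score2sum'])

# For each date (x.name is the row's index label), sum the scores of all rows whose interval contains it
df1[['score1sum','score2sum']] = df1.apply(lambda x: df.loc[(df.date1 <= x.name) &
                                                            (x.name <= df.date2),
                                                            ['score1','score2']].sum(), axis=1)

df1.rename_axis('usedate').reset_index()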

Output:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26
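
For larger inputs, the per-date Python loop can be replaced by NumPy broadcasting. This is not the method from the answer above, only a sketch of an alternative under stated assumptions: the sample data is rebuilt from date strings, and the names `days` and `covers` are illustrative. It builds a boolean matrix with one row per usedate and one column per row of df, so memory use grows with len(drange) * len(df).

```python
import numpy as np
import pandas as pd

# Sample data from the question, with dates written as strings for brevity.
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'date1': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                             '2016-01-02', '2016-01-04', '2016-01-03', '2016-01-01']),
    'date2': pd.to_datetime(['2016-01-05', '2016-01-03', '2016-01-05', '2016-01-05',
                             '2016-01-04', '2016-01-05', '2016-01-04', '2016-01-01']),
    'score1': [5, 7, 3, 2, 9, 3, 8, 3],
    'score2': [1, 3, 0, 5, 2, 20, 7, 7],
})

drange = pd.date_range(df['date1'].min(), df['date2'].max())

# Column vector of dates, shape (len(drange), 1), so comparisons broadcast against df rows.
days = drange.values[:, None]

# covers[i, j] is True when drange[i] lies within [date1[j], date2[j]], inclusive.
covers = (df['date1'].values <= days) & (days <= df['date2'].values)

# Each matrix-vector product sums the scores of the covering rows for every date.
weights = covers.astype(np.int64)
newdf = pd.DataFrame({
    'usedate': drange,
    'score1sum': weights @ df['score1'].to_numpy(),
    'score2sum': weights @ df['score2'].to_numpy(),
})

print(newdf)
```

On the sample data this produces the same score1sum and score2sum values as the output shown above.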

Answer 2

Answer by Peter Leimbigler

Method 1: list comprehensions

This is inelegant, but hey, it works! (EDIT: added a second method below.)

# Convert datetime.date to pandas timestamps for easier comparisons
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# solution
newdf = pd.DataFrame(data=drange, columns=['usedate'])
# for each usedate ud, get all df rows whose dates contain ud,
# then sum the scores of these rows
newdf['score1sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score1'].sum() for ud in drange]
newdf['score2sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score2'].sum() for ud in drange]

# output
newdf
     usedate  score1sum  score2sum
  2016-01-01          8          8
  2016-01-02         21          6
  2016-01-03         32         13
  2016-01-04         30         35
  2016-01-05         13         26

Method 2: a helper function with `transform` (or `apply`)

newdf = pd.DataFrame(data=drange, columns=['usedate'])

def sum_scores(d):
    return df[(df['date1'] <= d) & (df['date2'] >= d)][['score1', 'score2']].sum()

# apply works here too, and is about equally fast in my testing
newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores)

# newdf is the same as above

Timings are comparable

# Jupyter timeit cell magic
%%timeit 
newdf['score1sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score1'].sum() for d in drange]
newdf['score2sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score2'].sum() for d in drange]

100 loops, best of 3: 10.4 ms per loop

# Jupyter timeit line magic
%timeit newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores) 

100 loops, best of 3: 8.51 ms per loop
