pandas python按周或月拆分pandas数据框,并根据这些sp对数据进行分组

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/13223360/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 20:28:54  来源:igfitidea点击:

python split a pandas data frame by week or month and group the data based on these sp

pythondatetimegroup-bydataframepandas

提问by George Thompson

DateOccurred    CostCentre  TimeDifference
03/09/2012  2073    28138
03/09/2012  6078    34844
03/09/2012  8273    31215
03/09/2012  8367    28160
03/09/2012  8959    32037
03/09/2012  9292    30118
03/09/2012  9532    34200
03/09/2012  9705    27240
03/09/2012  10085   31431
03/09/2012  10220   22555
04/09/2012  6078    41126
04/09/2012  7569    31101
04/09/2012  8273    30994
04/09/2012  8959    30064
04/09/2012  9532    34655
04/09/2012  9705    26475
04/09/2012  10085   31443
04/09/2012  10220   33970
05/09/2012  2073    28221
05/09/2012  6078    27894
05/09/2012  7569    29012
05/09/2012  8239    42208
05/09/2012  8273    31128
05/09/2012  8367    27993
05/09/2012  8959    20669
05/09/2012  9292    33070
05/09/2012  9532    8189
05/09/2012  9705    27540
05/09/2012  10085   28798
05/09/2012  10220   23164
06/09/2012  2073    28350
06/09/2012  6078    35648
06/09/2012  7042    27129
06/09/2012  7569    31546
06/09/2012  8239    39945
06/09/2012  8273    31107
06/09/2012  8367    27795
06/09/2012  9292    32974
06/09/2012  9532    30320
06/09/2012  9705    37462
06/09/2012  10085   31703
06/09/2012  10220   7807
06/09/2012  14573   186
07/09/2012  0   0
07/09/2012  0   0
07/09/2012  2073    28036
07/09/2012  6078    31969
07/09/2012  7569    32941
07/09/2012  8273    30073
07/09/2012  8367    29391
07/09/2012  9292    31927
07/09/2012  9532    30127
07/09/2012  9705    27604
07/09/2012  10085   28108
08/09/2012  2073    28463
10/09/2012  6078    31266
10/09/2012  8239    16390
10/09/2012  8273    31140
10/09/2012  8959    30858
10/09/2012  9532    30794
10/09/2012  9705    28752
11/09/2012  0   0
11/09/2012  0   0
11/09/2012  0   0
11/09/2012  0   0
11/09/2012  0   0
11/09/2012  2073    28159
11/09/2012  6078    36835
11/09/2012  8239    45354
11/09/2012  8273    30922
11/09/2012  8367    31382
11/09/2012  8959    29670
11/09/2012  9292    33582
11/09/2012  9705    29394
11/09/2012  10085   17140
12/09/2012  2073    28283
12/09/2012  6078    31139
12/09/2012  7042    35063
12/09/2012  8273    31075
12/09/2012  8367    29795
12/09/2012  9292    33496
12/09/2012  9532    31669
12/09/2012  9705    26166
12/09/2012  10085   29889
12/09/2012  10220   35656
13/09/2012  2073    28144
13/09/2012  6078    30544
13/09/2012  7097    30866
13/09/2012  8273    30772
13/09/2012  8367    32387
13/09/2012  8959    29307
13/09/2012  9292    32348
13/09/2012  9532    28137
13/09/2012  9705    28823
13/09/2012  10085   31543
13/09/2012  10220   28293
14/09/2012  0   12433
14/09/2012  0   12434
14/09/2012  0   12434
14/09/2012  0   12434
14/09/2012  0   12434
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   12433
14/09/2012  0   0
14/09/2012  0   12433
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   1720
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   384
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   383
14/09/2012  2073    28438
14/09/2012  6078    27255
14/09/2012  8273    29989
14/09/2012  8959    26892
14/09/2012  9292    33202
14/09/2012  9532    30862
14/09/2012  9705    26857
14/09/2012  10085   32657
14/09/2012  10220   27296
15/09/2012  6078    3832
17/09/2012  6078    30004
17/09/2012  7569    30390
17/09/2012  8239    41421
17/09/2012  8273    26337
17/09/2012  8367    31631
17/09/2012  8959    17989
17/09/2012  9292    35703
17/09/2012  9532    36542
17/09/2012  9705    27488
17/09/2012  10085   30849
17/09/2012  10220   32575
18/09/2012  2073    28293
18/09/2012  6078    27450
18/09/2012  7569    30323
18/09/2012  8239    38481
18/09/2012  8273    31154
18/09/2012  8367    27944
18/09/2012  8959    28196
18/09/2012  9292    30844
18/09/2012  9532    33128
18/09/2012  9705    32100
19/09/2012  2073    28227
19/09/2012  6078    32243
19/09/2012  7569    29041
19/09/2012  8239    42791
19/09/2012  8273    30966
19/09/2012  8367    26420
19/09/2012  8959    29394
19/09/2012  9292    14865
19/09/2012  9532    23618
19/09/2012  10085   31614
19/09/2012  10220   8686
20/09/2012  2073    28260
20/09/2012  6078    30446
20/09/2012  7097    34909
20/09/2012  7569    30869
20/09/2012  8273    31079
20/09/2012  8367    30162
20/09/2012  9292    13104
20/09/2012  9532    36614
20/09/2012  9705    35617
20/09/2012  10085   31821
20/09/2012  10220   30055
20/09/2012  14573   468
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   3
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   3
21/09/2012  2073    28308
21/09/2012  6078    33833
21/09/2012  7569    32335
21/09/2012  9292    33824
21/09/2012  9532    33376
21/09/2012  10220   21002
22/09/2012  2073    28402
23/09/2012  2073    28109
24/09/2012  2073    28431
24/09/2012  6078    30027
24/09/2012  7097    31914
24/09/2012  8239    35617
24/09/2012  8273    30670
24/09/2012  8367    29084
24/09/2012  8959    31023
24/09/2012  9292    34394
24/09/2012  9532    31255
24/09/2012  9705    18758
24/09/2012  10085   29290
24/09/2012  10220   33230
25/09/2012  2073    28506
25/09/2012  6078    32043
25/09/2012  7042    34953
25/09/2012  7569    30898
25/09/2012  8239    41297
25/09/2012  8273    31012
25/09/2012  8367    29645
25/09/2012  8959    29904
25/09/2012  9532    37875
25/09/2012  9705    13280
25/09/2012  10085   35023
25/09/2012  10220   31359
26/09/2012  2073    28388
26/09/2012  6078    29765
26/09/2012  7097    31561
26/09/2012  7569    29151
26/09/2012  8239    40369
26/09/2012  8367    28174
26/09/2012  8959    26554
26/09/2012  9292    32104
26/09/2012  9532    33194
26/09/2012  9705    30377
26/09/2012  10085   31503
26/09/2012  10220   28310
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  2073    28491
27/09/2012  6078    31137
27/09/2012  8239    38403
27/09/2012  8273    31117
27/09/2012  8367    28462
27/09/2012  9292    32387
27/09/2012  9532    23023
27/09/2012  9705    32790
27/09/2012  10085   33460
27/09/2012  10220   31782
28/09/2012  0   161
28/09/2012  2073    28381
28/09/2012  7569    32322
28/09/2012  8239    38362
28/09/2012  8273    30533
28/09/2012  8959    17128
28/09/2012  9292    32484
28/09/2012  9532    18586
28/09/2012  9705    27902
29/09/2012  2073    28583
  1. Above is a sample of a dataframe which has a million records
  2. How can I slice or group it by Week or Month and sum seconds column by cost centre.?*
  3. I have read/tried 30 of the articles on this site which appear by doing a search for
    List item pandas, python, groupby, split, dataframe, week with out success.
  4. I am using python 2.7 and pandas 0.9.
  5. I've read the Time Series / Date functionality section in the pandas 0.9 tutorial but couldn't make anything work with a dataframe. I would like to use the features in there such as Business week
  1. 以上是一个包含一百万条记录的数据框示例
  2. 如何按周或月对它进行切片或分组,并按成本中心对秒列求和。?*
  3. 我已经阅读/尝试了本网站上的 30 篇文章,这些文章是通过搜索
    List item pandas、python、groupby、split、dataframe、week 出现的,但都没有成功。
  4. 我正在使用 python 2.7 和 Pandas 0.9。
  5. 我已经阅读了 Pandas 0.9 教程中的时间序列/日期功能部分,但无法使用数据框进行任何操作。我想使用那里的功能,例如商业周

Expected Output

预期产出

DateOccurred CostCentre TimeDifference
2012-03-11            0         500000
2012-03-11         2073         570000
2012-03-18            0         650000
2012-03-18         2073         425000 
2012-03-25            0         378000
2012-04-25         2073         480000

回答by Jon-Eric

Here's a way to take your input (as text) and group it the way you want. The key is to use a dictionary for each grouping (date, then centre).

这是一种将您的输入(作为文本)输入并按照您想要的方式对其进行分组的方法。关键是为每个分组(日期,然后是中心)使用字典。

import collections
import datetime
import functools

def delta_totals_by_date_and_centre(in_file):
    # Use a defaultdict instead of a normal dict so that missing values are
    # automatically created. by_date is a mapping (dict) from a tuple of (year, week)
    # to another mapping (dict) from centre to total delta time.
    by_date = collections.defaultdict(functools.partial(collections.defaultdict, int))

    # For each line in the input...
    for line in in_file:
        # Parse the three fields of each line into date, int ,int.
        date, centre, delta = line.split()
        date = datetime.datetime.strptime(date, "%d/%m/%Y").date()
        centre = int(centre)
        delta = int(delta)

        # Determine the year and week of the year.
        year, week, weekday = date.isocalendar()
        year_and_week = year, week

        # Add the time delta.
        by_date[year_and_week][centre] += delta

    # Yield each result, in order.
    for year_and_week, by_centre in sorted(by_date.items()):
        for centre, delta in sorted(by_centre.items()):
            yield year_and_week, centre, delta

For your sample input, it produces this output (where the first column is year-week_of_the_year).

对于您的示例输入,它会生成此输出(其中第一列是year-week_of_the_year)。

2012-36     0      0
2012-36  2073 141208
2012-36  6078 171481
2012-36  7042  27129
2012-36  7569 124600
2012-36  8239  82153
2012-36  8273 154517
2012-36  8367 113339
2012-36  8959  82770
2012-36  9292 128089
2012-36  9532 137491
2012-36  9705 146321
2012-36 10085 151483
2012-36 10220  87496
2012-36 14573    186
2012-37     0  89522
2012-37  2073 113024
2012-37  6078 160871
2012-37  7042  35063
2012-37  7097  30866
2012-37  8239  61744
2012-37  8273 153898
2012-37  8367  93564
2012-37  8959 116727
2012-37  9292 132628
2012-37  9532 121462
2012-37  9705 139992
2012-37 10085 111229
2012-37 10220  91245
2012-38     0      6
2012-38  2073 169599
2012-38  6078 153976
2012-38  7097  34909
2012-38  7569 152958
2012-38  8239 122693
2012-38  8273 119536
2012-38  8367 116157
2012-38  8959  75579
2012-38  9292 128340
2012-38  9532 163278
2012-38  9705  95205
2012-38 10085  94284
2012-38 10220  92318
2012-38 14573    468
2012-39     0    161
2012-39  2073 170780
2012-39  6078 122972
2012-39  7042  34953
2012-39  7097  63475
2012-39  7569  92371
2012-39  8239 194048
2012-39  8273 123332
2012-39  8367 115365
2012-39  8959 104609
2012-39  9292 131369
2012-39  9532 143933
2012-39  9705 123107
2012-39 10085 129276
2012-39 10220 124681

回答by Garrett

Perhaps group by CostCentre first, then use Series/DataFrame resample()?

也许先按 CostCentre 分组,然后再使用 Series/DataFrame resample()

In [72]: centers = {}

In [73]: for center, idx in df.groupby("CostCentre").groups.iteritems():
   ....:     timediff = df.ix[idx].set_index("Date")['TimeDifference']
   ....:     centers[center] = timediff.resample("W", how=sum)

In [77]: pd.concat(centers, names=['CostCentre'])
Out[77]: 
CostCentre  Date      
0           2012-09-09         0
            2012-09-16     89522
            2012-09-23         6
            2012-09-30       161
2073        2012-09-09    141208
            2012-09-16    113024
            2012-09-23    169599
            2012-09-30    170780
6078        2012-09-09    171481
            2012-09-16    160871
            2012-09-23    153976
            2012-09-30    122972

Additional details:

额外细节:

When parse_datesis Truefor the pd.read_* functions, index_colmust also be set.

parse_datesTrue为pd.read_ *功能,index_col还必须设置。

In [28]: df = pd.read_clipboard(sep=' +', parse_dates=True, index_col=0,
   ....:                        dayfirst=True)

In [30]: df.head()
Out[30]: 
              CostCentre  TimeDifference
DateOccurred                            
2012-09-03          2073           28138
2012-09-03          6078           34844
2012-09-03          8273           31215
2012-09-03          8367           28160
2012-09-03          8959           32037

Since resample() requires a TimeSeries-indexed frame/series, setting the index during creation eliminates the need to set the index for each group individually. GroupBy objects also have an apply method, which is basically syntactic sugar around the "combine" step done with pd.concat() above.

由于 resample() 需要 TimeSeries 索引的帧/系列,因此在创建期间设置索引无需为每个组单独设置索引。GroupBy 对象也有一个 apply 方法,它基本上是上面用 pd.concat() 完成的“组合”步骤周围的语法糖。

In [37]: x = df.groupby("CostCentre").apply(lambda df: 
   ....:         df['TimeDifference'].resample("W", how=sum))

In [38]: x.head(12)
Out[38]: 
CostCentre  DateOccurred
0           2012-09-09           0
            2012-09-16       89522
            2012-09-23           6
            2012-09-30         161
2073        2012-09-09      141208
            2012-09-16      113024
            2012-09-23      169599
            2012-09-30      170780
6078        2012-09-09      171481
            2012-09-16      160871
            2012-09-23      153976
            2012-09-30      122972