Python pandas: group by month and year
Original URL: http://stackoverflow.com/questions/26646191/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas groupby month and year
Asked by darkpool
I have the following dataframe:
Date abc xyz
01-Jun-13 100 200
03-Jun-13 -20 50
15-Aug-13 40 -5
20-Jan-14 25 15
21-Feb-14 60 80
I need to group the data by year and month, i.e. group by Jan 2013, Feb 2013, Mar 2013, etc. I will be using the newly grouped data to create a plot showing abc vs xyz per year/month.
I've tried various combinations of groupby and sum but just can't seem to get anything to work.
Thank you for any assistance.
Accepted answer by Andy Hayden
You can use either resample or Grouper (which resamples under the hood).
First make sure that the datetime column actually holds datetimes (hit it with pd.to_datetime). It's easier if it's a DatetimeIndex:
In [11]: df1
Out[11]:
abc xyz
Date
2013-06-01 100 200
2013-06-03 -20 50
2013-08-15 40 -5
2014-01-20 25 15
2014-02-21 60 80
In [12]: g = df1.groupby(pd.Grouper(freq="M"))  # DataFrameGroupBy (grouped by Month)
In [13]: g.sum()
Out[13]:
abc xyz
Date
2013-06-30 80 250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
In [14]: df1.resample("M").sum()  # the same
Out[14]:
abc xyz
Date
2013-06-30 80 250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
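As a rough end-to-end sketch of the above (assuming the question's data, English month abbreviations in the locale, and the month-start alias "MS" instead of "M", since the month-end alias was renamed in newer pandas releases), both routes give the same monthly sums:

```python
import pandas as pd

# The question's data, with Date still stored as strings.
df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# Parse the strings into real datetimes and use them as the index.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")
df1 = df.set_index("Date")

# "MS" labels each bin by the first day of the month (an assumption made here
# so the sketch does not depend on the month-end alias name).
by_grouper = df1.groupby(pd.Grouper(freq="MS")).sum()
by_resample = df1.resample("MS").sum()

print(by_grouper)
```

Months with no rows still appear in the result; in recent pandas versions their sums are shown as 0 rather than NaN.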
Note: Previously pd.Grouper(freq="M") was written as pd.TimeGrouper("M"). The latter has been deprecated since 0.21.
I had thought the following would work, but it doesn't (due to as_index not being respected? I'm not sure). I'm including this for interest's sake.
If it's a column (it has to be a datetime64 column! As I say, hit it with to_datetime), you can use the PeriodIndex:
In [21]: df
Out[21]:
Date abc xyz
0 2013-06-01 100 200
1 2013-06-03 -20 50
2 2013-08-15 40 -5
3 2014-01-20 25 15
4 2014-02-21 60 80
In [22]: pd.DatetimeIndex(df.Date).to_period("M") # old way
Out[22]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-06, ..., 2014-02]
Length: 5, Freq: M
In [23]: per = df.Date.dt.to_period("M") # new way to get the same
In [24]: g = df.groupby(per)
In [25]: g.sum() # dang not quite what we want (doesn't fill in the gaps)
Out[25]:
abc xyz
2013-06 80 250
2013-08 40 -5
2014-01 25 15
2014-02 60 80
To get the desired result we have to reindex...
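One possible way to do that reindexing, as a sketch only: build the full monthly period range between the first and last group and reindex against it. The names sums, full_range and filled are mine, and the value columns are selected explicitly so the Date column is not summed.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2013-06-01", "2013-06-03", "2013-08-15", "2014-01-20", "2014-02-21"]
    ),
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# Group by monthly period; only the months that occur in the data show up.
per = df["Date"].dt.to_period("M")
sums = df.groupby(per)[["abc", "xyz"]].sum()

# Every month from the first to the last group, including the empty ones.
full_range = pd.period_range(sums.index.min(), sums.index.max(), freq="M")

# Reindexing inserts the missing months as rows of NaN.
filled = sums.reindex(full_range)
print(filled)
```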
Answer by darkpool
There are different ways to do that.
- I created the data frame to showcase the different techniques to filter your data.
import pandas as pd
df = pd.DataFrame({'Date': ['01-Jun-13', '03-Jun-13', '15-Aug-13', '20-Jan-14', '21-Feb-14'], 'abc': [100, -20, 40, 25, 60], 'xyz': [200, 50, -5, 15, 80]})
- I separated month/year/day and also built a combined month-year value, as you explained.
def getMonth(s):
    return s.split("-")[1]
def getDay(s):
    return s.split("-")[0]
def getYear(s):
    return s.split("-")[2]
def getYearMonth(s):
    return s.split("-")[1] + "-" + s.split("-")[2]
- I created new columns: year, month, day and 'YearMonth'. In your case, you need only one of the two options: you can group using the two columns 'year', 'month' or using the single column 'YearMonth'.
df['year'] = df['Date'].apply(getYear)
df['month'] = df['Date'].apply(getMonth)
df['day'] = df['Date'].apply(getDay)
df['YearMonth'] = df['Date'].apply(getYearMonth)
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
3 20-Jan-14 25 15 14 Jan 20 Jan-14
4 21-Feb-14 60 80 14 Feb 21 Feb-14
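The same four columns can probably be built without the helper functions. This is only a sketch, assuming every Date string has the day-month-year form shown above; parts is a name I introduced:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# Split "01-Jun-13" into three string columns: 0 = day, 1 = month, 2 = year.
parts = df["Date"].str.split("-", expand=True)

df["year"] = parts[2]
df["month"] = parts[1]
df["day"] = parts[0]
df["YearMonth"] = parts[1] + "-" + parts[2]

print(df)
```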
- You can go through the different groups by iterating over groupby(..).
In this case, we are grouping by two columns:
for key, g in df.groupby(['year', 'month']):
    print(key, g)
Output:
('13', 'Jun') Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
('13', 'Aug') Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
('14', 'Jan') Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
('14', 'Feb') Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
In this case, we are grouping by one column:
for key, g in df.groupby('YearMonth'):
    print(key, g)
Output:
Jun-13 Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
Aug-13 Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
Jan-14 Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
Feb-14 Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
- If you want to access a specific group, you can use get_group:
print(df.groupby('YearMonth').get_group('Jun-13'))
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
- Similar to get_group, this hack filters the rows directly and gives the same result:
print(df[df['YearMonth'] == 'Jun-13'])
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
You can select the list of abc or xyz values for Jun-13:
print(df[df['YearMonth'] == 'Jun-13'].abc.values)
print(df[df['YearMonth'] == 'Jun-13'].xyz.values)
Output:
[100 -20] #abc values
[200 50] #xyz values
You can use this to go through the dates that you have classified as "year-month" and apply criteria to them to get the related data.
for x in set(df.YearMonth):
    print(df[df['YearMonth'] == x].abc.values)
    print(df[df['YearMonth'] == x].xyz.values)
I also recommend checking this answer.
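If the goal is a total per year-month rather than the raw values, the filtering loop can likely be replaced by a single groupby. A sketch, assuming the YearMonth column built above; sort=False is my choice to keep the groups in order of first appearance instead of alphabetical order:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# Same YearMonth key as in the answer, e.g. "Jun-13".
df["YearMonth"] = df["Date"].apply(lambda s: s.split("-")[1] + "-" + s.split("-")[2])

# One row per YearMonth with the summed abc and xyz values.
totals = df.groupby("YearMonth", sort=False)[["abc", "xyz"]].sum()
print(totals)
```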
Answer by Q-man
Why not keep it simple?!
GB = DF.groupby([DF.index.year, DF.index.month]).sum()
giving you,
print(GB)
abc xyz
2013 6 80 250
8 40 -5
2014 1 25 15
2 60 80
and then you can plot as asked using,
GB.plot('abc','xyz',kind='scatter')
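For completeness, a self-contained sketch of this approach, assuming the question's data with Date parsed into a DatetimeIndex. The level names year and month are labels I added, and the plot call is left as a comment because it needs matplotlib:

```python
import pandas as pd

DF = pd.DataFrame(
    {"abc": [100, -20, 40, 25, 60], "xyz": [200, 50, -5, 15, 80]},
    index=pd.to_datetime(
        ["2013-06-01", "2013-06-03", "2013-08-15", "2014-01-20", "2014-02-21"]
    ),
)

# Group on the year and month of the index; the result has a two-level index.
GB = DF.groupby([DF.index.year, DF.index.month]).sum()
GB = GB.rename_axis(["year", "month"])  # hypothetical level names, for readability

print(GB)
# GB.plot('abc', 'xyz', kind='scatter')  # requires matplotlib
```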
Answer by tsando
You can also do it by creating a string column with the year and month as follows:
df['date'] = df.index
df['year-month'] = df['date'].apply(lambda x: str(x.year) + ' ' + str(x.month))
grouped = df.groupby('year-month')
However, this doesn't preserve the chronological order when you loop over the groups, because the keys are sorted as strings, e.g.
for name, group in grouped:
    print(name)
Will give:
2007 11
2007 12
2008 1
2008 10
2008 11
2008 12
2008 2
2008 3
2008 4
2008 5
2008 6
2008 7
2008 8
2008 9
2009 1
2009 10
So then, if you want to preserve the order, you must do as suggested by @Q-man above:
grouped = df.groupby([df.index.year, df.index.month])
This will preserve the order in the above loop:
(2007, 11)
(2007, 12)
(2008, 1)
(2008, 2)
(2008, 3)
(2008, 4)
(2008, 5)
(2008, 6)
(2008, 7)
(2008, 8)
(2008, 9)
(2008, 10)
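If a single string key is still preferred, a zero-padded year-month string should also sort chronologically. This is a sketch on made-up monthly data (the date range and the value column are assumptions, not the data from this answer):

```python
import pandas as pd

# Hypothetical data: one row per month from 2007-11 to 2009-01.
idx = pd.date_range("2007-11-01", periods=15, freq="MS")
df = pd.DataFrame({"value": range(15)}, index=idx)

# "%Y-%m" pads the month to two digits, so "2008-02" sorts before "2008-10".
key = df.index.strftime("%Y-%m")
grouped = df.groupby(key)

names = [name for name, group in grouped]
print(names)
```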

