Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/18677271/
Grouping daily data by month in python/pandas and then normalizing
Asked by user7289
I have the table below in a Pandas DataFrame:
q_string q_visits q_date
0 nucleus 1790 2012-10-02 00:00:00
1 neuron 364 2012-10-02 00:00:00
2 current 280 2012-10-02 00:00:00
3 molecular 259 2012-10-02 00:00:00
4 stem 201 2012-10-02 00:00:00
The table contains query volume from a server log, by day. I would like to do 2 things:
- I would like to group queries by month, summing each query's volume over the whole month. E.g. if 'molecular' appeared on 2012-10-02 with volume 1000 and on 2012-10-03 with volume 500, the new table should have a single entry of 1500 (volume) dated 2012-10-31 (a month-end endpoint representing the month; all dates in the transformed table will be month ends representing the whole month to which they relate).
- I want to add a 5th column containing the month-normalized q_visits, i.e. a term's monthly query volume divided by the total query volume for that month across all terms.
What is the best way of doing this?
Accepted answer by Phillip Cloud
If I understand you correctly:
For (1), do this:
Make some fake data by sampling from the query strings you gave, plus some random dates and visit counts:
In [177]: import numpy as np
In [178]: from pandas import Series, DataFrame, Timestamp
In [179]: string = Series(np.random.choice(['nucleus', 'neuron', 'current', 'molecular', 'stem'], size=100), name='string')  # sample from the question's query strings
In [180]: visits = Series(np.random.poisson(1000, size=100), name='visits')  # name fixed: the original had name='date' here by mistake
In [181]: date = Series(np.random.choice([Timestamp('2012-10-02'), Timestamp.now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')
In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})
In [183]: df.head()
Out[183]:
date string visits
0 2001-11-15 00:00:00 current 997
1 2001-11-15 00:00:00 current 974
2 2012-10-02 00:00:00 stem 982
3 2001-12-01 00:00:00 stem 984
4 2001-01-01 00:00:00 current 989
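(If you want the fake data to be reproducible across runs, seed NumPy first, e.g. np.random.seed(0).)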
In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')
In [187]: resamp.head()
Out[187]:
visits
string date
current 2001-01-31 2996
2001-02-28 NaN
2001-03-31 NaN
2001-04-30 NaN
2001-05-31 3016
NaN is there because there were no visits with that query string in those months.
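As an aside, the how= argument to resample was deprecated and later removed; a modern-pandas equivalent (a sketch, assuming the same df as above) is:

resamp = df.set_index('date').groupby('string')['visits'].resample('M').sum()

Note that recent pandas fills empty months with 0 rather than NaN here, since the sum of an empty bin is 0.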
For (2), group by the dates and then divide by the sum:
In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())
In [189]: g.head()
Out[189]:
visits
string date
current 2001-01-31 0.177
2001-02-28 NaN
2001-03-31 NaN
2001-04-30 NaN
2001-05-31 0.188
Just to convince you that (2) is doing what you want:
In [176]: h = g.sortlevel('date').head()  # sortlevel is deprecated in modern pandas; use sort_index(level='date')
In [177]: h
Out[177]:
visits
string date
current 2001-01-31 0.077
molecular 2001-01-31 0.228
neuron 2001-01-31 0.073
nucleus 2001-01-31 0.234
stem 2001-01-31 0.388
In [178]: h.sum()
Out[178]:
visits 1
dtype: float64
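An equivalent, apply-free way to get the same normalization (a sketch, not from the original answer) is to broadcast each month's total with transform:

norm = resamp['visits'] / resamp.groupby(level='date')['visits'].transform('sum')

transform('sum') returns the per-month totals aligned row-for-row with resamp, so the division normalizes in one vectorized step.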
If you want to convert resamp into a DataFrame and remove the NaNs, do:
In [196]: resamp.dropna()
Out[196]:
visits
string date
current 2001-01-31 2996
2001-05-31 3016
2001-11-30 5959
2001-12-31 3998
2013-09-30 1077
molecular 2001-01-31 3984
2001-05-31 1911
2001-11-30 3054
2001-12-31 1020
2012-10-31 977
2013-09-30 1947
neuron 2001-01-31 3961
2001-05-31 2069
2001-11-30 5010
2001-12-31 2065
2012-10-31 6973
2013-09-30 994
nucleus 2001-01-31 3060
2001-05-31 3035
2001-11-30 2924
2001-12-31 4144
2012-10-31 2004
2013-09-30 7881
stem 2001-01-31 2911
2001-05-31 5994
2001-11-30 6072
2001-12-31 4916
2012-10-31 1991
2013-09-30 3977
In [197]: resamp.dropna().reset_index()
Out[197]:
string date visits
0 current 2001-01-31 00:00:00 2996
1 current 2001-05-31 00:00:00 3016
2 current 2001-11-30 00:00:00 5959
3 current 2001-12-31 00:00:00 3998
4 current 2013-09-30 00:00:00 1077
5 molecular 2001-01-31 00:00:00 3984
6 molecular 2001-05-31 00:00:00 1911
7 molecular 2001-11-30 00:00:00 3054
8 molecular 2001-12-31 00:00:00 1020
9 molecular 2012-10-31 00:00:00 977
10 molecular 2013-09-30 00:00:00 1947
11 neuron 2001-01-31 00:00:00 3961
12 neuron 2001-05-31 00:00:00 2069
13 neuron 2001-11-30 00:00:00 5010
14 neuron 2001-12-31 00:00:00 2065
15 neuron 2012-10-31 00:00:00 6973
16 neuron 2013-09-30 00:00:00 994
17 nucleus 2001-01-31 00:00:00 3060
18 nucleus 2001-05-31 00:00:00 3035
19 nucleus 2001-11-30 00:00:00 2924
20 nucleus 2001-12-31 00:00:00 4144
21 nucleus 2012-10-31 00:00:00 2004
22 nucleus 2013-09-30 00:00:00 7881
23 stem 2001-01-31 00:00:00 2911
24 stem 2001-05-31 00:00:00 5994
25 stem 2001-11-30 00:00:00 6072
26 stem 2001-12-31 00:00:00 4916
27 stem 2012-10-31 00:00:00 1991
28 stem 2013-09-30 00:00:00 3977
You can of course do this for g as well:
In [198]: g.dropna()
Out[198]:
visits
string date
current 2001-01-31 0.177
2001-05-31 0.188
2001-11-30 0.259
2001-12-31 0.248
2013-09-30 0.068
molecular 2001-01-31 0.236
2001-05-31 0.119
2001-11-30 0.133
2001-12-31 0.063
2012-10-31 0.082
2013-09-30 0.123
neuron 2001-01-31 0.234
2001-05-31 0.129
2001-11-30 0.218
2001-12-31 0.128
2012-10-31 0.584
2013-09-30 0.063
nucleus 2001-01-31 0.181
2001-05-31 0.189
2001-11-30 0.127
2001-12-31 0.257
2012-10-31 0.168
2013-09-30 0.496
stem 2001-01-31 0.172
2001-05-31 0.374
2001-11-30 0.264
2001-12-31 0.305
2012-10-31 0.167
2013-09-30 0.251
In [199]: g.dropna().reset_index()
Out[199]:
string date visits
0 current 2001-01-31 00:00:00 0.177
1 current 2001-05-31 00:00:00 0.188
2 current 2001-11-30 00:00:00 0.259
3 current 2001-12-31 00:00:00 0.248
4 current 2013-09-30 00:00:00 0.068
5 molecular 2001-01-31 00:00:00 0.236
6 molecular 2001-05-31 00:00:00 0.119
7 molecular 2001-11-30 00:00:00 0.133
8 molecular 2001-12-31 00:00:00 0.063
9 molecular 2012-10-31 00:00:00 0.082
10 molecular 2013-09-30 00:00:00 0.123
11 neuron 2001-01-31 00:00:00 0.234
12 neuron 2001-05-31 00:00:00 0.129
13 neuron 2001-11-30 00:00:00 0.218
14 neuron 2001-12-31 00:00:00 0.128
15 neuron 2012-10-31 00:00:00 0.584
16 neuron 2013-09-30 00:00:00 0.063
17 nucleus 2001-01-31 00:00:00 0.181
18 nucleus 2001-05-31 00:00:00 0.189
19 nucleus 2001-11-30 00:00:00 0.127
20 nucleus 2001-12-31 00:00:00 0.257
21 nucleus 2012-10-31 00:00:00 0.168
22 nucleus 2013-09-30 00:00:00 0.496
23 stem 2001-01-31 00:00:00 0.172
24 stem 2001-05-31 00:00:00 0.374
25 stem 2001-11-30 00:00:00 0.264
26 stem 2001-12-31 00:00:00 0.305
27 stem 2012-10-31 00:00:00 0.167
28 stem 2013-09-30 00:00:00 0.251
Lastly, if you want to put your columns in a different order, use reindex:
In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
Out[210]:
visits string date
0 0.177 current 2001-01-31 00:00:00
1 0.188 current 2001-05-31 00:00:00
2 0.259 current 2001-11-30 00:00:00
3 0.248 current 2001-12-31 00:00:00
4 0.068 current 2013-09-30 00:00:00
5 0.236 molecular 2001-01-31 00:00:00
6 0.119 molecular 2001-05-31 00:00:00
7 0.133 molecular 2001-11-30 00:00:00
8 0.063 molecular 2001-12-31 00:00:00
9 0.082 molecular 2012-10-31 00:00:00
10 0.123 molecular 2013-09-30 00:00:00
11 0.234 neuron 2001-01-31 00:00:00
12 0.129 neuron 2001-05-31 00:00:00
13 0.218 neuron 2001-11-30 00:00:00
14 0.128 neuron 2001-12-31 00:00:00
15 0.584 neuron 2012-10-31 00:00:00
16 0.063 neuron 2013-09-30 00:00:00
17 0.181 nucleus 2001-01-31 00:00:00
18 0.189 nucleus 2001-05-31 00:00:00
19 0.127 nucleus 2001-11-30 00:00:00
20 0.257 nucleus 2001-12-31 00:00:00
21 0.168 nucleus 2012-10-31 00:00:00
22 0.496 nucleus 2013-09-30 00:00:00
23 0.172 stem 2001-01-31 00:00:00
24 0.374 stem 2001-05-31 00:00:00
25 0.264 stem 2001-11-30 00:00:00
26 0.305 stem 2001-12-31 00:00:00
27 0.167 stem 2012-10-31 00:00:00
28 0.251 stem 2013-09-30 00:00:00
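Equivalently, plain column selection does the same reordering without reindex:

g.dropna().reset_index()[['visits', 'string', 'date']]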