Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/18677271/
Grouping daily data by month in python/pandas and then normalizing
Asked by user7289
I have the table below in a Pandas DataFrame:
q_string q_visits q_date
0 nucleus 1790 2012-10-02 00:00:00
1 neuron 364 2012-10-02 00:00:00
2 current 280 2012-10-02 00:00:00
3 molecular 259 2012-10-02 00:00:00
4 stem 201 2012-10-02 00:00:00
The table contains query volume from a server log, by day. I would like to do 2 things:
- I would like to group queries by month, summing each query's volume over the whole month. E.g. if 'molecular' appeared on 2012-10-02 with volume 1000 and on 2012-10-03 with volume 500, the new table should have a single entry of 1500 (volume) dated 2012-10-31 (a month-end endpoint representing the month; all dates in the transformed table will be month ends representing the whole month to which they relate).
- I want to add a 5th column containing the month-normalized q_visits, i.e. a term's monthly query volume divided by the total query volume for that month across all terms.
What is the best way of doing this?
Accepted answer by Phillip Cloud
If I understand you correctly:
For (1), do this:
Make some fake data by sampling from the query strings you gave, plus some random dates and visit counts:
In [177]: import numpy as np
In [178]: from pandas import Series, DataFrame, Timestamp
In [179]: string = Series(np.random.choice(['nucleus', 'neuron', 'current', 'molecular', 'stem'], size=100), name='string')  # sample from the question's query strings
In [180]: visits = Series(np.random.poisson(1000, size=100), name='visits')  # name fixed: the original had name='date' here by mistake
In [181]: date = Series(np.random.choice([Timestamp('2012-10-02'), Timestamp.now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')
In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})
In [183]: df.head()
Out[183]:
date string visits
0 2001-11-15 00:00:00 current 997
1 2001-11-15 00:00:00 current 974
2 2012-10-02 00:00:00 stem 982
3 2001-12-01 00:00:00 stem 984
4 2001-01-01 00:00:00 current 989
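(If you want the fake data to be reproducible across runs, seed NumPy first, e.g. np.random.seed(0).)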
In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')
In [187]: resamp.head()
Out[187]:
visits
string date
current 2001-01-31 2996
2001-02-28 NaN
2001-03-31 NaN
2001-04-30 NaN
2001-05-31 3016
NaN is there because there were no visits with that query string in those months.
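As an aside, the how= argument to resample was deprecated and later removed; a modern-pandas equivalent (a sketch, assuming the same df as above) is:

resamp = df.set_index('date').groupby('string')['visits'].resample('M').sum()

Note that recent pandas fills empty months with 0 rather than NaN here, since the sum of an empty bin is 0.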
For (2), group by the dates and then divide by the sum:
In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())
In [189]: g.head()
Out[189]:
visits
string date
current 2001-01-31 0.177
2001-02-28 NaN
2001-03-31 NaN
2001-04-30 NaN
2001-05-31 0.188
Just to convince you that (2) is doing what you want:
In [176]: h = g.sortlevel('date').head()  # sortlevel is deprecated in modern pandas; use sort_index(level='date')
In [177]: h
Out[177]:
visits
string date
current 2001-01-31 0.077
molecular 2001-01-31 0.228
neuron 2001-01-31 0.073
nucleus 2001-01-31 0.234
stem 2001-01-31 0.388
In [178]: h.sum()
Out[178]:
visits 1
dtype: float64
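An equivalent, apply-free way to get the same normalization (a sketch, not from the original answer) is to broadcast each month's total with transform:

norm = resamp['visits'] / resamp.groupby(level='date')['visits'].transform('sum')

transform('sum') returns the per-month totals aligned row-for-row with resamp, so the division normalizes in one vectorized step.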
If you want to convert resamp into a DataFrame and remove the NaNs, do:
In [196]: resamp.dropna()
Out[196]:
visits
string date
current 2001-01-31 2996
2001-05-31 3016
2001-11-30 5959
2001-12-31 3998
2013-09-30 1077
molecular 2001-01-31 3984
2001-05-31 1911
2001-11-30 3054
2001-12-31 1020
2012-10-31 977
2013-09-30 1947
neuron 2001-01-31 3961
2001-05-31 2069
2001-11-30 5010
2001-12-31 2065
2012-10-31 6973
2013-09-30 994
nucleus 2001-01-31 3060
2001-05-31 3035
2001-11-30 2924
2001-12-31 4144
2012-10-31 2004
2013-09-30 7881
stem 2001-01-31 2911
2001-05-31 5994
2001-11-30 6072
2001-12-31 4916
2012-10-31 1991
2013-09-30 3977
In [197]: resamp.dropna().reset_index()
Out[197]:
string date visits
0 current 2001-01-31 00:00:00 2996
1 current 2001-05-31 00:00:00 3016
2 current 2001-11-30 00:00:00 5959
3 current 2001-12-31 00:00:00 3998
4 current 2013-09-30 00:00:00 1077
5 molecular 2001-01-31 00:00:00 3984
6 molecular 2001-05-31 00:00:00 1911
7 molecular 2001-11-30 00:00:00 3054
8 molecular 2001-12-31 00:00:00 1020
9 molecular 2012-10-31 00:00:00 977
10 molecular 2013-09-30 00:00:00 1947
11 neuron 2001-01-31 00:00:00 3961
12 neuron 2001-05-31 00:00:00 2069
13 neuron 2001-11-30 00:00:00 5010
14 neuron 2001-12-31 00:00:00 2065
15 neuron 2012-10-31 00:00:00 6973
16 neuron 2013-09-30 00:00:00 994
17 nucleus 2001-01-31 00:00:00 3060
18 nucleus 2001-05-31 00:00:00 3035
19 nucleus 2001-11-30 00:00:00 2924
20 nucleus 2001-12-31 00:00:00 4144
21 nucleus 2012-10-31 00:00:00 2004
22 nucleus 2013-09-30 00:00:00 7881
23 stem 2001-01-31 00:00:00 2911
24 stem 2001-05-31 00:00:00 5994
25 stem 2001-11-30 00:00:00 6072
26 stem 2001-12-31 00:00:00 4916
27 stem 2012-10-31 00:00:00 1991
28 stem 2013-09-30 00:00:00 3977
You can of course do this for g as well:
In [198]: g.dropna()
Out[198]:
visits
string date
current 2001-01-31 0.177
2001-05-31 0.188
2001-11-30 0.259
2001-12-31 0.248
2013-09-30 0.068
molecular 2001-01-31 0.236
2001-05-31 0.119
2001-11-30 0.133
2001-12-31 0.063
2012-10-31 0.082
2013-09-30 0.123
neuron 2001-01-31 0.234
2001-05-31 0.129
2001-11-30 0.218
2001-12-31 0.128
2012-10-31 0.584
2013-09-30 0.063
nucleus 2001-01-31 0.181
2001-05-31 0.189
2001-11-30 0.127
2001-12-31 0.257
2012-10-31 0.168
2013-09-30 0.496
stem 2001-01-31 0.172
2001-05-31 0.374
2001-11-30 0.264
2001-12-31 0.305
2012-10-31 0.167
2013-09-30 0.251
In [199]: g.dropna().reset_index()
Out[199]:
string date visits
0 current 2001-01-31 00:00:00 0.177
1 current 2001-05-31 00:00:00 0.188
2 current 2001-11-30 00:00:00 0.259
3 current 2001-12-31 00:00:00 0.248
4 current 2013-09-30 00:00:00 0.068
5 molecular 2001-01-31 00:00:00 0.236
6 molecular 2001-05-31 00:00:00 0.119
7 molecular 2001-11-30 00:00:00 0.133
8 molecular 2001-12-31 00:00:00 0.063
9 molecular 2012-10-31 00:00:00 0.082
10 molecular 2013-09-30 00:00:00 0.123
11 neuron 2001-01-31 00:00:00 0.234
12 neuron 2001-05-31 00:00:00 0.129
13 neuron 2001-11-30 00:00:00 0.218
14 neuron 2001-12-31 00:00:00 0.128
15 neuron 2012-10-31 00:00:00 0.584
16 neuron 2013-09-30 00:00:00 0.063
17 nucleus 2001-01-31 00:00:00 0.181
18 nucleus 2001-05-31 00:00:00 0.189
19 nucleus 2001-11-30 00:00:00 0.127
20 nucleus 2001-12-31 00:00:00 0.257
21 nucleus 2012-10-31 00:00:00 0.168
22 nucleus 2013-09-30 00:00:00 0.496
23 stem 2001-01-31 00:00:00 0.172
24 stem 2001-05-31 00:00:00 0.374
25 stem 2001-11-30 00:00:00 0.264
26 stem 2001-12-31 00:00:00 0.305
27 stem 2012-10-31 00:00:00 0.167
28 stem 2013-09-30 00:00:00 0.251
Lastly, if you want to put your columns in a different order, use reindex:
In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
Out[210]:
visits string date
0 0.177 current 2001-01-31 00:00:00
1 0.188 current 2001-05-31 00:00:00
2 0.259 current 2001-11-30 00:00:00
3 0.248 current 2001-12-31 00:00:00
4 0.068 current 2013-09-30 00:00:00
5 0.236 molecular 2001-01-31 00:00:00
6 0.119 molecular 2001-05-31 00:00:00
7 0.133 molecular 2001-11-30 00:00:00
8 0.063 molecular 2001-12-31 00:00:00
9 0.082 molecular 2012-10-31 00:00:00
10 0.123 molecular 2013-09-30 00:00:00
11 0.234 neuron 2001-01-31 00:00:00
12 0.129 neuron 2001-05-31 00:00:00
13 0.218 neuron 2001-11-30 00:00:00
14 0.128 neuron 2001-12-31 00:00:00
15 0.584 neuron 2012-10-31 00:00:00
16 0.063 neuron 2013-09-30 00:00:00
17 0.181 nucleus 2001-01-31 00:00:00
18 0.189 nucleus 2001-05-31 00:00:00
19 0.127 nucleus 2001-11-30 00:00:00
20 0.257 nucleus 2001-12-31 00:00:00
21 0.168 nucleus 2012-10-31 00:00:00
22 0.496 nucleus 2013-09-30 00:00:00
23 0.172 stem 2001-01-31 00:00:00
24 0.374 stem 2001-05-31 00:00:00
25 0.264 stem 2001-11-30 00:00:00
26 0.305 stem 2001-12-31 00:00:00
27 0.167 stem 2012-10-31 00:00:00
28 0.251 stem 2013-09-30 00:00:00
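Equivalently, plain column selection does the same reordering without reindex:

g.dropna().reset_index()[['visits', 'string', 'date']]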