pandas Python - Aggregate by month and calculate average

Question

Asked by Jaroslav Klimčík

I have a csv which looks like this:

Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-01-09,0.0
2014-01-10,0.07
2014-01-12,0.0
2014-02-24,0.0
2014-02-25,0.0
2014-02-25,0.0
2014-02-26,0.0
2014-02-28,0.0
2014-03-01,0.1
2014-03-02,-0.5
2014-03-03,0.0
2014-03-08,-0.06
2014-03-11,-0.13
2014-03-22,0.0
2014-03-23,0.33
2014-03-23,0.3
2014-03-25,-0.14
2014-03-28,-0.25
etc

My goal is to aggregate the dates by month and calculate the average for each month. The dates might not start on the 1st or in January. The problem is that I have a lot of data spanning several years. So I would like to find the earliest date (month) and start counting months, and their averages, from there. For example:

Month count, average
1, 0.4 (<= the earliest month)
2, -0.3
3, 0.0
...
12, 0.1
13, -0.4 (<= new year but counting of month is continuing)
14, 0.3

I'm using pandas to read the CSV:

data = pd.read_csv("pks.csv", sep=",")

So data['Date'] holds the dates and data['Sentiment'] holds the values. Any idea how to do this?

Answer 1

Answered by Karl D.

Probably the simplest approach is to use the resample method. First, when you read in your data, make sure you parse the dates and set the date column as your index (ignore the StringIO part and header=0; I am reading your sample data in from a multi-line string):

>>> df = pd.read_csv(StringIO(data), header=0, parse_dates=['Date'],
                     index_col='Date')
>>> df

            Sentiment
Date
2014-01-03       0.40
2014-01-04      -0.03
2014-01-09       0.00
2014-01-10       0.07
2014-01-12       0.00
2014-02-24       0.00 
2014-02-25       0.00
2014-02-25       0.00
2014-02-26       0.00
2014-02-28       0.00
2014-03-01       0.10
2014-03-02      -0.50
2014-03-03       0.00
2014-03-08      -0.06
2014-03-11      -0.13
2014-03-22       0.00
2014-03-23       0.33
2014-03-23       0.30
2014-03-25      -0.14
2014-03-28      -0.25


>>> df.resample('M').mean()

            Sentiment
2014-01-31      0.088
2014-02-28      0.000
2014-03-31     -0.035
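
As a version-tolerant variation on the same idea, here is a minimal sketch that groups on calendar-month periods instead of calling resample. Recent pandas releases warn about the 'M' alias in resample and suggest 'ME', whereas to_period("M") is not affected. The shortened data and the names csv_text and monthly are assumptions made for this illustration.

```python
import io
import pandas as pd

# A few rows taken from the question's sample data (shortened for illustration).
csv_text = """Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-02-24,0.0
2014-02-25,0.0
2014-03-01,0.1
2014-03-02,-0.5
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

# Map every timestamp to its calendar month (a Period such as 2014-01),
# then average the rows that fall in the same month.
monthly = df.groupby(df.index.to_period("M"))["Sentiment"].mean()
print(monthly)
```

Unlike resample, this only produces rows for months that actually contain data.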

And if you want a month counter, you can add it after your resample:

>>> agg = df.resample('M').mean()
>>> agg['cnt'] = range(len(agg))
>>> agg

            Sentiment  cnt
2014-01-31      0.088    0
2014-02-28      0.000    1
2014-03-31     -0.035    2
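
The question asks for a counter that starts at 1 in the earliest month and keeps increasing across years. One possible sketch computes the number of months elapsed since the earliest date directly from the year and month. The data below is hypothetical (it is not the asker's file), and month_count, first and result are names chosen for this example.

```python
import io
import pandas as pd

# Hypothetical data that starts in November and crosses a year boundary.
csv_text = """Date,Sentiment
2013-11-20,0.2
2013-12-05,0.4
2014-01-03,-0.1
2014-01-09,0.3
2014-03-01,0.5
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Months elapsed since the earliest date, counted from 1.
# The counter does not reset in January.
first = df["Date"].min()
df["month_count"] = (
    (df["Date"].dt.year - first.year) * 12
    + (df["Date"].dt.month - first.month)
    + 1
)

result = df.groupby("month_count")["Sentiment"].mean()
print(result)
```

Here January 2014 becomes month 3 and March 2014 becomes month 5. February 2014 has no rows, so month 4 is simply absent from the result.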

You can also do this with the groupby method and pd.Grouper (group by month, then call the mean convenience method available on groupby). Older pandas versions exposed this as pd.TimeGrouper, which has since been removed:

>>> df.groupby(pd.Grouper(freq='M')).mean()

            Sentiment
2014-01-31      0.088
2014-02-28      0.000
2014-03-31     -0.035
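
If the dates are kept as an ordinary column rather than the index, a grouper can also be pointed at that column with key=. The sketch below assumes that setup and uses freq="MS", so each group is labelled with the first day of its month instead of the last. The shortened data and the name by_month are assumptions for illustration.

```python
import io
import pandas as pd

# A few rows taken from the question's sample data (shortened for illustration).
csv_text = """Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-02-24,0.0
2014-03-01,0.1
2014-03-02,-0.5
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# key= names the datetime column to bin on;
# freq="MS" creates one bin per month, labelled by the month's first day.
by_month = df.groupby(pd.Grouper(key="Date", freq="MS"))["Sentiment"].mean()
print(by_month)
```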

Answer 2

Answered by pink.slash

To get the monthly average of the 'Sentiment' column when the DataFrame has daily rows, I would:

Convert the column with the dates, df['date'], into the index of the DataFrame df: df.set_index('date', inplace=True)
Then read the month number from the index: df.index.month
Finally, calculate the mean of the DataFrame grouped by month: df.groupby(df.index.month).Sentiment.mean()
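
One caveat with the steps above: df.index.month is only the month number (1 to 12), so rows from the same month of different years end up in one group. A small sketch with hypothetical two-year data shows the difference and one way around it, which is to group on year and month together. The names merged and separate are chosen for this example.

```python
import pandas as pd

# Hypothetical data spanning two years.
df = pd.DataFrame(
    {"Sentiment": [10, 20, 30, 50]},
    index=pd.to_datetime(["2018-01-07", "2018-02-04", "2019-01-06", "2019-02-03"]),
)

# Month number alone: January 2018 and January 2019 share one group.
merged = df.groupby(df.index.month)["Sentiment"].mean()

# Year and month together: each calendar month gets its own group.
separate = df.groupby([df.index.year, df.index.month])["Sentiment"].mean()

print(merged)
print(separate)
```

With this data, merged has two rows (January and February) while separate has four, one per calendar month.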

I'll go slowly through each step here:

Generating a DataFrame with dates and values

First you need to import pandas and NumPy, as well as the datetime module:
```
import numpy as np
import pandas as pd
from datetime import datetime
```

Generate a column 'date' between 1/1/2018 and 3/05/2018 at weekly ('W') intervals, and a column 'Sentiment' with random integers from 0 to 99:

date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame(date_rng, columns=['date'])
df['Sentiment']=np.random.randint(0,100,size=(len(date_rng)))

df now has two columns, 'date' and 'Sentiment':

        date  Sentiment
0 2018-01-07         34
1 2018-01-14         32
2 2018-01-21         15
3 2018-01-28          0
4 2018-02-04         95
5 2018-02-11         53
6 2018-02-18          7
7 2018-02-25         35
8 2018-03-04         17

Set the `'date'` column as the index of the DataFrame:

df.set_index('date',inplace=True)

df now has one column, 'Sentiment', and its index is 'date':

            Sentiment
date                 
2018-01-07         34
2018-01-14         32
2018-01-21         15
2018-01-28          0
2018-02-04         95
2018-02-11         53
2018-02-18          7
2018-02-25         35
2018-03-04         17

Capture the month number from the index

    months=df.index.month

Obtain the mean value of each month grouping by month:

    monthly_avg=df.groupby(months).Sentiment.mean()

The mean of the dataset by month, `monthly_avg`, is:
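
The resulting output is not shown here, and since the values were generated randomly a fresh run will differ. As a reproducible illustration, the sketch below hard-codes the nine values printed earlier in place of np.random.randint and repeats the same grouping. Only the hard-coding is an assumption; the dates and values are the ones shown above.

```python
import pandas as pd

# The weekly dates and the random values exactly as printed earlier.
dates = pd.to_datetime([
    "2018-01-07", "2018-01-14", "2018-01-21", "2018-01-28",
    "2018-02-04", "2018-02-11", "2018-02-18", "2018-02-25",
    "2018-03-04",
])
df = pd.DataFrame({"Sentiment": [34, 32, 15, 0, 95, 53, 7, 35, 17]}, index=dates)
df.index.name = "date"

# Same grouping as above: mean of 'Sentiment' per month number.
monthly_avg = df.groupby(df.index.month).Sentiment.mean()
print(monthly_avg)
# Month 1: 20.25, month 2: 47.5, month 3: 17.0
```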
