pandas Python - Aggregate by month and calculate average
Original question: http://stackoverflow.com/questions/23859840/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Python - Aggregate by month and calculate average
Asked by Jaroslav Klim?ík
I have a csv which looks like this:
Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-01-09,0.0
2014-01-10,0.07
2014-01-12,0.0
2014-02-24,0.0
2014-02-25,0.0
2014-02-25,0.0
2014-02-26,0.0
2014-02-28,0.0
2014-03-01,0.1
2014-03-02,-0.5
2014-03-03,0.0
2014-03-08,-0.06
2014-03-11,-0.13
2014-03-22,0.0
2014-03-23,0.33
2014-03-23,0.3
2014-03-25,-0.14
2014-03-28,-0.25
etc
My goal is to aggregate the data by month and calculate the average for each month. The dates might not start on the 1st or in January. The problem is that I have a lot of data spanning several years, so I would like to find the earliest date (month) and start counting months and their averages from there. For example:
Month count, average
1, 0.4 (<= the earliest month)
2, -0.3
3, 0.0
...
12, 0.1
13, -0.4 (<= new year but counting of month is continuing)
14, 0.3
I'm using pandas to open the csv:
data = pd.read_csv("pks.csv", sep=",")
So in data['Date'] I have the dates and in data['Sentiment'] I have the values. Any idea how to do it?
Answered by Karl D.
Probably the simplest approach is to use the resample method. First, when you read in your data, make sure you parse the dates and set the date column as your index (ignore the StringIO part; I am reading in your sample data from a multi-line string named data):
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO(data), parse_dates=['Date'],
...                  index_col='Date')
>>> df
Sentiment
Date
2014-01-03 0.40
2014-01-04 -0.03
2014-01-09 0.00
2014-01-10 0.07
2014-01-12 0.00
2014-02-24 0.00
2014-02-25 0.00
2014-02-25 0.00
2014-02-26 0.00
2014-02-28 0.00
2014-03-01 0.10
2014-03-02 -0.50
2014-03-03 0.00
2014-03-08 -0.06
2014-03-11 -0.13
2014-03-22 0.00
2014-03-23 0.33
2014-03-23 0.30
2014-03-25 -0.14
2014-03-28 -0.25
>>> df.resample('M').mean()
Sentiment
2014-01-31 0.088
2014-02-28 0.000
2014-03-31 -0.035
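A hedged aside on the frequency alias: recent pandas releases (2.2 and later, to my knowledge) emit a deprecation warning for 'M' and suggest 'ME' instead. If month-start labels are acceptable, the 'MS' alias avoids that difference. A minimal sketch, assuming a shortened copy of the sample data held in a string called csv_text (a name chosen for this example, not from the question):

```python
import pandas as pd
from io import StringIO

# Shortened copy of the question's sample data (csv_text is a hypothetical name).
csv_text = """Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-01-09,0.0
2014-01-10,0.07
2014-01-12,0.0
2014-02-24,0.0
2014-02-25,0.0
"""

df = pd.read_csv(StringIO(csv_text), parse_dates=['Date'], index_col='Date')

# 'MS' bins by calendar month and labels each bin with the first day of that month.
monthly = df.resample('MS').mean()
print(monthly)
```

The averages are the same as with month-end bins; only the index labels move from the last day of the month to the first.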
And if you want a month counter, you can add it after your resample:
>>> agg = df.resample('M').mean()
>>> agg['cnt'] = range(len(agg))
>>> agg
Sentiment cnt
2014-01-31 0.088 0
2014-02-28 0.000 1
2014-03-31 -0.035 2
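The question asks for a counter that starts at 1 and keeps running across years. One way this might be done, sketched under assumptions: the data below is made up for illustration, the 'MS' (month start) alias is used for the bins, and months without any rows are dropped after numbering so that the counter still reflects calendar months.

```python
import pandas as pd

# Made-up data spanning a year boundary, with no rows at all in December 2014.
dates = pd.to_datetime(['2014-11-03', '2014-11-20', '2015-01-05',
                        '2015-02-07', '2015-02-09'])
df = pd.DataFrame({'Sentiment': [0.4, 0.2, -0.4, 0.1, 0.5]}, index=dates)

# resample creates one row per calendar month, including empty ones (NaN mean).
agg = df.resample('MS').mean()

# Number the months from 1, starting at the earliest month in the data.
agg['cnt'] = range(1, len(agg) + 1)

# Optionally drop months that had no data; the counter keeps its calendar meaning.
agg = agg.dropna(subset=['Sentiment'])
print(agg)
```

Because the numbering happens before the empty months are dropped, January 2015 gets 3 rather than 2 here. If you would rather number only the months that contain data, drop the empty rows first and number afterwards.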
You can also do this with the groupby method and pd.Grouper, which replaces the TimeGrouper function that has since been removed from pandas (group by month and then call the mean convenience method that is available with groupby).
>>> df.groupby(pd.Grouper(freq='M')).mean()
Sentiment
2014-01-31 0.088
2014-02-28 0.000
2014-03-31 -0.035
Answered by pink.slash
To get the monthly average of the 'Sentiment' column when the DataFrame has daily rows, I would:
- Convert the column with the dates, df['date'], into the index of the DataFrame df: df.set_index('date', inplace=True)
- Then convert the index dates into a month index: df.index.month
- Finally calculate the mean of the DataFrame grouped by month: df.groupby(df.index.month).Sentiment.mean()
I'll go slowly through each step here:
Generating a DataFrame with dates and values
You first need to import pandas and NumPy, as well as datetime from the datetime module:
import pandas as pd
import numpy as np
from datetime import datetime
Generate a column 'date' between 1/1/2018 and 3/05/2018, at weekly ('W') intervals, and a column 'Sentiment' with random integers from 0 to 99:
date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame(date_rng, columns=['date'])
df['Sentiment'] = np.random.randint(0, 100, size=(len(date_rng)))
The df has two columns, 'date' and 'Sentiment' (the values are random, so yours will differ):
        date  Sentiment
0 2018-01-07         34
1 2018-01-14         32
2 2018-01-21         15
3 2018-01-28          0
4 2018-02-04         95
5 2018-02-11         53
6 2018-02-18          7
7 2018-02-25         35
8 2018-03-04         17
Set the 'date' column as the index of the DataFrame:
df.set_index('date', inplace=True)
df now has one column, 'Sentiment', and the index is 'date':
            Sentiment
date
2018-01-07         34
2018-01-14         32
2018-01-21         15
2018-01-28          0
2018-02-04         95
2018-02-11         53
2018-02-18          7
2018-02-25         35
2018-03-04         17
Capture the month number from the index
months=df.index.month
Obtain the mean value of each month grouping by month:
monthly_avg=df.groupby(months).Sentiment.mean()
The mean of the dataset by month, monthly_avg, is:
date
1 20.25
2 47.50
3 17.00
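One caveat worth checking against your own data: df.index.month only carries the month number, so if the data covers more than one year, every January is pooled into the same group. A small sketch of the difference, using made-up values and assuming that grouping by year and month together is what you want:

```python
import pandas as pd

# Made-up data with January rows in two different years.
dates = pd.to_datetime(['2018-01-07', '2018-01-21', '2018-02-04',
                        '2019-01-06', '2019-01-20'])
df = pd.DataFrame({'Sentiment': [10, 30, 50, 70, 90]}, index=dates)

# Grouping by month number alone mixes January 2018 with January 2019.
pooled = df.groupby(df.index.month).Sentiment.mean()

# Grouping by (year, month) keeps each calendar month separate.
by_year_month = df.groupby([df.index.year, df.index.month]).Sentiment.mean()

print(pooled)
print(by_year_month)
```

Here pooled has two groups (months 1 and 2), while by_year_month has three, one for each calendar month that actually appears in the data.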

