pandas Python - Aggregate by month and calculate average

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow. Original address: http://stackoverflow.com/questions/23859840/

Time: 2020-09-13 22:05:14  Source: igfitidea

Python - Aggregate by month and calculate average

python date csv pandas aggregate

Asked by Jaroslav Klimčík

I have a csv which looks like this:

Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-01-09,0.0
2014-01-10,0.07
2014-01-12,0.0
2014-02-24,0.0
2014-02-25,0.0
2014-02-25,0.0
2014-02-26,0.0
2014-02-28,0.0
2014-03-01,0.1
2014-03-02,-0.5
2014-03-03,0.0
2014-03-08,-0.06
2014-03-11,-0.13
2014-03-22,0.0
2014-03-23,0.33
2014-03-23,0.3
2014-03-25,-0.14
2014-03-28,-0.25
etc

My goal is to aggregate the dates by month and calculate the average for each month. The dates might not start on the 1st or in January. The problem is that I have a lot of data, spanning several years. So I would like to find the earliest date (month) and start counting months and their averages from there. For example:

Month count, average
1, 0.4 (<= the earliest month)
2, -0.3
3, 0.0
...
12, 0.1
13, -0.4 (<= new year but counting of month is continuing)
14, 0.3

I'm using pandas to open the CSV:

import pandas as pd
data = pd.read_csv("pks.csv", sep=",")

So in data['Date'] I have dates and in data['Sentiment'] I have values. Any idea how to do it?

Answered by Karl D.

Probably the simplest approach is to use the resample method. First, when you read in your data, make sure you parse the dates and set the date column as your index (ignore the StringIO part and the header=0 argument; I am reading in your sample data from a multi-line string):

>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO(data), header=0, parse_dates=['Date'],
...                  index_col='Date')
>>> df

            Sentiment
Date
2014-01-03       0.40
2014-01-04      -0.03
2014-01-09       0.00
2014-01-10       0.07
2014-01-12       0.00
2014-02-24       0.00 
2014-02-25       0.00
2014-02-25       0.00
2014-02-26       0.00
2014-02-28       0.00
2014-03-01       0.10
2014-03-02      -0.50
2014-03-03       0.00
2014-03-08      -0.06
2014-03-11      -0.13
2014-03-22       0.00
2014-03-23       0.33
2014-03-23       0.30
2014-03-25      -0.14
2014-03-28      -0.25


>>> df.resample('M').mean()

            Sentiment
2014-01-31      0.088
2014-02-28      0.000
2014-03-31     -0.035
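
For anyone who wants to reproduce the numbers above, here is a minimal, self-contained sketch. The variable names csv_text and monthly are mine, not from the answer. It also passes pd.offsets.MonthEnd() instead of the 'M' string, on the assumption that the offset object behaves the same as the month-end alias while sidestepping the alias rename ('M' to 'ME') in newer pandas releases.

```python
from io import StringIO

import pandas as pd

# Sample data copied from the question.
csv_text = """Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-01-09,0.0
2014-01-10,0.07
2014-01-12,0.0
2014-02-24,0.0
2014-02-25,0.0
2014-02-25,0.0
2014-02-26,0.0
2014-02-28,0.0
2014-03-01,0.1
2014-03-02,-0.5
2014-03-03,0.0
2014-03-08,-0.06
2014-03-11,-0.13
2014-03-22,0.0
2014-03-23,0.33
2014-03-23,0.3
2014-03-25,-0.14
2014-03-28,-0.25
"""

# Parse the dates and use them as the index so resample can bin by time.
df = pd.read_csv(StringIO(csv_text), parse_dates=["Date"], index_col="Date")

# One row per calendar month, labelled with the month's last day.
monthly = df.resample(pd.offsets.MonthEnd()).mean()
print(monthly.round(3))
```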

And if you want a month counter, you can add it after your resample:

>>> agg = df.resample('M').mean()
>>> agg['cnt'] = range(len(agg))
>>> agg

            Sentiment  cnt
2014-01-31      0.088    0
2014-02-28      0.000    1
2014-03-31     -0.035    2
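
The question asks for a counter that starts at 1 and keeps running across a year boundary. A possible sketch of that, using made-up data (the dates and values below are illustrative only, not from the question):

```python
import pandas as pd

# Hypothetical data that crosses a year boundary and skips January 2014.
df = pd.DataFrame(
    {"Sentiment": [0.4, 0.2, -0.1, 0.3, 0.5]},
    index=pd.to_datetime(
        ["2013-11-05", "2013-11-20", "2013-12-15", "2014-02-01", "2014-02-10"]
    ),
)

# resample emits a row for every month between the first and last date,
# so a month without data shows up as NaN instead of disappearing.
agg = df.resample(pd.offsets.MonthEnd()).mean()

# Number the months from 1, starting at the earliest month in the data.
agg["cnt"] = range(1, len(agg) + 1)
print(agg)
```

Because the empty month still gets a row, the counter stays aligned with the calendar: February 2014 is month 4 here, not month 3.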

You can also do this with the groupby method and pd.Grouper, which replaces the TimeGrouper function that has since been removed from pandas (group by month and then call the mean convenience method that is available with groupby).

>>> df.groupby(pd.Grouper(freq='M')).mean()

            Sentiment
2014-01-31      0.088
2014-02-28      0.000
2014-03-31     -0.035
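
If you would rather group directly on a month number counted from the earliest month, one option is to compute that number arithmetically and pass it to groupby. This is only a sketch with invented data; the names first, month_no and month_count are my own.

```python
import pandas as pd

# Hypothetical data with the dates in an ordinary column, as in the question.
data = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2013-12-30", "2014-01-03", "2014-01-04", "2014-03-01", "2015-01-10"]
        ),
        "Sentiment": [0.2, 0.4, -0.2, 0.1, 0.3],
    }
)

# Earliest date in the data; its month becomes month number 1.
first = data["Date"].min()

# Whole months elapsed since the earliest month, plus 1.
month_no = (
    (data["Date"].dt.year - first.year) * 12
    + (data["Date"].dt.month - first.month)
    + 1
)

averages = data.groupby(month_no.rename("month_count"))["Sentiment"].mean()
print(averages)
```

Unlike resample, this leaves out months that have no rows at all, so the counter can have gaps (here it jumps from 4 to 14).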

Answered by pink.slash

To get the monthly average values of a DataFrame that has daily rows and a 'Sentiment' column, I would:

  1. Convert the column with the dates, df['date'], into the index of the DataFrame df: df.set_index('date', inplace=True)
  2. Then convert the index dates into a month index: df.index.month
  3. Finally, calculate the mean of the DataFrame grouped by month: df.groupby(df.index.month).Sentiment.mean()

I'll go slowly through each step here:

Generating a DataFrame with dates and values

  • You first need to import pandas and NumPy, as well as the datetime module:

    import numpy as np
    import pandas as pd
    from datetime import datetime
    
  • Generate a column 'date' between 1/1/2018 and 3/05/2018, at weekly ('W') intervals, and a column 'Sentiment' with random integers from 0 to 99:

    date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
    df = pd.DataFrame(date_rng, columns=['date'])
    df['Sentiment']=np.random.randint(0,100,size=(len(date_rng)))
    
  • df has two columns, 'date' and 'Sentiment':

            date  Sentiment
    0 2018-01-07         34
    1 2018-01-14         32
    2 2018-01-21         15
    3 2018-01-28          0
    4 2018-02-04         95
    5 2018-02-11         53
    6 2018-02-18          7
    7 2018-02-25         35
    8 2018-03-04         17
    

  • Set the 'date' column as the index of the DataFrame:

    df.set_index('date',inplace=True)
    
  • df now has one column, 'Sentiment', and the index is 'date':

                Sentiment
    date                 
    2018-01-07         34
    2018-01-14         32
    2018-01-21         15
    2018-01-28          0
    2018-02-04         95
    2018-02-11         53
    2018-02-18          7
    2018-02-25         35
    2018-03-04         17
    
Capture the month number from the index

    months=df.index.month

Obtain the mean value of each month grouping by month:

    monthly_avg=df.groupby(months).Sentiment.mean()

The mean of the dataset by month, 'monthly_avg', is:

    date
    1    20.25
    2    47.50
    3    17.00
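
One thing to keep in mind: df.index.month only holds the month number (1 to 12), so when the data covers more than one year, rows from the same calendar month in different years end up in one group. The sketch below, with invented values, contrasts that with grouping on a year-month period via to_period('M'); treat it as one possible way to keep the years apart rather than part of the original answer.

```python
import pandas as pd

# Hypothetical data: January appears in both 2018 and 2019.
df = pd.DataFrame(
    {"Sentiment": [10, 20, 30, 50]},
    index=pd.to_datetime(["2018-01-07", "2018-02-04", "2019-01-06", "2019-01-13"]),
)
df.index.name = "date"

# Month number only: both Januaries are averaged together.
by_month_only = df.groupby(df.index.month)["Sentiment"].mean()

# Year-month periods: January 2018 and January 2019 stay separate.
by_year_month = df.groupby(df.index.to_period("M"))["Sentiment"].mean()

print(by_month_only)
print(by_year_month)
```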