pandas 如何使用 groupby 获得熊猫的月均值

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/21030171/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 21:32:19  来源:igfitidea点击:

How get monthly mean in pandas using groupby

pythonpandas

提问by user1345283

I have the next DataFrame:

我有下一个数据帧:

data=pd.read_csv('anual.csv', parse_dates='Fecha', index_col=0)
data

DatetimeIndex: 290 entries, 2011-01-01 00:00:00 to 2011-12-31 00:00:00
Data columns (total 12 columns):
HR             290  non-null values
PreciAcu       290  non-null values
RadSolar       290  non-null values
T              290  non-null values
Presion        290  non-null values
Tmax           290  non-null values
HRmax          290  non-null values
Presionmax     290  non-null values
RadSolarmax    290  non-null values
Tmin           290  non-null values
HRmin          290  non-null values
Presionmin     290  non-null values
dtypes: float64(4), int64(8)

where:

在哪里:

data['HR']

Fecha
2011-01-01    37
2011-02-01    70
2011-03-01    62
2011-04-01    69
2011-05-01    72
2011-06-01    71
2011-07-01    71
2011-08-01    70
2011-09-01    40
...
2011-12-17    92
2011-12-18    78
2011-12-19    79
2011-12-20    76
2011-12-21    78
2011-12-22    80
2011-12-23    72
2011-12-24    70

In addition, some months are not always complete. My goal is to calculate the average of each month from daily data. This is achieved as follows:

此外,有些月份并不总是完整的。我的目标是根据每日数据计算每个月的平均值。这是通过以下方式实现的:

monthly=data.resample('M', how='mean')

                HR   PreciAcu    RadSolar         T        Presion     Tmax  
Fecha                                                                         
2011-01-31  68.586207   3.744828  163.379310  17.496552        0  25.875862   
2011-02-28  68.666667   1.966667  208.000000  18.854167        0  28.879167   
2011-03-31  69.136364   3.495455  218.090909  20.986364        0  30.359091   
2011-04-30  68.956522   1.913043  221.130435  22.165217        0  31.708696   
2011-05-31  72.700000   0.550000  201.100000  18.900000        0  27.460000   
2011-06-30  70.821429   6.050000  214.000000  23.032143        0  30.621429   
2011-07-31  78.034483   5.810345  188.206897  21.503448        0  27.951724   
2011-08-31  71.750000   1.028571  214.750000  22.439286        0  30.657143   
2011-09-30  72.481481   0.185185  196.962963  21.714815        0  29.596296    
2011-10-31  68.083333   1.770833  224.958333  18.683333        0  27.075000   
2011-11-30  71.750000   0.812500  169.625000  18.925000        0  26.237500   
2011-12-31  71.833333   0.160000  159.533333  17.260000        0  25.403333 

The first error I find is in the column of precipitation, since all observations are 0 in January and an average of 3.74 is obtained for this particular month.

我发现的第一个错误是在降水列中,因为一月份的所有观测值都是 0,而这个特定月份的平均值为 3.74。

When averages in Excel and compare them with the results above, there is significant variation. For Example, the mean of HR for Febrero is

当在 Excel 中取平均值并将它们与上述结果进行比较时,存在显着差异。例如,Febrero 的 HR 平均值为

   mean HR using pandas=68.66

   mean HR using excel=67 

Another detail I found:

我发现的另一个细节:

   data['PreciAcu']['2011-01'].count()

   29 and should be 31

Am I doing something wrong? How I can fix this error?

难道我做错了什么?我该如何解决这个错误?

Annex csv file:

附件csv文件:

[link] https://www.dropbox.com/s/p5hl137bqm82j41/anual.csv

[链接] https://www.dropbox.com/s/p5hl137bqm82j41/anual.csv

回答by DSM

Your date column is being misinterpreted, because it's in DD/MM/YYYY format. Set dayfirst=Trueinstead:

您的日期列被误解了,因为它是 DD/MM/YYYY 格式。dayfirst=True改为设置:

>>> df = pd.read_csv('anual.csv', parse_dates='Fecha', dayfirst=True, index_col=0, sep="\s+")
>>> df['PreciAcu']['2011-01'].count()
31
>>> df.resample("M").mean()
                   HR   PreciAcu    RadSolar          T  Presion       Tmax  \
Fecha                                                                         
2011-01-31  68.774194   0.000000  162.354839  16.535484        0  25.393548   
2011-02-28  67.000000   0.000000  193.481481  15.418519        0  25.696296   
2011-03-31  59.083333   0.850000  254.541667  21.295833        0  32.325000   
2011-04-30  61.200000   1.312000  260.640000  24.676000        0  34.760000   
2011-05-31        NaN        NaN         NaN        NaN      NaN        NaN   
2011-06-30  68.428571   8.576190  236.619048  25.009524        0  32.028571   
2011-07-31  81.518519  11.488889  185.407407  22.429630        0  27.681481   
2011-08-31  76.451613   0.677419  219.645161  23.677419        0  30.719355   
2011-09-30  77.533333   2.883333  196.100000  21.573333        0  28.723333   
2011-10-31  73.120000   1.260000  196.280000  19.552000        0  27.636000   
2011-11-30  71.277778 -79.333333  148.555556  18.250000        0  26.511111   
2011-12-31  73.741935   0.067742  134.677419  15.687097        0  24.019355   

                HRmax  Presionmax       Tmin  
Fecha                                         
2011-01-31  92.709677           0  10.909677  
2011-02-28  92.111111           0   8.325926  
2011-03-31  89.291667           0  13.037500  
2011-04-30  89.400000           0  17.328000  
2011-05-31        NaN         NaN        NaN  
2011-06-30  92.095238           0  19.761905  
2011-07-31  97.185185           0  18.774074  
2011-08-31  96.903226           0  18.670968  
2011-09-30  97.200000           0  16.373333  
2011-10-31  97.000000           0  13.412000  
2011-11-30  94.555556           0  11.877778  
2011-12-31  94.161290           0  10.070968  

[12 rows x 9 columns]


(Note, though - I'd forgotten this -- that dayfirst=Trueisn't strict, see here. Maybe using date_parserwould be safer.)

(但请注意 - 我忘记了这一点 - 这dayfirst=True并不严格,请参阅此处。也许使用date_parser会更安全。)