使用 Python 的 Pandas 按箱查找平均值
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/24702491/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Using Python's Pandas to find average values by bins
提问by user3830166
I just started using pandas to analyze groundwater well data over time.
我刚刚开始使用 Pandas 来分析一段时间内的地下水井数据。
My data in a text file looks like (site_no, date, well_level):
我在文本文件中的数据看起来像 (site_no, date, well_level):
485438103132901 19800417 -7.1
485438103132901 19800506 -6.8
483622101085001 19790910 -6.7
485438103132901 19790731 -6.2
483845101112801 19801111 -5.37
484123101124601 19801111 -5.3
485438103132901 19770706 -4.98
I would like an output with average well levels binned by 5 year increments and with a count:
我想要一个平均井水位按 5 年递增并带有计数的输出:
site_no avg 1960-end1964 count avg 1965-end1969 count avg 1970-end1974 count
I am reading in the data with:
我正在阅读数据:
names = ['site_no','date','wtr_lvl']
df = pd.read_csv('D:\info.txt', sep='\t',names=names)
I can find the overall average by site with:
我可以通过以下方式找到站点的总体平均值:
avg = df.groupby(['site_no'])['wtr_lvl'].mean().reset_index()
My crude bin attempts use:
我的粗垃圾箱尝试使用:
a1 = df[df.date > 19600000]
a2 = a1[a1.date < 19650000]
avga2 = a2.groupby(['site_no'])['wtr_lvl'].mean()
My question: how can I join the results to display as desired? I tried merge, join, and append, but they do not allow for empty data frames (which happens). Also, I am sure there is a simple way to bin the data by the dates. Thanks.
我的问题:如何加入结果以根据需要显示?我尝试过合并、加入和追加,但它们不允许空数据框(会发生这种情况)。另外,我确信有一种简单的方法可以按日期对数据进行分类。谢谢。
采纳答案by CT Zhu
The most concise way is probably to convert this to a timeserisdata and them downsample to get the means:
最简洁的方法可能是将其转换为timeseris数据,然后对它们进行下采样以获得均值:
In [75]:
print df
ID Level
1
1980-04-17 485438103132901 -7.10
1980-05-06 485438103132901 -6.80
1979-09-10 483622101085001 -6.70
1979-07-31 485438103132901 -6.20
1980-11-11 483845101112801 -5.37
1980-11-11 484123101124601 -5.30
1977-07-06 485438103132901 -4.98
In [76]:
df.Level.resample('60M', how='mean')
#also may consider different time alias: '5A', '5BA', '5AS', etc:
#see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Out[76]:
1
1977-07-31 -4.980
1982-07-31 -6.245
Freq: 60M, Name: Level, dtype: float64
Alternatively, you may use groupbytogether with cut:
或者,您可以groupby与 一起使用cut:
In [99]:
print df.groupby(pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)).mean()
ID Level
[1960, 1965] NaN NaN
(1965, 1970] NaN NaN
(1970, 1975] NaN NaN
(1975, 1980] 4.847632e+14 -6.064286
And by ID also:
并且还通过 ID:
In [100]:
print df.groupby(['ID',
pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)]).mean()
Level
ID
483622101085001 (1975, 1980] -6.70
483845101112801 (1975, 1980] -5.37
484123101124601 (1975, 1980] -5.30
485438103132901 (1975, 1980] -6.27
回答by acushner
so what i like to do is create a separate column with the rounded bin number:
所以我喜欢做的是用四舍五入的 bin 编号创建一个单独的列:
bin_width = 50000
mult = 1. / bin_width
df['bin'] = np.floor(ser * mult + .5) / mult
then, just group by the bins themselves
然后,只需按垃圾箱本身分组
df.groupby('bin').mean()
another note, you can do multiple truth evaluations in one go:
另请注意,您可以一次性进行多项真值评估:
df[(df.date > a) & (df.date < b)]

