Finding the average by bin using Pandas in Python

Question

Asked by user3830166

I just started using pandas to analyze groundwater well data over time.

My data in a text file looks like (site_no, date, well_level):

485438103132901 19800417    -7.1
485438103132901 19800506    -6.8
483622101085001 19790910    -6.7
485438103132901 19790731    -6.2
483845101112801 19801111    -5.37
484123101124601 19801111    -5.3
485438103132901 19770706    -4.98

I would like an output with average well levels binned by 5 year increments and with a count:

site_no   avg 1960-end1964  count    avg 1965-end1969  count    avg 1970-end1974 count

I am reading in the data with:

import pandas as pd

names = ['site_no','date','wtr_lvl']
df = pd.read_csv(r'D:\info.txt', sep='\t', names=names)  # raw string keeps the backslash in the Windows path literal

I can find the overall average by site with:

avg = df.groupby(['site_no'])['wtr_lvl'].mean().reset_index()

My crude bin attempts use:

a1 = df[df.date > 19600000]
a2 = a1[a1.date < 19650000]
avga2 = a2.groupby(['site_no'])['wtr_lvl'].mean()

My question: how can I join the results to display as desired? I tried merge, join, and append, but they do not allow for empty data frames (which happens). Also, I am sure there is a simple way to bin the data by the dates. Thanks.

Answer 1

Accepted answer by CT Zhu

The most concise way is probably to convert this to time series data and then downsample to get the means:

In [75]:

print(df)
                         ID  Level
1                                 
1980-04-17  485438103132901  -7.10
1980-05-06  485438103132901  -6.80
1979-09-10  483622101085001  -6.70
1979-07-31  485438103132901  -6.20
1980-11-11  483845101112801  -5.37
1980-11-11  484123101124601  -5.30
1977-07-06  485438103132901  -4.98
In [76]:

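df.Level.resample('60M').mean()  # written as resample('60M', how='mean') in the original answer; how= was later removed from pandas
# other time aliases may also be worth considering: '5A', '5BA', '5AS', etc.: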
#see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Out[76]:
1
1977-07-31   -4.980
1982-07-31   -6.245
Freq: 60M, Name: Level, dtype: float64
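
The session above starts from a frame that is already indexed by date. A minimal sketch of getting there from the question's file layout might look like the following. The column names ID and Level follow the printed frame above; the io.StringIO buffer is only a stand-in for the real text file and is an assumption for illustration.

```python
import io

import pandas as pd

# Stand-in for the tab-separated text file from the question (assumed layout).
raw = io.StringIO(
    "485438103132901\t19800417\t-7.1\n"
    "485438103132901\t19800506\t-6.8\n"
    "483622101085001\t19790910\t-6.7\n"
    "485438103132901\t19790731\t-6.2\n"
    "483845101112801\t19801111\t-5.37\n"
    "484123101124601\t19801111\t-5.3\n"
    "485438103132901\t19770706\t-4.98\n"
)

df = pd.read_csv(raw, sep="\t", names=["ID", "date", "Level"])

# The dates are YYYYMMDD integers; parse them into real timestamps.
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")

# A DatetimeIndex is what resample() and df.index.year rely on.
df = df.set_index("date").sort_index()
print(df)
```

Once the index is a DatetimeIndex, calls such as resample or df.index.year work as shown in this answer.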

Alternatively, you may use groupby together with cut:

In [99]:

print(df.groupby(pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)).mean())
                        ID     Level
[1960, 1965]           NaN       NaN
(1965, 1970]           NaN       NaN
(1970, 1975]           NaN       NaN
(1975, 1980]  4.847632e+14 -6.064286

And by ID also:

In [100]:

print(df.groupby(['ID', 
                  pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)]).mean())
                              Level
ID                                 
483622101085001 (1975, 1980]  -6.70
483845101112801 (1975, 1980]  -5.37
484123101124601 (1975, 1980]  -5.30
485438103132901 (1975, 1980]  -6.27
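
To get closer to the layout asked for in the question (one row per site, a mean and a count per 5-year period, empty periods included), the same groupby plus cut idea can be extended roughly as below. This is only a sketch: the bin edges, the period labels, and the names years, edges, labels and summary are assumptions chosen for this sample, and the column names are the ones from the question.

```python
import pandas as pd

# Sample rows from the question (site_no, date as YYYYMMDD, water level).
df = pd.DataFrame({
    "site_no": [485438103132901, 485438103132901, 483622101085001,
                485438103132901, 483845101112801, 484123101124601,
                485438103132901],
    "date": [19800417, 19800506, 19790910, 19790731,
             19801111, 19801111, 19770706],
    "wtr_lvl": [-7.1, -6.8, -6.7, -6.2, -5.37, -5.3, -4.98],
})

# Parse the integer dates and pull out the calendar year.
years = pd.to_datetime(df["date"].astype(str), format="%Y%m%d").dt.year

# Left-closed 5-year bins: 1960-1964, 1965-1969, ... (edges assumed for this sample).
edges = list(range(1960, 1990, 5))
labels = [f"{lo}-{lo + 4}" for lo in edges[:-1]]
df["period"] = pd.cut(years, bins=edges, right=False, labels=labels)

# Mean and count per site and period; observed=False keeps empty periods,
# and unstack moves the periods into columns.
summary = (
    df.groupby(["site_no", "period"], observed=False)["wtr_lvl"]
      .agg(["mean", "count"])
      .unstack("period")
)
print(summary)
```

Periods with no readings show up with a NaN mean and a count of 0, so nothing has to be merged or joined by hand.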

Answer 2

Answer by acushner

So what I like to do is create a separate column with the rounded bin number:

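    import numpy as np
    bin_width = 50000  # five years when dates are stored as YYYYMMDD integers
    mult = 1. / bin_width
    ser = df['date']  # assumption: ser is the column being binned, here the integer date
    df['bin'] = np.floor(ser * mult + .5) / mult  # round to the nearest multiple of bin_width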

Then just group by the bins themselves:

    df.groupby('bin').mean()
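
A small worked example of this arithmetic on the question's sample values may help. Rounding to the nearest multiple centres each bin on a multiple of bin_width, so the 1980 bin covers roughly mid-1977 to mid-1982. The floor-division column below is a variation, not part of the answer, for the case where bins should start at 1975, 1980 and so on; the names bin_nearest, bin_floor and by_floor are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": [19800417, 19800506, 19790910, 19790731,
             19801111, 19801111, 19770706],
    "wtr_lvl": [-7.1, -6.8, -6.7, -6.2, -5.37, -5.3, -4.98],
})

bin_width = 50000  # five years when dates are YYYYMMDD integers

# Rounding to the nearest multiple, as in the answer:
# bins are centred on 1975, 1980, ...
mult = 1.0 / bin_width
df["bin_nearest"] = np.floor(df["date"] * mult + 0.5) / mult

# Variation (not from the answer): floor division,
# so bins start at 1975, 1980, ...
df["bin_floor"] = df["date"] // bin_width * bin_width

# Mean and count per floor-based bin.
by_floor = df.groupby("bin_floor")["wtr_lvl"].agg(["mean", "count"])
print(by_floor)
```

With the sample data, the nearest-multiple version puts six of the seven readings in the 1980 bin, while the floor version splits them three and four between 1975 and 1980.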

Another note: you can do multiple truth evaluations in one go:

    df[(df.date > a) & (df.date < b)]
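
As a quick illustration on the question's sample values, with a and b picked arbitrarily for this sketch: the parentheses matter because & binds more tightly than the comparison operators, and a window with no readings simply gives an empty frame rather than an error.

```python
import pandas as pd

df = pd.DataFrame({
    "date": [19800417, 19800506, 19790910, 19790731,
             19801111, 19801111, 19770706],
    "wtr_lvl": [-7.1, -6.8, -6.7, -6.2, -5.37, -5.3, -4.98],
})

# Hypothetical window bounds, chosen only for this example.
a, b = 19750000, 19800000

# Both conditions in one mask; each comparison needs its own parentheses.
window = df[(df.date > a) & (df.date < b)]
print(window)

# A window with no data yields an empty frame.
empty = df[(df.date > 19600000) & (df.date < 19650000)]
print(len(empty))
```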
