
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/44289526/

Time: 2020-09-14 03:42:37  Source: igfitidea

Pandas df.resample with column-specific aggregation function

python pandas

Asked by knub

With pandas.DataFrame.resample I can downsample a DataFrame:

df.resample("3s", how="mean")

This resamples a data frame with a datetime-like index so that all rows falling within each 3-second window are aggregated into one row. The column values are averaged.

Question: I have a data frame with multiple columns. Is it possible to specify a different aggregation function for each column? For example, I want to "sum" column x, "mean" column y, and pick the "last" value for column z. How can I achieve that?

I know I could create a new empty data frame and then call resample three times, but I would prefer a faster in-place solution.
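For reference, the three-call workaround mentioned above might look roughly like the sketch below. The sample data and the names `pieces` and `manual` are assumptions made up for illustration; only the column names x, y, z and the 3-second window come from the question.

```python
import pandas as pd

# Hypothetical sample data: six readings, one second apart.
idx = pd.date_range("2017-01-01", periods=6, freq="s")
df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5, 6],
        "y": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
        "z": [7, 8, 9, 10, 11, 12],
    },
    index=idx,
)

# One resample call per column, each with its own aggregation.
pieces = {
    "x": df["x"].resample("3s").sum(),
    "y": df["y"].resample("3s").mean(),
    "z": df["z"].resample("3s").last(),
}

# Stitch the three Series back together; the dict keys become column names.
manual = pd.concat(pieces, axis=1)
print(manual)
```

This works, but it repeats the binning three times and needs an extra step to reassemble the result, which is the overhead the question hopes to avoid.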

Answered by Scott Boston

You can use .agg after resample. With a dictionary, you can aggregate different columns with different functions.

Try this:


df.resample("3s").agg({'x':'sum','y':'mean','z':'last'})
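To see the one-liner above in action, here is a small self-contained sketch. The data is made up, and the extra column `w` is a hypothetical addition included only to show that columns left out of the dictionary do not appear in the result.

```python
import pandas as pd

# Hypothetical sample data: six readings, one second apart.
idx = pd.date_range("2017-01-01", periods=6, freq="s")
df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5, 6],
        "y": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
        "z": [7, 8, 9, 10, 11, 12],
        "w": [0, 0, 0, 0, 0, 0],  # not listed in the dict below
    },
    index=idx,
)

# Each key names a column, each value names the aggregation for that column.
out = df.resample("3s").agg({"x": "sum", "y": "mean", "z": "last"})
print(out)
# Two rows (00:00:00 and 00:00:03):
#   x -> 6 and 15, y -> 20.0 and 50.0, z -> 9 and 12; column w is absent.
```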

Also, how is deprecated:

C:\Program Files\Anaconda3\lib\site-packages\ipykernel__main__.py:1: FutureWarning: how in .resample() is deprecated the new syntax is .resample(...).mean()

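Following the syntax suggested in that warning, the call from the question can be written with a method on the resampler instead of the how argument. A minimal sketch, assuming a made-up single-column frame:

```python
import pandas as pd

# Hypothetical sample data: six readings, one second apart.
idx = pd.date_range("2017-01-01", periods=6, freq="s")
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# Deprecated form from the question: df.resample("3s", how="mean")
# Form recommended by the warning: call the aggregation on the resampler.
means = df.resample("3s").mean()
print(means)
# x -> 2.0 for the 00:00:00 bin and 5.0 for the 00:00:03 bin
```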

Answered by piRSquared

Consider the dataframe df:

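import numpy as np
import pandas as pd

np.random.seed([3,1415])
# 18 timestamps one second apart (lowercase 's'; uppercase 'S' is deprecated in recent pandas)
tidx = pd.date_range('2017-01-01', periods=18, freq='s')
df = pd.DataFrame(np.random.rand(len(tidx), 3), tidx, list('XYZ'))
print(df)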

                            X         Y         Z
2017-01-01 00:00:00  0.444939  0.407554  0.460148
2017-01-01 00:00:01  0.465239  0.462691  0.016545
2017-01-01 00:00:02  0.850445  0.817744  0.777962
2017-01-01 00:00:03  0.757983  0.934829  0.831104
2017-01-01 00:00:04  0.879891  0.926879  0.721535
2017-01-01 00:00:05  0.117642  0.145906  0.199844
2017-01-01 00:00:06  0.437564  0.100702  0.278735
2017-01-01 00:00:07  0.609862  0.085823  0.836997
2017-01-01 00:00:08  0.739635  0.866059  0.691271
2017-01-01 00:00:09  0.377185  0.225146  0.435280
2017-01-01 00:00:10  0.700900  0.700946  0.796487
2017-01-01 00:00:11  0.018688  0.700566  0.900749
2017-01-01 00:00:12  0.764869  0.253200  0.548054
2017-01-01 00:00:13  0.778883  0.651676  0.136097
2017-01-01 00:00:14  0.544838  0.035073  0.275079
2017-01-01 00:00:15  0.706685  0.713614  0.776050
2017-01-01 00:00:16  0.542329  0.836541  0.538186
2017-01-01 00:00:17  0.185523  0.652151  0.746060

Use agg


df.resample('3s').agg(dict(X='sum', Y='mean', Z='last'))

                            X         Y         Z
2017-01-01 00:00:00  1.760624  0.562663  0.777962
2017-01-01 00:00:03  1.755516  0.669204  0.199844
2017-01-01 00:00:06  1.787061  0.350861  0.691271
2017-01-01 00:00:09  1.096773  0.542220  0.900749
2017-01-01 00:00:12  2.088590  0.313316  0.275079
2017-01-01 00:00:15  1.434538  0.734102  0.746060
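As a sanity check on the table above, the first output row can be reproduced from the first three input rows. The sketch below copies those three rows from the printed dataframe; the names `first_bin`, `row`, and `expected` are assumptions made up for illustration, and the comparison uses a small tolerance because the printed values are rounded to six decimals.

```python
import numpy as np
import pandas as pd

# First three rows copied from the printed dataframe (00:00:00 to 00:00:02).
idx = pd.date_range("2017-01-01", periods=3, freq="s")
first_bin = pd.DataFrame(
    {
        "X": [0.444939, 0.465239, 0.850445],
        "Y": [0.407554, 0.462691, 0.817744],
        "Z": [0.460148, 0.016545, 0.777962],
    },
    index=idx,
)

# All three rows fall into the single bin labelled 00:00:00.
row = first_bin.resample("3s").agg({"X": "sum", "Y": "mean", "Z": "last"}).iloc[0]

# First row of the aggregated output shown above.
expected = [1.760624, 0.562663, 0.777962]
print(np.allclose(row.to_numpy(), expected, atol=1e-5))  # True
```

The bin labelled 00:00:00 therefore covers the rows at 00:00:00, 00:00:01, and 00:00:02: X is their sum, Y their mean, and Z the value from the last of the three.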