How to group dataframe by hour using timestamp with Pandas
Original question: http://stackoverflow.com/questions/49157048/
Notice: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors on Stack Overflow (not to the translator).
Asked by Franco
I have the following dataframe structure that is indexed with a timestamp:
              neg    neu    norm       pol    pos
time
1520353341  0.000  1.000  0.0000  0.000000  0.000
1520353342  0.121  0.879 -0.2960  0.347851  0.000
1520353342  0.217  0.783 -0.6124  0.465833  0.000
I create a date from the timestamp:
data_frame['date'] = [datetime.datetime.fromtimestamp(d) for d in data_frame.time]
Result:
              neg    neu    norm       pol    pos                 date
time
1520353341  0.000  1.000  0.0000  0.000000  0.000  2018-03-06 10:22:21
1520353342  0.121  0.879 -0.2960  0.347851  0.000  2018-03-06 10:22:22
1520353342  0.217  0.783 -0.6124  0.465833  0.000  2018-03-06 10:22:22
I want to group by hour and get the mean of every column. The only exception is the timestamp, which should be the start of the hour that each group covers. This is the result I want to achieve:
                 neg       neu      norm       pol       pos
time
1520352000  0.027989  0.893233  0.122535  0.221079  0.078779
1520355600  0.028861  0.899321  0.103698  0.209353  0.071811
The closest I have gotten so far has been with this answer:
data = data.groupby(data.date.dt.hour).mean()
Results:
           neg       neu      norm       pol       pos
date
0     0.027989  0.893233  0.122535  0.221079  0.078779
1     0.028861  0.899321  0.103698  0.209353  0.071811
But I can't figure out how to keep a timestamp that reflects the hour where each group started.
Answered by Jordi
I came across this gem, pd.DataFrame.resample, after I posted my round-to-hour solution.
import pandas as pd

# Construct example dataframe
times = pd.date_range('1/1/2018', periods=5, freq='25min')
values = [4, 8, 3, 4, 1]
df = pd.DataFrame({'val': values}, index=times)
# Resample by hour and calculate medians
# (pandas 2.2+ deprecates the 'H' alias in favor of lowercase 'h')
df.resample('H').median()
Or you can use groupby with Grouper if you don't want times as index:
df = pd.DataFrame({'val': values, 'times': times})
# 'times' is a column here, so pass it as key= (level= only matches index levels)
df.groupby(pd.Grouper(key='times', freq='H')).median()
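Applied to the question's data, the resample idea might look like the sketch below. It assumes the index holds epoch seconds as in the question. The fourth row and the name `hourly` are made up for illustration, and `pd.offsets.Hour()` is used as the offset-object form of the hourly alias. Since the question asks for the mean, `mean()` stands in for `median()`.

```python
import pandas as pd

# Rows shaped like the question's frame; the last row is invented
# so that a second hourly bin exists.
df = pd.DataFrame(
    {"neg": [0.000, 0.121, 0.217, 0.100], "pos": [0.0, 0.0, 0.0, 0.5]},
    index=pd.Index([1520353341, 1520353342, 1520353342, 1520356000], name="time"),
)

# Interpret the integer index as seconds since the Unix epoch.
df.index = pd.to_datetime(df.index, unit="s")

# One bin per hour, labelled with the start of that hour.
hourly = df.resample(pd.offsets.Hour()).mean()

# Convert the bin labels back to epoch seconds.
hourly.index = (hourly.index - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
hourly.index.name = "time"
print(hourly)
```

Note that resample also emits an all-NaN row for any hour with no data between the first and last bin. Calling `dropna(how="all")` on the result would remove those rows if they are not wanted.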
Answered by Jordi
You can round the timestamp column down to the nearest hour:
import math
df.time = [math.floor(t/3600) * 3600 for t in df.time]
Or even simpler, using integer division:
df.time = [(t//3600) * 3600 for t in df.time]
You can group by this column and thus preserve the timestamp.
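Put together, the rounding and the grouping could be sketched as follows. This assumes `time` is an ordinary column of epoch seconds; the sample values and the name `grouped` are illustrative only.

```python
import pandas as pd

# Illustrative data: the last row is invented so that two hours appear.
df = pd.DataFrame({
    "time": [1520353341, 1520353342, 1520353342, 1520356000],
    "neg": [0.000, 0.121, 0.217, 0.100],
    "neu": [1.000, 0.879, 0.783, 0.900],
})

# Vectorised form of the integer-division rounding: floor to the hour start.
df["time"] = (df["time"] // 3600) * 3600

# Group on the floored timestamp, so the group key is the hour's start.
grouped = df.groupby("time").mean()
print(grouped)
```

If `time` is the index rather than a column, as in the question's printout, the same arithmetic should work on `df.index`, followed by `df.groupby(level=0).mean()`.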
Answered by Connor John
Did you try creating an hour column by:
data_frame['hour'] = data_frame.date.dt.hour
Then grouping by hour like:
data = data_frame.groupby(data_frame.hour).mean()
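One thing to watch with an hour-of-day key: rows from different days that fall in the same clock hour end up in one group, and the key no longer says when the group started. The sketch below contrasts that with flooring the datetime to the hour. The frame and the names `by_hour_of_day` and `by_hour_start` are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

# Illustrative data: two rows on 6 March and one on 7 March, all within hour 10.
frame = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-03-06 10:22:21",
        "2018-03-06 10:40:00",
        "2018-03-07 10:05:00",
    ]),
    "pos": [0.0, 0.2, 0.6],
})

# Hour of day only: both days collapse into the single key 10.
by_hour_of_day = frame.groupby(frame["date"].dt.hour)["pos"].mean()

# Floored datetime: one group per calendar hour, keyed by the hour's start.
by_hour_start = frame.groupby(frame["date"].dt.floor(pd.offsets.Hour()))["pos"].mean()

print(by_hour_of_day)
print(by_hour_start)
```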