Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original StackOverflow address: http://stackoverflow.com/questions/33151463/

Time: 2020-09-14 00:02:46  Source: igfitidea

Time binning dataframe with pandas

python pandas

Asked by Josh

I am trying to analyse average daily fluctuations in a measurement "X" over several weeks using pandas dataframes. However, timestamps/datetimes etc. are proving particularly hellish to deal with. Having spent a good few hours trying to work this out, my code is getting messier and messier and I don't think I'm any closer to a solution. I'm hoping someone here can guide me in the right direction.

I have measured X at different times and on different days, saving the daily results to a dataframe which has the form:

    Timestamp(datetime64)         X 

0    2015-10-05 00:01:38          1
1    2015-10-05 06:03:39          4 
2    2015-10-05 13:42:39          3
3    2015-10-05 22:15:39          2

As the time at which the measurement is made changes from day to day, I decided to use binning to organise the data and then work out the average and standard deviation (STD) for each bin, which I can then plot. My idea was to create a final dataframe with the bins and the average value of X for the measurements in each bin. The 'Observations' column is just to aid understanding:

        Time Bin       Observations     <X>  

0     00:00-05:59      [ 1 , ...]       2.3
1     06:00-11:59      [ 4 , ...]       4.6
2     12:00-17:59      [ 3 , ...]       8.5
3     18:00-23:59      [ 2 , ...]       3.1

However, I've run into difficulties with incompatibilities between time, datetime, datetime64 and timedelta when binning with pd.cut and groupby. Basically I feel like I'm making stabs in the dark with no idea as to the 'right' way to approach this problem. The only solution I can think of is a row-by-row iteration through the dataframe, but I'd really like to avoid having to do this.

Answered by tnknepp

Whenever I bin time series data by a time range, which seems to be what you are doing here, I just create an "hour of day" column and slice over that. Also, I normally set the index as datetime values...though that is not necessary here.

# assuming your "timestamp" column is labeled ts and has dtype datetime64:
df['hod'] = df['ts'].dt.hour

# now you can calculate stats for each bin, e.g. the mean of X between 00:00 and 05:59
ave = df.loc[(df.hod >= 0) & (df.hod < 6), 'X'].mean()

I would think there is a method of using df.resample here, but with the poorly defined starting/ending points in your time series I think this may require more attention than the above method.

Is this along the lines of what you were wanting?
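
As a rough sketch of how the "hour of day" idea could be extended to all four bins at once, the hour column can be passed to pd.cut and the result used as a groupby key. The sample rows, the bin labels and the column name "bin" are made up for illustration; only "ts", "X" and "hod" come from the answer above.

```python
import pandas as pd

# Hypothetical sample data; only the first four rows mirror the question.
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2015-10-05 00:01:38",
        "2015-10-05 06:03:39",
        "2015-10-05 13:42:39",
        "2015-10-05 22:15:39",
        "2015-10-06 01:30:00",
        "2015-10-06 07:45:00",
    ]),
    "X": [1, 4, 3, 2, 3, 6],
})

# Hour of day (0-23), taken from the datetime64 column.
df["hod"] = df["ts"].dt.hour

# Left-closed bins: [0, 6), [6, 12), [12, 18), [18, 24).
edges = [0, 6, 12, 18, 24]
labels = ["00:00-05:59", "06:00-11:59", "12:00-17:59", "18:00-23:59"]
df["bin"] = pd.cut(df["hod"], bins=edges, right=False, labels=labels)

# Mean and standard deviation of X per bin.
stats = df.groupby("bin", observed=False)["X"].agg(["mean", "std"])
print(stats)
```

Bins that contain a single observation get NaN for the standard deviation, since pandas uses the sample standard deviation by default.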

Answered by Mathiou

Not sure I have the best answer, but I think it works anyway.
First, I would convert the datetime64 to datetime, for example using this post: Converting between datetime, Timestamp and datetime64
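
For reference, a minimal sketch of that conversion, assuming the values start out as numpy datetime64. The variable names are illustrative only. Note that elements taken out of a datetime64 column are pandas Timestamp objects, which subclass datetime.datetime, so in many cases the explicit conversion may not be needed at all.

```python
import datetime

import numpy as np
import pandas as pd

# A raw numpy datetime64 value, as stored inside a datetime64 column.
raw = np.datetime64("2015-10-05T06:03:39")

# numpy datetime64 -> pandas Timestamp -> standard-library datetime
ts = pd.Timestamp(raw)
py_dt = ts.to_pydatetime()

# Elements pulled out of a datetime64 Series are already pandas Timestamps.
s = pd.Series(pd.to_datetime(["2015-10-05 06:03:39"]))
first = s.iloc[0]

print(type(py_dt).__name__, type(first).__name__)
print(isinstance(first, datetime.datetime))
print(first.time() < datetime.time(12))
```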

Then, if we assume that your first column holds datetime values and is called TimeStamp, I would do something like this:

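import datetime

def bin_f(x):
    # x is a datetime (or pandas Timestamp); compare its time of day to the bin edges
    if x.time() < datetime.time(6):
        return "00:00-05:59"
    elif x.time() < datetime.time(12):
        return "06:00-11:59"
    elif x.time() < datetime.time(18):
        return "12:00-17:59"
    else:
        return "18:00-23:59"

# label every row with its bin, then aggregate X per bin
df["Bin"] = df["TimeStamp"].apply(bin_f)
grouped = df.groupby("Bin")
grouped['X'].agg('std')  # standard deviation of X per bin; use 'mean' for the average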

With X being the name of your column.
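
To see how this approach might produce the table asked for in the question, here is a small end-to-end sketch on made-up data. The rows from 2015-10-06 and the name "summary" are hypothetical, and the mean and standard deviation columns are just one possible layout.

```python
import datetime

import pandas as pd

def bin_f(x):
    """Map a datetime (or pandas Timestamp) to the label of its six-hour bin."""
    t = x.time()
    if t < datetime.time(6):
        return "00:00-05:59"
    elif t < datetime.time(12):
        return "06:00-11:59"
    elif t < datetime.time(18):
        return "12:00-17:59"
    return "18:00-23:59"

# Hypothetical sample data spanning two days.
df = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "2015-10-05 00:01:38", "2015-10-05 06:03:39",
        "2015-10-05 13:42:39", "2015-10-05 22:15:39",
        "2015-10-06 02:10:00", "2015-10-06 08:20:00",
        "2015-10-06 15:00:00", "2015-10-06 21:30:00",
    ]),
    "X": [1, 4, 3, 2, 3, 5, 7, 4],
})

df["Bin"] = df["TimeStamp"].apply(bin_f)

# One row per bin: mean, standard deviation and the raw observations.
summary = df.groupby("Bin")["X"].agg(["mean", "std"])
summary["Observations"] = df.groupby("Bin")["X"].apply(list)
print(summary)
```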

Answered by Eugene

I found Mathiou's response helpful for my purpose, but modified it as follows:

def bin_f(x):
    # use the integer hour (0-23); x.time() returns a datetime.time, which cannot be compared to an int
    h = x.hour
    if h < 6:
        return "00:00-05:59"
    elif h < 12:
        return "06:00-11:59"
    elif h < 18:
        return "12:00-17:59"
    else:
        return "18:00-23:59"

Answered by wplanutis

You can use the between_time function to obtain this. Remember that this function selects rows by the time of day of the index, so you have to set the index to a DatetimeIndex first.
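
A minimal sketch of what this could look like, assuming the column names "TimeStamp" and "X" from the other answers. The sample rows and the "windows" dictionary are made up for illustration. Since between_time includes both endpoints by default, each window here ends just before the next one starts.

```python
import datetime

import pandas as pd

# Hypothetical sample data spanning two days.
df = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "2015-10-05 00:01:38", "2015-10-05 06:03:39",
        "2015-10-05 13:42:39", "2015-10-05 22:15:39",
        "2015-10-06 02:10:00", "2015-10-06 08:20:00",
    ]),
    "X": [1, 4, 3, 2, 3, 5],
})

# between_time needs a DatetimeIndex.
indexed = df.set_index("TimeStamp")

# Start and end time of day for each bin (both endpoints are inclusive).
windows = {
    "00:00-05:59": (datetime.time(0, 0), datetime.time(5, 59, 59, 999999)),
    "06:00-11:59": (datetime.time(6, 0), datetime.time(11, 59, 59, 999999)),
    "12:00-17:59": (datetime.time(12, 0), datetime.time(17, 59, 59, 999999)),
    "18:00-23:59": (datetime.time(18, 0), datetime.time(23, 59, 59, 999999)),
}

# Select each window across all days and compute the mean and std of X.
rows = {
    label: indexed.between_time(start, end)["X"].agg(["mean", "std"])
    for label, (start, end) in windows.items()
}
stats = pd.DataFrame(rows).T
print(stats)
```

This selects one window at a time, so it fits best when only a few time ranges are of interest.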
