Time binning a DataFrame with Pandas

Question

Asked by Josh

I am trying to analyse average daily fluctuations in a measurement "X" over several weeks using pandas dataframes, but timestamps, datetimes and the like are proving particularly hellish to deal with. After a good few hours of trying to work this out, my code is getting messier and messier and I don't think I'm any closer to a solution. I'm hoping someone here can guide me in the right direction.

I have measured X at different times and on different days, saving the daily results to a dataframe which has the form:

    Timestamp(datetime64)         X 

0    2015-10-05 00:01:38          1
1    2015-10-05 06:03:39          4 
2    2015-10-05 13:42:39          3
3    2015-10-05 22:15:39          2

Because the time of each measurement changes from day to day, I decided to use binning to organise the data, and then work out the average and standard deviation for each bin, which I can then plot. My idea was to create a final dataframe with the bins and the average value of X for the measurements in each bin. The 'Observations' column is just there to aid understanding:

        Time Bin       Observations     <X>  

0     00:00-05:59      [ 1 , ...]       2.3
1     06:00-11:59      [ 4 , ...]       4.6
2     12:00-17:59      [ 3 , ...]       8.5
3     18:00-23:59      [ 2 , ...]       3.1

However, I've run into difficulties with incompatibilities between time, datetime, datetime64 and timedelta when binning with pd.cut and groupby. Basically I feel like I'm making stabs in the dark with no idea as to the 'right' way to approach this problem. The only solution I can think of is a row-by-row iteration through the dataframe, but I'd really like to avoid having to do this.

Answer 1

Answered by tnknepp

Whenever I bin time series data by a time range, which seems to be what you are doing here, I just create an "hour of day" column and slice over that. Also, I normally set the index as datetime values...though that is not necessary here.

# assuming your "timestamp" column is labeled ts and has a datetime64 dtype:
df['hod'] = df['ts'].dt.hour  # hour of day, 0-23

# now you can calculate stats for each bin, e.g. the mean of X for 00:00-05:59
ave = df.loc[(df.hod >= 0) & (df.hod < 6), 'X'].mean()
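
The hour-of-day idea can be extended to cover all four bins in one pass. The following is only a minimal sketch, assuming the same `ts` and `X` column names; the sample values, the `labels` list and the `bin` column name are made up for illustration. It uses `pd.cut` on the hour column and then `groupby` to get the mean, standard deviation and count per bin.

```python
import pandas as pd

# Made-up sample data in the shape shown in the question.
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2015-10-05 00:01:38",
        "2015-10-05 06:03:39",
        "2015-10-05 13:42:39",
        "2015-10-05 22:15:39",
        "2015-10-06 01:30:00",
        "2015-10-06 07:45:00",
    ]),
    "X": [1, 4, 3, 2, 3, 6],
})

# Hour of day (0-23), computed without a Python-level loop.
df["hod"] = df["ts"].dt.hour

# Four six-hour bins; right=False makes each interval [start, end).
labels = ["00:00-05:59", "06:00-11:59", "12:00-17:59", "18:00-23:59"]
df["bin"] = pd.cut(df["hod"], bins=[0, 6, 12, 18, 24], right=False, labels=labels)

# Per-bin statistics across all days; std is NaN for bins with one observation.
stats = df.groupby("bin", observed=False)["X"].agg(["mean", "std", "count"])
print(stats)
```

Because the bins depend only on the hour of day, measurements from different dates land in the same bin, which is what an "average day" profile needs.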

I would think there is a method of using df.resample here, but with the poorly defined starting/ending points in your time series I think this may require more attention than the above method.

Is this along the lines of what you were wanting?

Answer 2

Answered by Mathiou

Not sure I have the best answer, but I think it works anyway.
First, I would convert the datetime64 to datetime, using this post for example: Converting between datetime, Timestamp and datetime64

Then, if we assume that your first column holds datetime values and is called TimeStamp, I would do something like this:

import datetime

def bin_f(x):
    # map a datetime to the label of the six-hour bin it falls in
    if x.time() < datetime.time(6):
        return "00:00-05:59"
    elif x.time() < datetime.time(12):
        return "06:00-11:59"
    elif x.time() < datetime.time(18):
        return "12:00-17:59"
    else:
        return "18:00-23:59"

df["Bin"] = df["TimeStamp"].apply(bin_f)
grouped = df.groupby("Bin")
# mean and standard deviation of X for each bin
grouped['X'].agg(['mean', 'std'])

With X being the name of your column.
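
On the conversion step mentioned at the start of this answer: as far as I know, elements taken from a datetime64 column come back as `pd.Timestamp` objects, and `pd.Timestamp` is a subclass of `datetime.datetime`, so a function like `bin_f` may work without an explicit conversion. A small check, with a made-up one-row series:

```python
import datetime
import pandas as pd

# A one-element datetime64 series (value taken from the question's sample).
ts = pd.Series(pd.to_datetime(["2015-10-05 06:03:39"]))

first = ts.iloc[0]                               # a pd.Timestamp
is_datetime = isinstance(first, datetime.datetime)
as_time = first.time()                           # a plain datetime.time

# apply() hands each element to the function as a Timestamp,
# so comparisons against datetime.time values work directly.
applied = ts.apply(lambda x: x.time() < datetime.time(12))

print(type(first).__name__, is_datetime, as_time, applied.iloc[0])
```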

Answer 3

Answered by Eugene

I found Mathiou's response helpful for my purpose, but modified it as follows:

def bin_f(x):
    h = x.hour  # integer hour of day; x.time() returns a datetime.time, which cannot be compared with an int
    if h < 6:
        return "00:00-05:59"
    elif h < 12:
        return "06:00-11:59"
    elif h < 18:
        return "12:00-17:59"
    else:
        return "18:00-23:59"

Answer 4

Answered by wplanutis

You can use the between_time function to obtain this. Remember that this function needs a DatetimeIndex, so you have to set the DataFrame's index to the timestamp column first.
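
A minimal sketch of the `between_time` approach, assuming a `Timestamp` column as in the question. The sample values and the `windows` dictionary are made up for illustration, and the end times assume timestamps with whole-second resolution (both endpoints are inclusive by default).

```python
import pandas as pd

# Made-up sample data in the shape shown in the question.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2015-10-05 00:01:38",
        "2015-10-05 06:03:39",
        "2015-10-05 13:42:39",
        "2015-10-05 22:15:39",
        "2015-10-06 01:30:00",
        "2015-10-06 07:45:00",
    ]),
    "X": [1, 4, 3, 2, 3, 6],
})

# between_time requires a DatetimeIndex.
indexed = df.set_index("Timestamp")

# Hypothetical bin definitions: label -> (start, end) times of day.
windows = {
    "00:00-05:59": ("00:00", "05:59:59"),
    "06:00-11:59": ("06:00", "11:59:59"),
    "12:00-17:59": ("12:00", "17:59:59"),
    "18:00-23:59": ("18:00", "23:59:59"),
}

# Select the rows in each time-of-day window (on any date) and average X.
means = {
    label: indexed.between_time(start, end)["X"].mean()
    for label, (start, end) in windows.items()
}
print(means)
```

This makes one selection per bin, so it is convenient for a handful of windows; for many bins a single grouping operation is likely to be simpler.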
