Pandas group by time windows

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/22769047/

Pandas group by time windows

python, pandas

Question by Kafonek

EDIT: Session generation from log file analysis with pandas seems to be exactly what I was looking for.

I have a dataframe that includes non-unique time stamps, and I'd like to group them by time windows. The basic logic would be -

1) Create a time range from each time stamp by adding n minutes before and after the time stamp.

2) Group by time ranges that overlap. The end effect would be that a time window could be as small as a single time stamp +/- the time buffer, but there is no cap on how large a time window could be, as long as consecutive events are closer together than the time buffer.

It feels like a df.groupby(pd.TimeGrouper(minutes=n)) is the right answer, but I don't know how to have the TimeGrouper create dynamic time ranges when it sees events that are within a time buffer.

For instance, if I try a TimeGrouper('20s') against a set of events: 10:34:00, 10:34:08, 10:34:08, 10:34:15, 10:34:28 and 10:34:54, then pandas will give me three groups (events falling between 10:34:00 - 10:34:20, 10:34:20 - 10:34:40, and 10:34:40-10:35:00). I would like to just get two groups back, 10:34:00 - 10:34:28, since there is no more than a 20 second gap between events in that time range, and a second group that is 10:34:54.

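For reference, here is a minimal sketch that reproduces those static bins (pd.TimeGrouper has since been removed from pandas; pd.Grouper(freq=...) is the current equivalent):

import pandas as pd

events = pd.Series(range(6), index=pd.to_datetime([
    '2013-01-01 10:34:00', '2013-01-01 10:34:08', '2013-01-01 10:34:08',
    '2013-01-01 10:34:15', '2013-01-01 10:34:28', '2013-01-01 10:34:54']))

# bins start on fixed 20-second boundaries (10:34:00, 10:34:20, 10:34:40),
# so these six events land in three groups no matter how close together they are
print(events.groupby(pd.Grouper(freq='20s')).size())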

What is the best way to find temporal windows that are not static bins of time ranges?

Given a Series that looks something like -

      time
0     2013-01-01 10:34:00+00:00
1     2013-01-01 10:34:12+00:00
2     2013-01-01 10:34:28+00:00
3     2013-01-01 10:34:54+00:00
4     2013-01-01 10:34:55+00:00
5     2013-01-01 10:35:19+00:00
6     2013-01-01 10:35:30+00:00

If I do a df.groupby(pd.TimeGrouper('20s')) on that Series, I would get back 5 groups, 10:34:00-:20, :20-:40, :40-10:35:00, etc. What I want to do is have some function that creates elastic time ranges: as long as events are within 20 seconds of each other, keep expanding the time range. So I expect to get back -

2013-01-01 10:34:00 - 2013-01-01 10:34:48 
    0 2013-01-01 10:34:00+00:00
    1 2013-01-01 10:34:12+00:00
    2 2013-01-01 10:34:28+00:00

2013-01-01 10:34:54 - 2013-01-01 10:35:15
    3 2013-01-01 10:34:54+00:00
    4 2013-01-01 10:34:55+00:00

2013-01-01 10:35:19 - 2013-01-01 10:35:50
    5 2013-01-01 10:35:19+00:00
    6 2013-01-01 10:35:30+00:00

Thanks.

Answer by Jeff

This is how to create a custom grouper. (It requires pandas >= 0.13 for the timedelta computations, but would otherwise work in other versions.)

Create your series

In [31]: s = pd.Series(range(8), pd.to_datetime(['20130101 10:34','20130101 10:34:08', '20130101 10:34:08', '20130101 10:34:15', '20130101 10:34:28', '20130101 10:34:54','20130101 10:34:55','20130101 10:35:12']))

In [32]: s
Out[32]: 
2013-01-01 10:34:00    0
2013-01-01 10:34:08    1
2013-01-01 10:34:08    2
2013-01-01 10:34:15    3
2013-01-01 10:34:28    4
2013-01-01 10:34:54    5
2013-01-01 10:34:55    6
2013-01-01 10:35:12    7
dtype: int64

This just computes the time difference in seconds between successive elements, but could actually be anything

In [33]: indexer = s.index.to_series().order().diff().fillna(0).astype('timedelta64[s]')

In [34]: indexer
Out[34]: 
2013-01-01 10:34:00     0
2013-01-01 10:34:08     8
2013-01-01 10:34:08     0
2013-01-01 10:34:15     7
2013-01-01 10:34:28    13
2013-01-01 10:34:54    26
2013-01-01 10:34:55     1
2013-01-01 10:35:12    17
dtype: float64
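
In newer pandas versions Series.order() no longer exists; a roughly equivalent computation (not part of the original answer, assuming pandas >= 1.0) would be:

indexer = s.index.to_series().sort_values().diff().fillna(pd.Timedelta(0)).dt.total_seconds()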

Arbitrarily assign rows where the diff is < 20s to group 0, else to group 1. This could also be more elaborate: if the diff from the previous row is < 20s BUT the total diff (from the first row) is > 50s, put it in group 2.

In [35]: grouper = indexer.copy()

In [36]: grouper[indexer<20] = 0

In [37]: grouper[indexer>20] = 1

In [95]: grouper[(indexer<20) & (indexer.cumsum()>50)] = 2

In [96]: grouper
Out[96]: 
2013-01-01 10:34:00    0
2013-01-01 10:34:08    0
2013-01-01 10:34:08    0
2013-01-01 10:34:15    0
2013-01-01 10:34:28    0
2013-01-01 10:34:54    1
2013-01-01 10:34:55    2
2013-01-01 10:35:12    2
dtype: float64

Group them (you can also use an apply here)

In [97]: s.groupby(grouper).sum()
Out[97]: 
0    10
1     5
2    13
dtype: int64
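
If you also want to see the boundaries of each elastic window rather than an aggregate, a small sketch along these lines (not part of the original answer) should work:

# first and last timestamp of every elastic window
s.groupby(grouper).apply(lambda g: (g.index.min(), g.index.max()))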

Answer by Arseniy

You might want to consider using apply:

def my_grouper(datetime_value):
    # some_group is a placeholder for whatever logic maps a timestamp to a group label
    return some_group(datetime_value)

df.groupby(df['date_time'].apply(my_grouper))

It's up to you to implement whatever grouping logic you need in your grouper function. By the way, merging overlapping time ranges is an inherently iterative task: for example, with A = (0, 10), B = (20, 30) and C = (10, 20), once C appears, all three of A, B and C should be merged.

UPD:

This is my ugly version of the merging algorithm:

import pandas as pd

# maps (begin, end) ranges to group ids; overlapping ranges end up sharing an id
groups = {}
max_group_id = 1

def in_range(val, begin, end):
    return begin <= val <= end

def find_merged_group(begin, end):
    global max_group_id
    found_common_group = None
    full_wraps = []

    for (group_start, group_end), group in groups.items():
        begin_inclusion = in_range(begin, group_start, group_end)
        end_inclusion = in_range(end, group_start, group_end)
        full_inclusion = begin_inclusion and end_inclusion
        full_wrap = (not begin_inclusion and not end_inclusion
                     and in_range(group_start, begin, end)
                     and in_range(group_end, begin, end))
        if full_inclusion:
            # the new range lies entirely inside an existing group
            groups[(begin, end)] = group
            return group
        if full_wrap:
            # the new range completely contains an existing group
            full_wraps.append(group)
        elif begin_inclusion or end_inclusion:
            # partial overlap: remember the first overlapping group,
            # then fold any further overlapping groups into it
            if not found_common_group:
                found_common_group = group
            else:  # merge
                for rng, g in groups.items():
                    if g == group:
                        groups[rng] = found_common_group

    if not found_common_group:
        # no overlap with anything seen so far: start a new group
        found_common_group = max_group_id
        max_group_id += 1
    groups[(begin, end)] = found_common_group
    return found_common_group

def my_grouper(date_time):
    # the +/- buffer around each timestamp; pd.Timedelta(seconds=1) is an arbitrary
    # stand-in for the original "+/- 1" and should be set to your own buffer
    buffer = pd.Timedelta(seconds=1)
    return find_merged_group(date_time - buffer, date_time + buffer)

df['datetime'].apply(my_grouper)  # first run just fills the groups dict
grouped = df.groupby(df['datetime'].apply(my_grouper))  # this run uses the already merged groups

Answer by acushner

try this:

  • create a column tsdiff that has the diffs between consecutive times (using shift)
  • df['new_group'] = df.tsdiff > timedelta
  • fillna on the new_group column
  • groupby that column

this is just really rough pseudocode, but the solution's in there somewhere...

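A minimal sketch of those steps, assuming a DataFrame df with a datetime column named date_time and a 20-second buffer (the column name and the sample data are made up for illustration):

import pandas as pd

df = pd.DataFrame({'date_time': pd.to_datetime([
    '2013-01-01 10:34:00', '2013-01-01 10:34:12', '2013-01-01 10:34:28',
    '2013-01-01 10:34:54', '2013-01-01 10:34:55', '2013-01-01 10:35:19',
    '2013-01-01 10:35:30'])})

df = df.sort_values('date_time')
df['tsdiff'] = df['date_time'].diff()                      # gap to the previous event
df['new_group'] = df['tsdiff'] > pd.Timedelta(seconds=20)  # True where a new window starts
# NaT on the first row compares as False, which plays the role of the fillna step
window_id = df['new_group'].cumsum()                       # running count labels each window

for _, window in df.groupby(window_id):
    print(window['date_time'].tolist())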