Pandas group by time windows

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/22769047/

Pandas group by time windows

python, pandas

Question by Kafonek

EDIT: Session generation from log file analysis with pandas seems to be exactly what I was looking for.

I have a dataframe that includes non-unique time stamps, and I'd like to group them by time windows. The basic logic would be -

1) Create a time range from each time stamp by adding n minutes before and after the time stamp.

2) Group by time ranges that overlap. The end effect would be that a time window could be as small as a single time stamp +/- the time buffer, but there is no cap on how large a time window could be, as long as consecutive events are closer together than the time buffer.

It feels like a df.groupby(pd.TimeGrouper(minutes=n)) is the right answer, but I don't know how to have the TimeGrouper create dynamic time ranges when it sees events that are within a time buffer.

For instance, if I try a TimeGrouper('20s') against a set of events: 10:34:00, 10:34:08, 10:34:08, 10:34:15, 10:34:28 and 10:34:54, then pandas will give me three groups (events falling between 10:34:00 - 10:34:20, 10:34:20 - 10:34:40, and 10:34:40-10:35:00). I would like to just get two groups back, 10:34:00 - 10:34:28, since there is no more than a 20 second gap between events in that time range, and a second group that is 10:34:54.

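For reference, here is a minimal sketch that reproduces those static bins (pd.TimeGrouper has since been removed from pandas; pd.Grouper(freq=...) is the current equivalent):

import pandas as pd

events = pd.Series(range(6), index=pd.to_datetime([
    '2013-01-01 10:34:00', '2013-01-01 10:34:08', '2013-01-01 10:34:08',
    '2013-01-01 10:34:15', '2013-01-01 10:34:28', '2013-01-01 10:34:54']))

# bins start on fixed 20-second boundaries (10:34:00, 10:34:20, 10:34:40),
# so these six events land in three groups no matter how close together they are
print(events.groupby(pd.Grouper(freq='20s')).size())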

What is the best way to find temporal windows that are not static bins of time ranges?

Given a Series that looks something like -

      time
0     2013-01-01 10:34:00+00:00
1     2013-01-01 10:34:12+00:00
2     2013-01-01 10:34:28+00:00
3     2013-01-01 10:34:54+00:00
4     2013-01-01 10:34:55+00:00
5     2013-01-01 10:35:19+00:00
6     2013-01-01 10:35:30+00:00

If I do a df.groupby(pd.TimeGrouper('20s')) on that Series, I would get back 5 groups, 10:34:00-:20, :20-:40, :40-10:35:00, etc. What I want to do is have some function that creates elastic time ranges: as long as events are within 20 seconds of each other, keep expanding the time range. So I expect to get back -

2013-01-01 10:34:00 - 2013-01-01 10:34:48 
    0 2013-01-01 10:34:00+00:00
    1 2013-01-01 10:34:12+00:00
    2 2013-01-01 10:34:28+00:00

2013-01-01 10:34:54 - 2013-01-01 10:35:15
    3 2013-01-01 10:34:54+00:00
    4 2013-01-01 10:34:55+00:00

2013-01-01 10:35:19 - 2013-01-01 10:35:50
    5 2013-01-01 10:35:19+00:00
    6 2013-01-01 10:35:30+00:00

Thanks.

Answer by Jeff

This is how to create a custom grouper. (It requires pandas >= 0.13 for the timedelta computations, but would otherwise work in other versions.)

Create your series

In [31]: s = pd.Series(range(8), pd.to_datetime(['20130101 10:34','20130101 10:34:08', '20130101 10:34:08', '20130101 10:34:15', '20130101 10:34:28', '20130101 10:34:54','20130101 10:34:55','20130101 10:35:12']))

In [32]: s
Out[32]: 
2013-01-01 10:34:00    0
2013-01-01 10:34:08    1
2013-01-01 10:34:08    2
2013-01-01 10:34:15    3
2013-01-01 10:34:28    4
2013-01-01 10:34:54    5
2013-01-01 10:34:55    6
2013-01-01 10:35:12    7
dtype: int64

This just computes the time difference in seconds between successive elements, but could actually be anything

In [33]: indexer = s.index.to_series().order().diff().fillna(0).astype('timedelta64[s]')

In [34]: indexer
Out[34]: 
2013-01-01 10:34:00     0
2013-01-01 10:34:08     8
2013-01-01 10:34:08     0
2013-01-01 10:34:15     7
2013-01-01 10:34:28    13
2013-01-01 10:34:54    26
2013-01-01 10:34:55     1
2013-01-01 10:35:12    17
dtype: float64
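
In newer pandas versions Series.order() no longer exists; a roughly equivalent computation (not part of the original answer, assuming pandas >= 1.0) would be:

indexer = s.index.to_series().sort_values().diff().fillna(pd.Timedelta(0)).dt.total_seconds()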

Arbitrarily assign rows where the diff is < 20s to group 0, else to group 1. This could also be more elaborate: if the diff from the previous row is < 20s BUT the total diff (from the first row) is > 50s, put it in group 2.

In [35]: grouper = indexer.copy()

In [36]: grouper[indexer<20] = 0

In [37]: grouper[indexer>20] = 1

In [95]: grouper[(indexer<20) & (indexer.cumsum()>50)] = 2

In [96]: grouper
Out[96]: 
2013-01-01 10:34:00    0
2013-01-01 10:34:08    0
2013-01-01 10:34:08    0
2013-01-01 10:34:15    0
2013-01-01 10:34:28    0
2013-01-01 10:34:54    1
2013-01-01 10:34:55    2
2013-01-01 10:35:12    2
dtype: float64

Group them (you can also use an apply here)

In [97]: s.groupby(grouper).sum()
Out[97]: 
0    10
1     5
2    13
dtype: int64
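
If you also want to see the boundaries of each elastic window rather than an aggregate, a small sketch along these lines (not part of the original answer) should work:

# first and last timestamp of every elastic window
s.groupby(grouper).apply(lambda g: (g.index.min(), g.index.max()))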

Answer by Arseniy

You might want to consider using apply:

def my_grouper(datetime_value):
    # some_group is a placeholder for whatever logic maps a timestamp to a group label
    return some_group(datetime_value)

df.groupby(df['date_time'].apply(my_grouper))

It's up to you to implement whatever grouping logic you need in your grouper function. By the way, merging overlapping time ranges is an inherently iterative task: for example, with A = (0, 10), B = (20, 30) and C = (10, 20), once C appears, all three of A, B and C should be merged.

UPD:

This is my ugly version of the merging algorithm:

import pandas as pd

# maps (begin, end) ranges to group ids; overlapping ranges end up sharing an id
groups = {}
max_group_id = 1

def in_range(val, begin, end):
    return begin <= val <= end

def find_merged_group(begin, end):
    global max_group_id
    found_common_group = None
    full_wraps = []

    for (group_start, group_end), group in groups.items():
        begin_inclusion = in_range(begin, group_start, group_end)
        end_inclusion = in_range(end, group_start, group_end)
        full_inclusion = begin_inclusion and end_inclusion
        full_wrap = (not begin_inclusion and not end_inclusion
                     and in_range(group_start, begin, end)
                     and in_range(group_end, begin, end))
        if full_inclusion:
            # the new range lies entirely inside an existing group
            groups[(begin, end)] = group
            return group
        if full_wrap:
            # the new range completely contains an existing group
            full_wraps.append(group)
        elif begin_inclusion or end_inclusion:
            # partial overlap: remember the first overlapping group,
            # then fold any further overlapping groups into it
            if not found_common_group:
                found_common_group = group
            else:  # merge
                for rng, g in groups.items():
                    if g == group:
                        groups[rng] = found_common_group

    if not found_common_group:
        # no overlap with anything seen so far: start a new group
        found_common_group = max_group_id
        max_group_id += 1
    groups[(begin, end)] = found_common_group
    return found_common_group

def my_grouper(date_time):
    # the +/- buffer around each timestamp; pd.Timedelta(seconds=1) is an arbitrary
    # stand-in for the original "+/- 1" and should be set to your own buffer
    buffer = pd.Timedelta(seconds=1)
    return find_merged_group(date_time - buffer, date_time + buffer)

df['datetime'].apply(my_grouper)  # first run just fills the groups dict
grouped = df.groupby(df['datetime'].apply(my_grouper))  # this run uses the already merged groups

Answer by acushner

try this:

  • create a column tsdiff that has the diffs between consecutive times (using shift)
  • df['new_group'] = df.tsdiff > timedelta
  • fillna on the new_group column
  • groupby that column

this is just really rough pseudocode, but the solution's in there somewhere...

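A minimal sketch of those steps, assuming a DataFrame df with a datetime column named date_time and a 20-second buffer (the column name and the sample data are made up for illustration):

import pandas as pd

df = pd.DataFrame({'date_time': pd.to_datetime([
    '2013-01-01 10:34:00', '2013-01-01 10:34:12', '2013-01-01 10:34:28',
    '2013-01-01 10:34:54', '2013-01-01 10:34:55', '2013-01-01 10:35:19',
    '2013-01-01 10:35:30'])})

df = df.sort_values('date_time')
df['tsdiff'] = df['date_time'].diff()                      # gap to the previous event
df['new_group'] = df['tsdiff'] > pd.Timedelta(seconds=20)  # True where a new window starts
# NaT on the first row compares as False, which plays the role of the fillna step
window_id = df['new_group'].cumsum()                       # running count labels each window

for _, window in df.groupby(window_id):
    print(window['date_time'].tolist())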