PostgreSQL SQL GROUP BY time interval with arbitrary precision (down to milliseconds)

Disclaimer: this page reproduces a popular StackOverflow Q&A under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/12045600/

Tags: postgresql, group-by

Asked by user1612798

I have my measurement data stored in the following structure:

CREATE TABLE measurements(
measured_at TIMESTAMPTZ,
val INTEGER
);
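
To make the examples below concrete, here are a few hypothetical sample rows (my own illustration, not from the question):

INSERT INTO measurements (measured_at, val) VALUES
    ('2012-08-20 10:00:07+00', 12),
    ('2012-08-20 10:02:54+00', 17),
    ('2012-08-20 10:06:15+00',  9);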

I already know that using

(a) date_trunc('hour',measured_at)

AND

(b) generate_series

I would be able to aggregate my data by:

microseconds,
milliseconds,
…

But is it possible to aggregate the data by 5 minutes, or, say, by an arbitrary multiple of seconds?

I need the data aggregated at different time resolutions to feed into an FFT or an AR model, in order to look for possible seasonalities.

Answered by Mike Sherrill 'Cat Recall'

You can generate a table of "buckets" by adding intervals created by generate_series(). This SQL statement will generate a table of five-minute buckets for the first day (the value of min(measured_at)) in your data.

select 
  (select min(measured_at)::date from measurements) + ( n    || ' minutes')::interval start_time,
  (select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, (24*60), 5) n

Wrap that statement in a common table expression, and you can join and group on it as if it were a base table.

with five_min_intervals as (
  select 
    (select min(measured_at)::date from measurements) + ( n    || ' minutes')::interval start_time,
    (select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
  from generate_series(0, (24*60), 5) n
)
select f.start_time, f.end_time, avg(m.val) avg_val 
from measurements m
right join five_min_intervals f 
        on m.measured_at >= f.start_time and m.measured_at < f.end_time
group by f.start_time, f.end_time
order by f.start_time

Grouping by an arbitrary number of seconds is similar--use date_trunc().
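
For example, a sketch of the same generate_series() pattern with 30-second buckets (my addition, not part of the original answer):

with thirty_sec_intervals as (
  select
    (select min(measured_at)::date from measurements) + ( n     || ' seconds')::interval start_time,
    (select min(measured_at)::date from measurements) + ((n+30) || ' seconds')::interval end_time
  from generate_series(0, 24*60*60, 30) n
)
select f.start_time, avg(m.val) avg_val
from measurements m
right join thirty_sec_intervals f
        on m.measured_at >= f.start_time and m.measured_at < f.end_time
group by f.start_time, f.end_time
order by f.start_time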

A more general use of generate_series() lets you avoid guessing the upper limit for five-minute buckets. In practice, you'd probably build this as a view or a function. You might get better performance from a base table.

select 
  (select min(measured_at)::date from measurements) + ( n    || ' minutes')::interval start_time,
  (select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, ((select max(measured_at)::date - min(measured_at)::date from measurements) + 1)*24*60, 5) n;
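
A minimal sketch of the "build this as a view" idea mentioned above (the view name is my own):

create view five_min_buckets as
select
  (select min(measured_at)::date from measurements) + ( n    || ' minutes')::interval start_time,
  (select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, ((select max(measured_at)::date - min(measured_at)::date from measurements) + 1)*24*60, 5) n;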

Answered by Julian

Catcall has a great answer. My example of using it demonstrates fixed buckets, in this case 30-minute intervals starting at midnight. It also shows that Catcall's first version can generate one extra bucket, and how to eliminate it. I wanted exactly 48 buckets in a day. In my problem, observations have separate date and time columns, and I want to average the observations within each 30-minute period across the month for a number of different services.

with intervals as (
    select
        (n||' minutes')::interval as start_time,
        ((n+30)|| ' minutes')::interval as end_time
    -- stopping the series at 23*60+30 minutes yields exactly 48 half-hour buckets
    from generate_series(0, (23*60+30), 30) n
)
select i.start_time, o.service, avg(o.o)
from
observations o right join intervals i
on o.time >= i.start_time and o.time < i.end_time
where o.date between '2013-01-01' and '2013-01-31'
group by i.start_time, i.end_time, o.service
order by i.start_time

Answered by grisha

How about

SELECT MIN(val),
       -- floor() is needed so that every row in the same five-minute span lands in the
       -- same group; without it each distinct timestamp becomes its own group
       floor(EXTRACT(epoch FROM measured_at) / EXTRACT(epoch FROM INTERVAL '5 min')) AS int
FROM measurements
GROUP BY int

where '5 min' can be any expression supported by INTERVAL

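Since the grouped key is just a bucket number, you may want to map it back to a timestamp. A minimal sketch of that step (my addition, not part of the original answer):

SELECT MIN(val),
       to_timestamp(floor(EXTRACT(epoch FROM measured_at) / 300) * 300) AS bucket_start
FROM measurements
GROUP BY bucket_start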

Answered by Chris Cogdon

The following will give you buckets of any size, even if they don't align well with a nice minute/hour/whatever boundary. The value "300" is for a five-minute grouping, but any value can be substituted:

select measured_at,
       val,
       -- subtract the epoch to get an interval, divide by the bucket size (300 s),
       -- truncate to whole seconds, then scale back up and re-add the epoch
       (date_trunc('seconds', (measured_at - timestamptz 'epoch') / 300) * 300 + timestamptz 'epoch') as aligned_measured_at
from measurements;

You can then use whatever aggregate you need around "val", and use "group by aligned_measured_at" as required.

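For example, a minimal sketch of the grouped form (my addition, following the answer's advice):

select (date_trunc('seconds', (measured_at - timestamptz 'epoch') / 300) * 300 + timestamptz 'epoch') as aligned_measured_at,
       avg(val) as avg_val
from measurements
group by aligned_measured_at
order by aligned_measured_at;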

Answered by Bill

This is based on Mike Sherrill's answer, except that it uses timestamp ranges (tstzrange) instead of separate start/end columns.

with intervals as (
    select tstzrange(s, s + '5 minutes') das_interval
    from (select generate_series(min(lower(time_range)), max(upper(time_range)), '5 minutes') s
          from your_table) x)
select das_interval, your_table.*
from   your_table
right join intervals on time_range && das_interval  -- && tests whether the two ranges overlap
order by das_interval;

Answered by Barrett Clark

I wanted to look at the past 24 hours of data and count things in hourly increments. I started with Cat Recall's solution, which is pretty slick. It's bound to the data, though, rather than to just what's happened in the past 24 hours. So I refactored and ended up with something pretty close to Julian's solution, but with more CTEs. So it's sort of the marriage of the two answers.

WITH interval_query AS (
    -- hourly offsets: 0 through 23 hours
    SELECT (ts ||' hour')::INTERVAL AS hour_interval
    FROM generate_series(0,23) AS ts
), time_series AS (
    -- now(), rounded to the nearest hour, stepped back by each offset
    SELECT date_trunc('hour', now()) + INTERVAL '60 min' * ROUND(date_part('minute', now()) / 60.0) - interval_query.hour_interval AS start_time
    FROM interval_query
), time_intervals AS (
    -- turn each start time into a [start_time, start_time + 1 hour) bucket
    SELECT start_time, start_time + '1 hour'::INTERVAL AS end_time
    FROM time_series ORDER BY start_time
), reading_counts AS (
    SELECT f.start_time, f.end_time, br.minor, count(br.id) readings
    FROM beacon_readings br
    RIGHT JOIN time_intervals f
                    ON br.reading_timestamp >= f.start_time AND br.reading_timestamp < f.end_time AND br.major = 4
    GROUP BY f.start_time, f.end_time, br.minor
    ORDER BY f.start_time, br.minor
)
SELECT * FROM reading_counts

Note that any additional limiting I wanted in the final query needed to be done in the RIGHT JOIN (moving a condition like br.major = 4 into a WHERE clause would discard the NULL rows for empty buckets and effectively turn the right join into an inner join). I'm not suggesting this is necessarily the best (or even a good) approach, but it is something I'm running with (at least at the moment) in a dashboard.

Answered by Bolster

I've taken a synthesis of all the above to try and come up with something slightly easier to use:

create or replace function interval_generator(start_ts timestamp with TIME ZONE, end_ts timestamp with TIME ZONE, round_interval INTERVAL)
    returns TABLE(start_time timestamp with TIME ZONE, end_time timestamp with TIME ZONE) as $$
BEGIN
    return query
        SELECT
            (n)                  start_time,
            (n + round_interval) end_time
        FROM generate_series(date_trunc('minute', start_ts), end_ts, round_interval) n;
END
$$ LANGUAGE plpgsql;

This function is a timestamp abstraction of Mike's answer, which (IMO) makes things a little cleaner, especially if you're generating queries on the client end.

Using an inner join also gets rid of the sea of NULLs that appeared previously.

with intervals as (select * from interval_generator(NOW() - INTERVAL '24 hours' , NOW(), '30 seconds'::INTERVAL))
select f.start_time, m.session_id, m.metric, min(m.value) min_val, avg(m.value) avg_val, max(m.value) max_val
from ts_combined as m
inner JOIN intervals f
    on m.time >= f.start_time and m.time < f.end_time
GROUP BY f.start_time, f.end_time, m.metric, m.session_id
ORDER BY f.start_time desc

(Also for my purposes I added in a few more aggregation fields)

Answered by Michael Krelin - hacker

Perhaps you can extract(epoch from measured_at) and go from that?

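For instance, a minimal sketch of that idea, bucketing into whole 300-second blocks of the Unix epoch (my addition, not from the original answer):

select floor(extract(epoch from measured_at) / 300) * 300 as bucket_epoch,
       avg(val) as avg_val
from measurements
group by bucket_epoch
order by bucket_epoch;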

Answered by TomCB

The Timescale extension for PostgreSQL gives you the ability to group by arbitrary time intervals. The function is called time_bucket() and has the same syntax as the date_trunc() function, but takes an interval instead of a time precision as its first parameter. You can find its API docs here. This is an example:

SELECT
  time_bucket('5 minutes', observation_time) as bucket,
  device_id,
  avg(metric) as metric_avg,
  max(metric) - min(metric) as metric_spread
FROM
  device_readings
GROUP BY bucket, device_id;

You may also take a look at continuous aggregate views if you want the "grouped by an interval" views to be updated automatically as new data is ingested, and if you query these views frequently. This can save you a lot of resources and make your queries much faster.

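For illustration, a sketch of such a continuous aggregate over the example above (the view name is made up, and the syntax assumes a recent TimescaleDB version):

CREATE MATERIALIZED VIEW device_readings_5m
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('5 minutes', observation_time) AS bucket,
  device_id,
  avg(metric) AS metric_avg
FROM device_readings
GROUP BY bucket, device_id;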