Original question: http://stackoverflow.com/questions/6201992/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Optimal way to create a histogram/frequency distribution in Oracle?
Asked by matt b
I have an events table with two columns, eventkey (unique, primary key) and createtime, which stores the creation time of the event as the number of milliseconds since Jan 1 1970 in a NUMBER column.
I would like to create a "histogram" or frequency distribution that shows me how many events were created in each hour of the past week.
Is this the best way to write such a query in Oracle, using the width_bucket() function? Is it possible to derive the number of rows that fall into each bucket using one of the other Oracle analytic functions, rather than using width_bucket to determine what bucket number each row belongs to and doing a count(*) over that?
-- 1305504000000 = 5/16/2011 12:00am GMT
-- 1306108800000 = 5/23/2011 12:00am GMT
select
  -- width_bucket numbers buckets from 1, so (bucket - 1) is the offset of the bucket's start
  timestamp '1970-01-01 00:00:00' + numtodsinterval((1305504000000/1000 + ((bucket - 1) * 60 * 60)), 'second') period_start,
  numevents
from (
  select bucket, count(*) as numevents from (
    select eventkey, createtime,
      width_bucket(createtime, 1305504000000, 1306108800000, 24 * 7) bucket
    from events
    where createtime between 1305504000000 and 1306108800000
  ) group by bucket
)
order by period_start
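As a quick sanity check of the two boundary constants in the comments above, the same conversion expression can be run against dual. This is only a sketch; the column aliases range_start and range_end are made up for illustration, and the display format of the result depends on the session's NLS settings.

```sql
-- Convert the two epoch-millisecond boundaries to timestamps.
-- A plain TIMESTAMP carries no time zone, so the values are GMT wall-clock times:
-- midnight on 16 May 2011 and midnight on 23 May 2011, exactly 7 days apart.
select
  timestamp '1970-01-01 00:00:00' + numtodsinterval(1305504000000 / 1000, 'second') as range_start,
  timestamp '1970-01-01 00:00:00' + numtodsinterval(1306108800000 / 1000, 'second') as range_end
from dual;
```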
Answered by Adam Musch
If your createtime were a date column, this would be trivial:
SELECT TO_CHAR(CREATE_TIME, 'DAY:HH24'), COUNT(*)
FROM EVENTS
GROUP BY TO_CHAR(CREATE_TIME, 'DAY:HH24');
As it is, casting the createtime column isn't too hard:
select TO_CHAR(
         TO_DATE('19700101', 'YYYYMMDD') + createtime / 86400000,
         'DAY:HH24') AS BUCKET, COUNT(*)
FROM EVENTS
WHERE createtime between 1305504000000 and 1306108800000
group by TO_CHAR(
         TO_DATE('19700101', 'YYYYMMDD') + createtime / 86400000,
         'DAY:HH24')
order by 1
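One thing to keep in mind with the 'DAY:HH24' label is that it sorts alphabetically by day name, and it would fold the same weekday from different weeks together if the range were ever widened beyond seven days. A possible variant, sketched here with three made-up rows standing in for the real events table, uses a 'YYYY-MM-DD HH24' label, which sorts chronologically as a string:

```sql
with events as (
  -- three hypothetical rows standing in for the real events table
  select 1 as eventkey, 1305504000000 as createtime from dual union all  -- 2011-05-16 00:00:00 GMT
  select 2, 1305505800000 from dual union all                            -- 2011-05-16 00:30:00 GMT
  select 3, 1305507600000 from dual                                      -- 2011-05-16 01:00:00 GMT
)
select to_char(to_date('19700101', 'YYYYMMDD') + createtime / 86400000,
               'YYYY-MM-DD HH24') as bucket,
       count(*) as numevents
from events
group by to_char(to_date('19700101', 'YYYYMMDD') + createtime / 86400000,
                 'YYYY-MM-DD HH24')
order by 1;
```

With the sample rows, the first two events share the 00 hour and the third falls into the 01 hour.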
If, alternatively, you're looking for the fencepost values (for example, where the first decile (0-10%) ends and the next (11-20%) begins), you'd do something like:
select min(createtime) over (partition by decile) as decile_start,
max(createtime) over (partition by decile) as decile_end,
decile
from (select createtime,
ntile (10) over (order by createtime asc) as decile
from events
where createtime between 1305504000000 and 1306108800000
)
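Because min() and max() are used as analytic functions there, the query returns one row per event, with the decile boundaries repeated on every row. If one row per decile is wanted instead, a GROUP BY over the same ntile subquery is one way to get it. The sketch below generates its own random sample data (the events CTE and the numevents alias are assumptions for illustration, not part of the answer):

```sql
with events as (
  -- hypothetical sample data: 1000 random timestamps within the week in question
  select rownum as eventkey,
         round(dbms_random.value(1305504000000, 1306108800000)) as createtime
  from dual
  connect by level <= 1000
)
select decile,
       min(createtime) as decile_start,
       max(createtime) as decile_end,
       count(*)        as numevents
from (select createtime,
             ntile(10) over (order by createtime asc) as decile
      from events
      where createtime between 1305504000000 and 1306108800000)
group by decile
order by decile;
```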
Answered by Denis de Bernardy
I'm unfamiliar with Oracle's date functions, but I'm pretty certain there's an equivalent way of writing this Postgres statement:
select date_trunc('hour', stamp), count(*)
from your_data
group by date_trunc('hour', stamp)
order by date_trunc('hour', stamp)
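For reference, the closest Oracle counterpart to date_trunc('hour', ...) on a DATE column is probably TRUNC with an hour format model. A minimal sketch, assuming a hypothetical your_data table whose stamp column is a DATE (three sample rows are inlined so the query runs on its own):

```sql
with your_data as (
  -- hypothetical sample rows; stamp is a DATE here
  select to_date('2011-05-16 00:15:00', 'YYYY-MM-DD HH24:MI:SS') as stamp from dual union all
  select to_date('2011-05-16 00:45:00', 'YYYY-MM-DD HH24:MI:SS') from dual union all
  select to_date('2011-05-16 01:05:00', 'YYYY-MM-DD HH24:MI:SS') from dual
)
select trunc(stamp, 'HH24') as period_start,  -- truncate to the start of the hour
       count(*)             as numevents
from your_data
group by trunc(stamp, 'HH24')
order by trunc(stamp, 'HH24');
```

Note that TRUNC returns a DATE even when it is given a TIMESTAMP, so any fractional seconds are dropped along the way.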
Answered by Craig
Pretty much the same response as Adam, but I would prefer to keep the period_start as a time field so it is easier to filter further if needed:
with
events as
(
select rownum eventkey, round(dbms_random.value(1305504000000, 1306108800000)) createtime
from dual
connect by level <= 1000
)
select
trunc(timestamp '1970-01-01 00:00:00' + numtodsinterval(createtime/1000, 'second'), 'HH') period_start,
count(*) numevents
from
events
where
createtime between 1305504000000 and 1306108800000
group by
trunc(timestamp '1970-01-01 00:00:00' + numtodsinterval(createtime/1000, 'second'), 'HH')
order by
period_start
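A GROUP BY like this only returns hours that contain at least one event. If the histogram should also show empty hours with a count of zero, one option is to generate all 168 hour starts and outer-join the counts onto them. This is a sketch under assumptions: the hours and counts CTE names are invented here, the sample data is deliberately sparse so that gaps appear, and the upper bound is treated as exclusive.

```sql
with
events as (
  -- hypothetical sample data, sparse enough to leave some hours empty
  select rownum as eventkey,
         round(dbms_random.value(1305504000000, 1306108800000)) as createtime
  from dual
  connect by level <= 50
),
hours as (
  -- one row per hour of the week: 24 * 7 = 168 period starts
  select trunc(timestamp '1970-01-01 00:00:00'
               + numtodsinterval(1305504000000 / 1000 + (level - 1) * 3600, 'second'), 'HH') as period_start
  from dual
  connect by level <= 24 * 7
),
counts as (
  -- same hourly grouping as above, with an exclusive upper bound
  select trunc(timestamp '1970-01-01 00:00:00' + numtodsinterval(createtime / 1000, 'second'), 'HH') as period_start,
         count(*) as numevents
  from events
  where createtime >= 1305504000000 and createtime < 1306108800000
  group by trunc(timestamp '1970-01-01 00:00:00' + numtodsinterval(createtime / 1000, 'second'), 'HH')
)
select h.period_start,
       nvl(c.numevents, 0) as numevents  -- hours without events show 0
from hours h
left join counts c on c.period_start = h.period_start
order by h.period_start;
```

Both sides of the join go through the same trunc(..., 'HH') expression, so they have the same DATE type and compare cleanly.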
Answered by hychou
Use the Oracle-provided function WIDTH_BUCKET to bucket continuous or finely discrete data. The following example creates a histogram with 5 buckets over "COLUMN_VALUE" values from 510 to 520 (so each bucket covers a range of width 2). WIDTH_BUCKET also produces two additional buckets, id=0 and id=num_buckets+1, for values below min and values at or above max.
SELECT "BUCKET_ID", count(*),
CASE
WHEN "BUCKET_ID"=0 THEN -1/0F
ELSE 510+(520-510)/5*("BUCKET_ID"-1)
END "BUCKET_MIN",
CASE
WHEN "BUCKET_ID"=5+1 THEN 1/0F
ELSE 510+(520-510)/5*("BUCKET_ID")
END "BUCKET_MAX"
FROM
(
SELECT "COLUMN_VALUE",
WIDTH_BUCKET("COLUMN_VALUE", 510, 520, 5) "BUCKET_ID"
FROM "MY_TABLE"
)
group by "BUCKET_ID"
ORDER BY "BUCKET_ID";
Sample output
 BUCKET_ID   COUNT(*) BUCKET_MIN BUCKET_MAX
---------- ---------- ---------- ----------
         0         45       -Inf   5.1E+002
         1        220   5.1E+002  5.12E+002
         2        189  5.12E+002  5.14E+002
         3         43  5.14E+002  5.16E+002
         4          3  5.16E+002  5.18E+002
My table has no values in the range 518-520, so the bucket with id=5 is not shown. On the other hand, there are values below min (510), so there is a bucket with id=0 that collects everything from -Inf up to 510.
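To see how WIDTH_BUCKET treats values at and around the edges, a small probe against dual can help. The column aliases below are made up for illustration; the bucket numbers in the comments follow from the function's definition, where each bucket includes its lower bound and excludes its upper bound.

```sql
-- Which bucket does WIDTH_BUCKET(expr, 510, 520, 5) assign at and around the edges?
select width_bucket(509.9, 510, 520, 5) as below_min,     -- 0: underflow bucket
       width_bucket(510,   510, 520, 5) as at_min,        -- 1: the lower bound is inclusive
       width_bucket(511.9, 510, 520, 5) as inside_first,  -- 1
       width_bucket(512,   510, 520, 5) as second_start,  -- 2
       width_bucket(519.9, 510, 520, 5) as inside_last,   -- 5
       width_bucket(520,   510, 520, 5) as at_max         -- 6: the upper bound is exclusive (overflow)
from dual;
```

One consequence is that a filter such as BETWEEN min AND max lets a value exactly equal to max through, and that value lands in the overflow bucket rather than in the last regular one.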