Get sequential number of a row (rank) within a partition without using ROW_NUMBER() OVER function
Original question: http://stackoverflow.com/questions/23425484/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Asked by Andrey Dmitriev
I need to rank rows by partition (or group), i.e. if my source table is:
NAME PRICE
---- -----
AAA 1.59
AAA 2.00
AAA 0.75
BBB 3.48
BBB 2.19
BBB 0.99
BBB 2.50
I would like to get target table:
RANK NAME PRICE
---- ---- -----
1 AAA 0.75
2 AAA 1.59
3 AAA 2.00
1 BBB 0.99
2 BBB 2.19
3 BBB 2.50
4 BBB 3.48
Normally I would use the ROW_NUMBER() OVER function, so in Apache Hive it would be:
select
row_number() over (partition by NAME order by PRICE) as RANK,
NAME,
PRICE
from
MY_TABLE
;
Unfortunately, Cloudera Impala does not (at the moment) support the ROW_NUMBER() OVER function, so I'm looking for a workaround. Preferably one that does not use a UDAF, as it will be politically difficult to convince anyone to deploy it to the server.
Thank you for your help.
Answered by Gordon Linoff
If you can't do it with a correlated subquery, you can still do this with a join:
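select t1.name, t1.price,
       coalesce(count(t2.name) + 1, 1)
from my_table t1 left join  -- left join keeps the cheapest row of each name
     my_table t2
     on t2.name = t1.name and
        t2.price < t1.price
group by t1.name, t1.price  -- count() needs one group per output row
order by t1.name, t1.price;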
Note that this doesn't exactly do row_number() unless all the prices are distinct for a given name. This formulation is actually equivalent to rank().
For row_number(), you need a unique row identifier.
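To illustrate why a unique row identifier is needed, here is a minimal sketch that uses such an identifier as a tie-breaker. It runs on SQLite through Python's sqlite3 module as a stand-in engine, which is an assumption: the question targets Impala, and this has not been checked there. The `id` column and the `rn` alias are hypothetical and do not appear in the original table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical unique identifier "id" added to the question's table.
conn.execute(
    "create table my_table (id integer primary key, name text, price real)"
)
conn.executemany(
    "insert into my_table (name, price) values (?, ?)",
    [("AAA", 1.59), ("AAA", 2.00), ("AAA", 0.75), ("AAA", 0.75),
     ("BBB", 3.48), ("BBB", 0.99)],
)

# A row counts as "before" the current row if it is cheaper, or if it has
# the same price and a smaller id. Ties are therefore broken by id.
row_number_sql = """
select t1.name, t1.price, count(t2.id) + 1 as rn
from my_table t1
left join my_table t2
  on t2.name = t1.name
 and (t2.price < t1.price
      or (t2.price = t1.price and t2.id < t1.id))
group by t1.id, t1.name, t1.price
order by t1.name, rn
"""
row_number_rows = conn.execute(row_number_sql).fetchall()
for row in row_number_rows:
    print(row)
```

The two AAA rows priced 0.75 receive different numbers here, which is what separates row_number() from rank().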
By the way, the following is equivalent to dense_rank():
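select t1.name, t1.price,
       coalesce(count(distinct t2.price) + 1, 1)  -- count distinct lower prices
from my_table t1 left join  -- left join keeps the cheapest row of each name
     my_table t2
     on t2.name = t1.name and
        t2.price < t1.price
group by t1.name, t1.price
order by t1.name, t1.price;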
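The difference between the rank() and dense_rank() formulations only shows up when prices repeat, so here is a small sketch with a tie in the data. It runs on SQLite through Python's sqlite3 module, which is an assumption made for convenience (the question is about Impala), and the aliases `rnk` and `dense_rnk` are hypothetical. Grouping by (name, price) alone would multiply the count when a price occurs more than once in t1, so the sketch deduplicates t1 first and returns one row per distinct (name, price).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table my_table (name text, price real)")
conn.executemany(
    "insert into my_table values (?, ?)",
    [("AAA", 1.59), ("AAA", 2.00), ("AAA", 0.75), ("AAA", 1.59),
     ("BBB", 3.48), ("BBB", 0.99)],
)

# count(t2.price)          counts every cheaper row         -> rank()
# count(distinct t2.price) counts distinct cheaper prices   -> dense_rank()
rank_sql = """
select t1.name, t1.price,
       count(t2.price) + 1          as rnk,
       count(distinct t2.price) + 1 as dense_rnk
from (select distinct name, price from my_table) t1
left join my_table t2
  on t2.name = t1.name
 and t2.price < t1.price
group by t1.name, t1.price
order by t1.name, t1.price
"""
rank_rows = conn.execute(rank_sql).fetchall()
for row in rank_rows:
    print(row)
```

For AAA the price 2.00 gets rank 4 but dense rank 3, because the duplicated 1.59 is counted twice in the first column and once in the second.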
Answered by a_horse_with_no_name
The usual workaround for systems not supporting window functions is something like this:
select name,
       price,
       (select count(*) + 1          -- + 1 so numbering starts at 1, as in the target table
        from my_table t2
        where t2.name = t1.name      -- this is the "partition by" replacement
          and t2.price < t1.price) as row_number
from my_table t1
order by name, price;
SQLFiddle example: http://sqlfiddle.com/#!2/3b027/2
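As a quick check of the correlated-subquery workaround against the table from the question, here is a minimal sketch that runs it on SQLite through Python's sqlite3 module. SQLite is only a stand-in: whether a given Impala version accepts a correlated subquery in the select list is not verified here. The sketch adds 1 to the count so that numbering starts at 1, and the alias `rnk` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table my_table (name text, price real)")
# Source table from the question.
conn.executemany(
    "insert into my_table values (?, ?)",
    [("AAA", 1.59), ("AAA", 2.00), ("AAA", 0.75),
     ("BBB", 3.48), ("BBB", 2.19), ("BBB", 0.99), ("BBB", 2.50)],
)

# For each row, count the rows of the same name with a strictly lower price.
subquery_sql = """
select name,
       price,
       (select count(*) + 1
          from my_table t2
         where t2.name = t1.name
           and t2.price < t1.price) as rnk
from my_table t1
order by name, price
"""
ranked = conn.execute(subquery_sql).fetchall()
for name, price, rnk in ranked:
    print(rnk, name, price)
```

With distinct prices per name, as in the question's data, the numbering matches the target table; with repeated prices it behaves like rank() rather than row_number().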
Answered by mhoglan
Not really an answer for how to do this with Impala, but there are other SQL-on-Hadoop solutions that already support analytic functions and subqueries. Without those capabilities you are probably going to have to rely on a multi-step process or a UDAF.
I am an architect for InfiniDB.
InfiniDB supports analytic functions and subqueries.
http://infinidb.co
Check out Query 8 in the benchmark from Radiant Advisors; it is similar in style to the query you are after and uses the rank analytic function. Presto is also able to run this style of query, just at a slower (80x) pace: http://radiantadvisors.com/wp-content/uploads/2014/04/RadiantAdvisors_Benchmark_SQL-on-Hadoop_2014Q1.pdf
The query from the benchmark (query 8):
SELECT
    sub.visit_entry_idaction_url,
    sub.name,
    lv.referer_url,
    sum(visit_total_time) total_time,
    count(sub.idvisit),
    RANK() OVER (PARTITION BY sub.visit_entry_idaction_url
                 ORDER BY count(sub.idvisit)) rank_by_visits,
    DENSE_RANK() OVER (PARTITION BY sub.visit_entry_idaction_url
                       ORDER BY count(visit_total_time)) rank_by_time_spent
FROM
    log_visit lv,
    (
        SELECT
            visit_entry_idaction_url,
            name,
            idvisit
        FROM
            log_visit JOIN log_action
            ON (visit_entry_idaction_url = log_action.idaction)
        WHERE
            visit_entry_idaction_url BETWEEN 2301400 AND 2302400
    ) sub
WHERE
    lv.idvisit = sub.idvisit
GROUP BY
    1, 2, 3
ORDER BY
    1, 6, 7;
Results
ENGINE        TIME
------------  --------------
Hive 0.12     Not Executable
Presto 0.57   506.84s
InfiniDB 4.0  6.37s
Impala 1.2    Not Executable