SQL: get the sequence number (rank) of a row within a partition without using the ROW_NUMBER() OVER function

Question

Asked by Andrey Dmitriev

I need to rank rows by partition (or group), i.e. if my source table is:

NAME PRICE
---- -----
AAA  1.59
AAA  2.00
AAA  0.75
BBB  3.48
BBB  2.19
BBB  0.99
BBB  2.50

I would like to get target table:

RANK NAME PRICE
---- ---- -----
1    AAA  0.75
2    AAA  1.59
3    AAA  2.00
1    BBB  0.99
2    BBB  2.19
3    BBB  2.50
4    BBB  3.48

Normally I would use the ROW_NUMBER() OVER function, so in Apache Hive it would be:

select
  row_number() over (partition by NAME order by PRICE) as RANK,
  NAME,
  PRICE
from
  MY_TABLE
;

Unfortunately, Cloudera Impala does not (at the moment) support the ROW_NUMBER() OVER function, so I'm looking for a workaround, preferably one that does not use a UDAF, as it will be politically difficult to convince anyone to deploy it to the server.

Thank you for your help.

Answer 1

Answered by Gordon Linoff

If you can't do it with a correlated subquery, you can still do this with a join:

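-- left join keeps the cheapest row of each name (it has no cheaper match);
-- group by makes the count per (name, price).
-- Caveat: duplicate (name, price) rows collapse into one group and inflate the count.
select t1.name, t1.price,
       coalesce(count(t2.name) + 1, 1) as rnk
from my_table t1 left join
     my_table t2
     on t2.name = t1.name and
        t2.price < t1.price
group by t1.name, t1.price
order by t1.name, t1.price;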

Note that this doesn't exactly do row_number() unless all the prices are distinct for a given name. This formulation is actually equivalent to rank().

For row_number(), you need a unique row identifier.
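
As a rough illustration of why the unique identifier matters, here is a minimal sketch run against an in-memory SQLite database from Python. The `id` column and the sample rows are assumptions made for this example (they are not in the question's table), and the correlated-subquery form is used only because it is compact. Ties on price are broken by `id`, so every row gets its own number.

```python
import sqlite3

# Hypothetical sample data: `id` is an assumed unique row identifier.
conn = sqlite3.connect(":memory:")
conn.execute("create table my_table (id integer primary key, name text, price real)")
conn.executemany(
    "insert into my_table (id, name, price) values (?, ?, ?)",
    [(1, "AAA", 1.59), (2, "AAA", 0.75), (3, "AAA", 1.59), (4, "BBB", 0.99)],
)

# Count the rows of the same name that sort strictly before the current one.
# Equal prices are ordered by id, so no two rows share a number.
ROW_NUMBER_SQL = """
select (select count(*)
        from my_table t2
        where t2.name = t1.name
          and (t2.price < t1.price
               or (t2.price = t1.price and t2.id < t1.id))) + 1 as rn,
       t1.id, t1.name, t1.price
from my_table t1
order by t1.name, t1.price, t1.id
"""

numbered_rows = conn.execute(ROW_NUMBER_SQL).fetchall()
for row in numbered_rows:
    print(row)
```

The two AAA rows priced 1.59 receive consecutive numbers here; without the `id` tie-breaker they would share one.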

By the way, the following is equivalent to dense_rank():

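-- count(distinct t2.price) counts the distinct cheaper prices within the same name;
-- counting distinct t2.name could only ever give 0 or 1, since t2.name = t1.name.
select t1.name, t1.price,
       coalesce(count(distinct t2.price) + 1, 1) as dense_rnk
from my_table t1 left join
     my_table t2
     on t2.name = t1.name and
        t2.price < t1.price
group by t1.name, t1.price
order by t1.name, t1.price;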
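
To make the difference between the rank() and dense_rank() styles concrete, the following sketch runs a self-join on a small table with a tied price, using SQLite from Python. The `id` column and the sample rows are assumptions for this example; grouping by `id` keeps tied rows separate, and the dense variant counts distinct cheaper prices, which is one reasonable reading of the intent rather than a statement about any particular engine.

```python
import sqlite3

# Hypothetical data with a tie on price; `id` is an assumed unique key.
tie_conn = sqlite3.connect(":memory:")
tie_conn.execute("create table my_table (id integer primary key, name text, price real)")
tie_conn.executemany(
    "insert into my_table (id, name, price) values (?, ?, ?)",
    [(1, "AAA", 0.75), (2, "AAA", 1.59), (3, "AAA", 1.59), (4, "AAA", 2.00)],
)

# rnk counts every cheaper row (gaps after ties, like rank());
# dense_rnk counts distinct cheaper prices (no gaps, like dense_rank()).
RANK_SQL = """
select t1.name, t1.price,
       count(t2.price) + 1          as rnk,
       count(distinct t2.price) + 1 as dense_rnk
from my_table t1
left join my_table t2
       on t2.name = t1.name
      and t2.price < t1.price
group by t1.id, t1.name, t1.price
order by t1.name, t1.price, t1.id
"""

ranked_rows = tie_conn.execute(RANK_SQL).fetchall()
for row in ranked_rows:
    print(row)
```

The row priced 2.00 gets rnk 4 but dense_rnk 3, because three rows are cheaper but only two distinct prices are.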

Answer 2

Answered by a_horse_with_no_name

The usual workaround for systems not supporting window functions is something like this:

select name,
       price,
       (select count(*) + 1      -- + 1 so numbering starts at 1 instead of 0
        from my_table t2
        where t2.name = t1.name  -- this is the "partition by" replacement
        and t2.price < t1.price) as rnk  -- tied prices share a number, like rank()
from my_table t1
order by name, price;

SQLFiddle example: http://sqlfiddle.com/#!2/3b027/2
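
For readers who want to try the correlated-subquery idea locally, here is a small sketch that loads the question's sample rows into an in-memory SQLite database from Python and adds 1 to the count so the numbering starts at 1. SQLite is only a stand-in chosen for convenience; whether a given engine accepts a correlated subquery in the select list is something to verify separately.

```python
import sqlite3

# Sample rows taken from the question's source table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("create table my_table (name text, price real)")
demo_conn.executemany(
    "insert into my_table (name, price) values (?, ?)",
    [("AAA", 1.59), ("AAA", 2.00), ("AAA", 0.75),
     ("BBB", 3.48), ("BBB", 2.19), ("BBB", 0.99), ("BBB", 2.50)],
)

# For each row, count the cheaper rows with the same name, then add 1.
SUBQUERY_SQL = """
select (select count(*)
        from my_table t2
        where t2.name = t1.name
          and t2.price < t1.price) + 1 as rnk,
       t1.name,
       t1.price
from my_table t1
order by t1.name, t1.price
"""

target_rows = demo_conn.execute(SUBQUERY_SQL).fetchall()
for row in target_rows:
    print(row)
```

Since all prices are distinct within each name in this data, the result matches the target table from the question.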

Answer 3

Answered by mhoglan

Not really an answer for how to do it with Impala, but there are other SQL-on-Hadoop solutions that already support analytic functions and subqueries. Without those capabilities you will probably have to rely on a multi-step process or some UDAF.

I am an architect for InfiniDB
InfiniDB supports analytical functions and subqueries.
http://infinidb.co

Check out Query 8 in the benchmark from Radiant Advisors; it is similar in style to the query you are after and uses the rank analytic function. Presto can also run this style of query, just at a slower (80x) pace: http://radiantadvisors.com/wp-content/uploads/2014/04/RadiantAdvisors_Benchmark_SQL-on-Hadoop_2014Q1.pdf

The query from the benchmark (query 8)

SELECT
    sub.visit_entry_idaction_url,
    sub.name,
    lv.referer_url,
    sum(visit_total_time) total_time,
    count(sub.idvisit),
    RANK() OVER (PARTITION BY sub.visit_entry_idaction_url
                 ORDER BY count(sub.idvisit)) rank_by_visits,
    DENSE_RANK() OVER (PARTITION BY sub.visit_entry_idaction_url
                       ORDER BY count(visit_total_time)) rank_by_time_spent
FROM
    log_visit lv,
    (
        SELECT
            visit_entry_idaction_url,
            name,
            idvisit
        FROM
            log_visit JOIN log_action
                ON (visit_entry_idaction_url = log_action.idaction)
        WHERE
            visit_entry_idaction_url BETWEEN 2301400 AND 2302400
    ) sub
WHERE
    lv.idvisit = sub.idvisit
GROUP BY
    1, 2, 3
ORDER BY
    1, 6, 7;

Results

Hive 0.12       Not Executable  
Presto 0.57     506.84s  
InfiniDB 4.0    6.37s  
Impala 1.2      Not Executable
