hive sql find the latest record

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/13523049/

Time: 2020-09-01 12:17:10  Source: igfitidea

Tags: sql, group-by, hive, max

Asked by qiulp

The table is:

create table test (
id string,
name string,
age string,
modified string)

The data looks like this:

id    name   age  modified
1     a      10   2011-11-11 11:11:11
1     a      11   2012-11-11 12:00:00
2     b      20   2012-12-10 10:11:12
2     b      20   2012-12-10 10:11:12
2     b      20   2012-12-12 10:11:12
2     b      20   2012-12-15 10:11:12

I want to get the latest record for each id, including every column (id, name, age, modified). For the data above, the correct result is:

1     a      11   2012-11-11 12:00:00
2     b      20   2012-12-15 10:11:12

I do it like this:

insert overwrite table t 
select b.id, b.name, b.age, b.modified 
from (
        select id,max(modified) as modified 
        from test 
        group by id
) a 
left outer join test b on (a.id=b.id  and a.modified=b.modified);

This SQL gets the right result, but it runs slowly on large amounts of data.

**Is there any way to do this without a left outer join?**

Answered by patricksurry

There's a nearly undocumented feature of Hive SQL (I found it in one of their Jira bug reports) that lets you do something like argmax() using struct()s. For example if you have a table like:

test_argmax
id,val,key
1,1,A
1,2,B
1,3,C
1,2,D
2,1,E
2,1,U
2,2,V
2,3,W
2,2,X
2,1,Y

You can do this:

select 
  max(struct(val, key, id)).col1 as max_val,
  max(struct(val, key, id)).col2 as max_key,
  max(struct(val, key, id)).col3 as max_id
from test_argmax
group by id

and get the result:

max_val,max_key,max_id
3,C,1
3,W,2

I think in case of ties on val (the first struct element) it will fall back to comparison on the second column. I also haven't figured out whether there's a neater syntax for getting the individual columns back out of the resulting struct, maybe using named_struct somehow?
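
As a rough sketch, the same trick could be applied to the `test` table from the question. This assumes, as the answer above describes, that `max()` accepts a struct and that `struct()` names its fields `col1`, `col2`, `col3` in argument order. `modified` goes first so that it drives the comparison; the aliases `latest` and `t` are made up for this illustration.

```sql
-- Sketch only: latest row per id from the question's `test` table.
-- The struct is built once in the subquery and unpacked in the outer query.
select
  id,
  latest.col2 as name,      -- second struct field
  latest.col3 as age,       -- third struct field
  latest.col1 as modified   -- first struct field, the one max() compares first
from (
  select id, max(struct(modified, name, age)) as latest
  from test
  group by id
) t;
```

Because `modified` is stored as a fixed-width `yyyy-MM-dd HH:mm:ss` string, comparing it as a string gives the same order as comparing it as a time.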

Answered by Mateo

Hive SQL has a relatively recent feature: analytic functions and the over clause. This should do the job without joins:

select id, name, age, last_modified 
from ( select id, name, age, modified, 
              max( modified) over (partition by id) as last_modified 
       from test ) as sub
where   modified = last_modified 

What's going on here is that the subquery adds an extra column, last_modified, to every row; it holds the latest modified timestamp for that row's id (similar to what group by would compute). The key point is that the subquery still returns one row for each row of the original table, and you then filter on that.
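
To make that concrete, here is the inner query on its own, with the rows it would produce for the sample data in the question written out as comments. This is only an illustration of the intermediate step, not a different method.

```sql
-- Inner query only: every input row is kept, and each one carries
-- the latest `modified` value seen for its id.
select id, name, age, modified,
       max(modified) over (partition by id) as last_modified
from test;

-- For the question's sample data (row order is not guaranteed):
-- id  name  age  modified             last_modified
-- 1   a     10   2011-11-11 11:11:11  2012-11-11 12:00:00
-- 1   a     11   2012-11-11 12:00:00  2012-11-11 12:00:00
-- 2   b     20   2012-12-10 10:11:12  2012-12-15 10:11:12
-- 2   b     20   2012-12-10 10:11:12  2012-12-15 10:11:12
-- 2   b     20   2012-12-12 10:11:12  2012-12-15 10:11:12
-- 2   b     20   2012-12-15 10:11:12  2012-12-15 10:11:12
```

The outer filter `modified = last_modified` then keeps only the second and the last of these rows.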

There is a chance that even this simpler form works, although WHERE is normally evaluated before window functions and so may not be able to see the last_modified alias:

select  id, name, age,  
        max( modified) over (partition by id) last_modified 
from test 
where   modified = last_modified 

By the way, the same code would work in Impala, too.

Answered by Rahul Sharma

This is a slightly different approach from the previous answers.

The example below uses a Hive windowing function to find the latest record:

SELECT t.id
    ,t.name
    ,t.age
    ,t.modified
FROM (
    SELECT id
        ,name
        ,age
        ,modified
        ,ROW_NUMBER() OVER (
            PARTITION BY id ORDER BY unix_timestamp(modified,'yyyy-MM-dd HH:mm:ss') DESC
            ) AS rn
    FROM test
    ) t
WHERE t.rn <= 1;

modified is a string, so it is converted to a timestamp with unix_timestamp(modified,'yyyy-MM-dd HH:mm:ss') before order by is applied. HH is the 24-hour field; hh is the 12-hour field and would not handle a value such as 12:00:00 correctly.
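
A possible variation, offered as a sketch: values in the `yyyy-MM-dd HH:mm:ss` layout are fixed-width and zero-padded, so ordering by the raw string should give the same order as ordering by the converted timestamp. The second sort key on `age` is a hypothetical tie-breaker added here so that the choice between rows with identical `modified` values is deterministic; it is not part of the original answer.

```sql
-- Sketch only: order by the string directly and break ties explicitly.
select id, name, age, modified
from (
  select id, name, age, modified,
         row_number() over (
           partition by id
           order by modified desc, age desc  -- age: hypothetical tie-breaker
         ) as rn
  from test
) t
where rn = 1;
```

Since `row_number()` assigns distinct numbers within each partition, this returns exactly one row per id even when several rows share the latest `modified` value.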

Answered by Mosty Mostacho

Give this a try:

select t1.* from test t1
join (
  select id, max(modified) maxModified from test
  group by id
) s
on t1.id = s.id and t1.modified = s.maxModified

Fiddle here.

Left outer join solution here.

Let us know which one runs faster :)

Answered by SRIRAM

Try this:

select id,name,age,modified from test
 where modified=max(modified)
 group by id,name

Answered by pensz

If you can be sure that, within each id, the row with the largest modified value also has the largest age, try:

select id, name, max(age), max(modified) 
from test
group by id, name

Answered by user 923227

Presume the data is like this:

    id      name    age     modified
    1       a       10      2011-11-11 11:11:11
    1       a       11      2012-11-11 12:00:00
    2       b       23      2012-12-10 10:11:12
    2       b       21      2012-12-10 10:11:12
    2       b       22      2012-12-15 10:11:12
    2       b       20      2012-12-15 10:11:12

then the query from the question gives the following result (notice the repeated 2, b rows with the same date and time):

    1       a       11      2012-11-11 12:00:00
    2       b       22      2012-12-15 10:11:12
    2       b       20      2012-12-15 10:11:12

This query runs an additional group by and is less efficient, but gives the correct result:

    select collect_set(b.id)[0], collect_set(b.name)[0], collect_set(b.age)[0], b.modified
    from
        (select id, max(modified) as modified from test group by id) a
      left outer join
        test b
      on
        (a.id=b.id and a.modified=b.modified)
    group by
      b.modified;

The result of this query is:

    1       a       11      2012-11-11 12:00:00
    2       b       20      2012-12-15 10:11:12

Now if we improve the query a little, it runs only one MapReduce job (MR) instead of three, while keeping the result the same:

    select id, collect_set(name)[0], collect_set(age)[0], max(modified)
    from test 
    group by id;

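Note: this will slow down if your group by field produces large results. Also, collect_set(...)[0] returns whichever element happens to come first, so name and age are not guaranteed to come from the row with the latest modified value.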

Answered by aditya

You can get the required result without using left outer join like this:

select * from test where (id, modified) in(select id, max(modified) from test group by id)

http://sqlfiddle.com/#!2/bfbd5/42
