Note: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/20642654/

Time: 2020-09-01 00:30:27  Source: igfitidea

Get row with max value in Hive/SQL?

Tags: sql, hive

Asked by marc

I'm new to Hive/SQL, and I'm stuck on a fairly simple problem. My data looks like:

+------------+--------------------+-----------------------+
| carrier_id |     meandelay      |     meancanceled      |
+------------+--------------------+-----------------------+
| EV         | 13.795802119653473 | 0.028584251044292006  |
| VX         | 0.450591016548463  | 2.364066193853424E-4  |
| F9         | 10.898001378359766 | 0.00206753962784287   |
| AS         | 0.5071547420965062 | 0.0057404326123128135 |
| HA         | 1.2031093279839498 | 5.015045135406214E-4  |
| 9E         | 8.147899230704216  | 0.03876067292247866   |
| B6         | 9.45383857757506   | 0.003162096314343487  |
| UA         | 8.101511665305816  | 0.005467725574605967  |
| FL         | 0.7265068895709532 | 0.0041141513746490044 |
| WN         | 7.156119279121648  | 0.0057419058192869415 |
| DL         | 4.206288692245839  | 0.005123990066804269  |
| YV         | 6.316802855264404  | 0.029304029304029346  |
| US         | 3.2221527095063736 | 0.007984031936127766  |
| OO         | 6.954715814690328  | 0.02596499362466706   |
| MQ         | 9.74568222216328   | 0.025628100708354324  |
| AA         | 8.720522654298968  | 0.019242775597574157  |
+------------+--------------------+-----------------------+

I want Hive to return the row with the meanDelay max value. I have:

SELECT CAST(MAX(meandelay) as FLOAT) FROM flightinfo;

which indeed returns the max (I use cast because my values are saved as STRING). So then:

SELECT * FROM flightinfo WHERE meandelay = (SELECT CAST(MAX(meandelay) AS FLOAT) FROM flightinfo);

I get the following error:

FAILED: ParseException line 1:44 cannot recognize input near 'select' 'cast' '(' in expression specification

Answered by libHyman

Use the windowing and analytics functions

SELECT carrier_id, meandelay, meancanceled
FROM
 (SELECT carrier_id, meandelay, meancanceled,
         rank() over (order by cast(meandelay as float) desc) as r
  FROM flightinfo) S
WHERE S.r = 1;

This also handles the case where more than one row has the same max value: you'll get all of those rows in the result. If you want just a single row, change rank() to row_number(), or add another term to the order by.
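
As a minimal sketch of the single-row variant: it assumes the flightinfo table from the question, and it uses carrier_id as the tie-breaker, which is my own assumption and not something the answer specifies. It also casts to DOUBLE rather than FLOAT so that the string values are not rounded before being compared.

```sql
-- Sketch: return exactly one row even when several carriers share the maximum.
-- Assumes the flightinfo table from the question.
-- carrier_id as the second ORDER BY term is an illustrative tie-breaker.
SELECT carrier_id, meandelay, meancanceled
FROM
  (SELECT carrier_id, meandelay, meancanceled,
          row_number() OVER (ORDER BY CAST(meandelay AS DOUBLE) DESC, carrier_id) AS rn
   FROM flightinfo) s
WHERE s.rn = 1;
```

With rank(), tied rows all receive 1 and all pass the filter; with row_number(), each row gets a distinct number, so the extra ORDER BY term decides which of the tied rows is kept.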

Answered by dimamah

Use a join instead.

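-- Cast inside MAX: on a STRING column, MAX would compare lexicographically.
-- Both sides are cast to DOUBLE so the equality compares like with like.
SELECT a.*
FROM flightinfo a
LEFT SEMI JOIN (SELECT MAX(CAST(meandelay AS DOUBLE)) AS maxdelay FROM flightinfo) b
ON (CAST(a.meandelay AS DOUBLE) = b.maxdelay)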

Answered by Jerome Banks

You can use the collect_max UDF from Brickhouse (http://github.com/klout/brickhouse) to solve this problem, passing in a value of 1, meaning that you only want the single max value.

select array_index( map_keys( collect_max( carrier_id, meandelay, 1) ), 0 ) from flightinfo;

Also, I've read somewhere that the Hive max UDF does allow you to access other fields on the row, but I think it's easier just to use collect_max.

Answered by BWS

I don't think your sub-query is allowed ...

A quick look here:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries

states:

As of Hive 0.13 some types of subqueries are supported in the WHERE clause. Those are queries where the result of the query can be treated as a constant for IN and NOT IN statements (called uncorrelated subqueries because the subquery does not reference columns from the parent query):
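
Applied to the question, a rewrite along these lines might work on Hive 0.13 or later. This is an untested sketch, not something taken from the Hive manual: it assumes the flightinfo table from the question, swaps = for IN because the quoted passage only mentions IN and NOT IN, and casts before aggregating because meandelay is stored as STRING.

```sql
-- Sketch for Hive 0.13+: uncorrelated subquery used with IN instead of =.
-- Casting inside MAX makes the comparison numeric rather than lexicographic.
-- Assumes the flightinfo table from the question.
SELECT *
FROM flightinfo
WHERE CAST(meandelay AS DOUBLE) IN
      (SELECT MAX(CAST(meandelay AS DOUBLE)) FROM flightinfo);
```

On versions before 0.13, where subqueries in the WHERE clause are not accepted at all, the join or windowing approaches from the other answers remain the options.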
