How to get the row with the maximum value in Hive/SQL?

Question

Asked by marc

I'm new to Hive/SQL, and I'm stuck on a fairly simple problem. My data looks like:

+------------+--------------------+-----------------------+
| carrier_id |     meandelay      |     meancanceled      |
+------------+--------------------+-----------------------+
| EV         | 13.795802119653473 | 0.028584251044292006  |
| VX         | 0.450591016548463  | 2.364066193853424E-4  |
| F9         | 10.898001378359766 | 0.00206753962784287   |
| AS         | 0.5071547420965062 | 0.0057404326123128135 |
| HA         | 1.2031093279839498 | 5.015045135406214E-4  |
| 9E         | 8.147899230704216  | 0.03876067292247866   |
| B6         | 9.45383857757506   | 0.003162096314343487  |
| UA         | 8.101511665305816  | 0.005467725574605967  |
| FL         | 0.7265068895709532 | 0.0041141513746490044 |
| WN         | 7.156119279121648  | 0.0057419058192869415 |
| DL         | 4.206288692245839  | 0.005123990066804269  |
| YV         | 6.316802855264404  | 0.029304029304029346  |
| US         | 3.2221527095063736 | 0.007984031936127766  |
| OO         | 6.954715814690328  | 0.02596499362466706   |
| MQ         | 9.74568222216328   | 0.025628100708354324  |
| AA         | 8.720522654298968  | 0.019242775597574157  |
+------------+--------------------+-----------------------+

I want Hive to return the row with the maximum meandelay value. I have:

SELECT CAST(MAX(meandelay) as FLOAT) FROM flightinfo;

which indeed returns the max (I use CAST because my values are stored as STRING). But when I then try:

SELECT * FROM flightinfo WHERE meandelay = (SELECT CAST(MAX(meandelay) AS FLOAT) FROM flightinfo);

I get the following error:

FAILED: ParseException line 1:44 cannot recognize input near 'select' 'cast' '(' in expression specification

Answer 1

Answered by libHyman

Use the windowing and analytics functions:

SELECT carrier_id, meandelay, meancanceled
FROM
 (SELECT carrier_id, meandelay, meancanceled,
         rank() OVER (ORDER BY CAST(meandelay AS FLOAT) DESC) AS r
  FROM flightinfo) S
WHERE S.r = 1;

This also handles ties: if more than one row has the same maximum value, you get all of those rows in the result. If you want just a single row, change rank() to row_number(), or add another term to the ORDER BY.
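
The single-row variant mentioned above could look roughly like the following. This is a sketch, not the answerer's own code: it combines both suggestions (row_number() plus an extra ORDER BY term), and the choice of carrier_id as the tie-breaker is an assumption made only for illustration.

```sql
-- Sketch: return exactly one row even when several carriers tie for the maximum.
-- carrier_id as the tie-breaker is an illustrative assumption; any column that
-- gives a deterministic order would do.
SELECT carrier_id, meandelay, meancanceled
FROM
 (SELECT carrier_id, meandelay, meancanceled,
         row_number() OVER (ORDER BY CAST(meandelay AS FLOAT) DESC, carrier_id) AS r
  FROM flightinfo) S
WHERE S.r = 1;
```

Without the extra ORDER BY term, row_number() still returns one row, but which of the tied rows you get is not guaranteed.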

Answer 2

Answered by dimamah

Use a join instead:

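SELECT a.*
FROM flightinfo a
LEFT SEMI JOIN
  (SELECT MAX(CAST(meandelay AS FLOAT)) AS maxdelay  -- cast before MAX: on a STRING column MAX compares text, not numbers
   FROM flightinfo) b
ON (CAST(a.meandelay AS FLOAT) = b.maxdelay);        -- compare FLOAT to FLOAT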

Answer 3

Answered by Jerome Banks

You can use the collect_max UDF from Brickhouse (http://github.com/klout/brickhouse) to solve this problem, passing in a value of 1, meaning that you only want the single max value.

select array_index( map_keys( collect_max( carrier_id, meandelay, 1) ), 0 ) from flightinfo;

Also, I've read somewhere that the Hive max UDF does allow you to access other fields on the row, but I think it's easier just to use collect_max.
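
One commonly cited way to reach other fields through the built-in max is to aggregate over a struct whose first field is the value being compared. The query below is a sketch of that idea rather than anything confirmed by the answer: it assumes a Hive version in which max() accepts struct arguments, and it relies on struct() naming its fields col1, col2, and so on.

```sql
-- Sketch, assuming max() accepts a struct in your Hive version.
-- Structs compare field by field, so putting the delay first makes it the sort key.
SELECT m.col2 AS carrier_id,
       m.col1 AS meandelay
FROM (SELECT max(struct(CAST(meandelay AS DOUBLE), carrier_id)) AS m
      FROM flightinfo) t;
```

If several carriers share the maximum delay, this returns only one of them (the one with the largest carrier_id), unlike the rank() approach.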

Answer 4

Answered by BWS

I don't think your subquery is allowed.

A quick look here:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries

states:

As of Hive 0.13 some types of subqueries are supported in the WHERE clause. Those are queries where the result of the query can be treated as a constant for IN and NOT IN statements (called uncorrelated subqueries because the subquery does not reference columns from the parent query):
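
Applied to the question, the rule quoted above suggests that on Hive 0.13 or later the original query might be rewritten with IN in place of the = comparison. This is an untested sketch under that assumption; the cast is moved inside MAX and added on the left-hand side so that both sides are compared as numbers rather than strings.

```sql
-- Sketch, assuming Hive 0.13+ (uncorrelated IN subquery in the WHERE clause).
-- The subquery does not reference the outer query, so it can be treated as a constant.
SELECT *
FROM flightinfo
WHERE CAST(meandelay AS FLOAT) IN
  (SELECT MAX(CAST(meandelay AS FLOAT)) FROM flightinfo);
```

On earlier Hive versions the windowing or join approaches from the other answers remain the options.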