SQL Hadoop Hive 查询：多连接

Question

提问by batman

How can I do sub-selections in Hive? I think I might be making a really obvious mistake that's not so obvious to me...

如何在 Hive 中进行子选择？我想我可能犯了一个对我来说并不那么明显的非常明显的错误......

Error I'm receiving: FAILED: Parse Error: line 4:8 cannot recognize input 'SELECT' in expression specification

我收到的错误： FAILED: Parse Error: line 4:8 cannot recognize input 'SELECT' in expression specification

Here are my three source tables:

这是我的三个源表：

aaa_hit -> [SESSION_KEY, HIT_KEY, URL]
aaa_event-> [SESSION_KEY,HIT_KEY,EVENT_ID]
aaa_session->[SESSION_KEY,REMOTE_ADDRESS]

...and what I want to do is insert the result into a result table like this:

...我想要做的是将结果插入到这样的结果表中：

result -> [url, num_url, event_id, num_event_id, remote_address, num_remote_address]

...where column 1 is the URL, column 3 is the top 1 "event" per URL, and column 5 is the top 1 REMOTE_ADDRESS to visit that URL. (Even columns are "count"s of the previous column.)

...其中第 1 列是 URL，第 3 列是每个 URL 的前 1 个“事件”，第 5 列是访问该 URL 的前 1 个 REMOTE_ADDRESS。（即使列是前一列的“计数”。）

Soooooo... what did I do wrong here?

Sooooooo...我在这里做错了什么？

INSERT OVERWRITE TABLE result2
SELECT url, 
       COUNT(url) AS access_url, 
       (SELECT events.event_id as evt, 
               COUNT(events.event_id) as access_evt
        FROM   aaa_event events 
               LEFT OUTER JOIN aaa_hit hits 
                 ON ( events.hit_key = hit_key )
                 ORDER BY access_evt DESC LIMIT 1), 
       (SELECT sessions.remote_address as remote_address, 
               COUNT(sessions.remote_address) as access_addr
        FROM   aaa_session sessions 
               RIGHT OUTER JOIN aaa_hit hits 
                 ON ( sessions.session_key = session_key )
                 ORDER BY access_addr DESC LIMIT 1) 
FROM   aaa_hit
ORDER  BY access_url DESC;

Thank you so much :)

非常感谢：）

Answer 1

回答by QuinnG

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries

Hive supports subqueries only in the FROM clause.

Hive 仅支持 FROM 子句中的子查询。

You can't use a subquery as a 'column' in Hive.

您不能将子查询用作 Hive 中的“列”。

To work around this you'll want to use that subquery in a FROM clause and JOINto it. (the below won't work, but is the idea)

要解决此问题，您需要在 FROM 子句中使用该子查询并JOIN对其进行处理。（以下不起作用，但这是想法）

SELECT url, 
       COUNT(url) AS access_url, 
       t2.col1, t2.col2 ...
FROM   aaa_hit
JOIN (SELECT events.event_id as evt, 
               COUNT(events.event_id) as access_evt
        FROM   aaa_event events 
               LEFT OUTER JOIN aaa_hit hits 
                 ON ( events.hit_key = hit_key )
                 ORDER BY access_evt DESC LIMIT 1), 
       (SELECT sessions.remote_address as remote_address, 
               COUNT(sessions.remote_address) as access_addr
        FROM   aaa_session sessions 
               RIGHT OUTER JOIN aaa_hit hits 
                 ON ( sessions.session_key = session_key )
                 ORDER BY access_addr DESC LIMIT 1) t2
ON (aaa_hit.THING = t2.THING)

Check out https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joinsfor more information on using JOINs in Hive.

查看https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins有关在 Hive 中使用 JOIN 的更多信息。

Answer 2

回答by Disha Umarwani

You do not have a GroupBy operation, Count is an aggregation. Only count(*) works without a GroupBy clause.

您没有 GroupBy 操作，Count 是一个聚合。只有 count(*) 可以在没有 GroupBy 子句的情况下工作。

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy

SQL Hadoop Hive 查询：多连接

提问by batman

回答by QuinnG

回答by Disha Umarwani

相关推荐

最近更新

标签

SQL Hadoop Hive 查询：多连接

提问by batman

回答by QuinnG

回答by Disha Umarwani

相关推荐

SQL 什么是数据库关闭？

sql ORDER BY 按特定顺序排列多个值？

在 SQL CTE 中调用存储过程

SQL mongo 中的外键？

相关推荐

最近更新

标签