SQL Hadoop Hive 查询:多连接
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/6391755/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Hadoop Hive Query: Multi-join
提问by batman
How can I do sub-selections in Hive? I think I might be making a really obvious mistake that's not so obvious to me...
如何在 Hive 中进行子选择?我想我可能犯了一个对我来说并不那么明显的非常明显的错误......
Error I'm receiving: FAILED: Parse Error: line 4:8 cannot recognize input 'SELECT' in expression specification
我收到的错误: FAILED: Parse Error: line 4:8 cannot recognize input 'SELECT' in expression specification
Here are my three source tables:
这是我的三个源表:
aaa_hit -> [SESSION_KEY, HIT_KEY, URL]
aaa_event-> [SESSION_KEY,HIT_KEY,EVENT_ID]
aaa_session->[SESSION_KEY,REMOTE_ADDRESS]
...and what I want to do is insert the result into a result table like this:
...我想要做的是将结果插入到这样的结果表中:
result -> [url, num_url, event_id, num_event_id, remote_address, num_remote_address]
...where column 1 is the URL, column 3 is the top 1 "event" per URL, and column 5 is the top 1 REMOTE_ADDRESS to visit that URL. (Even columns are "count"s of the previous column.)
...其中第 1 列是 URL,第 3 列是每个 URL 的前 1 个“事件”,第 5 列是访问该 URL 的前 1 个 REMOTE_ADDRESS。(即使列是前一列的“计数”。)
Soooooo... what did I do wrong here?
Sooooooo...我在这里做错了什么?
INSERT OVERWRITE TABLE result2
SELECT url,
COUNT(url) AS access_url,
(SELECT events.event_id as evt,
COUNT(events.event_id) as access_evt
FROM aaa_event events
LEFT OUTER JOIN aaa_hit hits
ON ( events.hit_key = hit_key )
ORDER BY access_evt DESC LIMIT 1),
(SELECT sessions.remote_address as remote_address,
COUNT(sessions.remote_address) as access_addr
FROM aaa_session sessions
RIGHT OUTER JOIN aaa_hit hits
ON ( sessions.session_key = session_key )
ORDER BY access_addr DESC LIMIT 1)
FROM aaa_hit
ORDER BY access_url DESC;
Thank you so much :)
非常感谢 :)
回答by QuinnG
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
Hive supports subqueries only in the FROM clause.
Hive 仅支持 FROM 子句中的子查询。
You can't use a subquery as a 'column' in Hive.
您不能将子查询用作 Hive 中的“列”。
To work around this you'll want to use that subquery in a FROM clause and JOIN
to it. (the below won't work, but is the idea)
要解决此问题,您需要在 FROM 子句中使用该子查询并JOIN
对其进行处理。(以下不起作用,但这是想法)
SELECT url,
COUNT(url) AS access_url,
t2.col1, t2.col2 ...
FROM aaa_hit
JOIN (SELECT events.event_id as evt,
COUNT(events.event_id) as access_evt
FROM aaa_event events
LEFT OUTER JOIN aaa_hit hits
ON ( events.hit_key = hit_key )
ORDER BY access_evt DESC LIMIT 1),
(SELECT sessions.remote_address as remote_address,
COUNT(sessions.remote_address) as access_addr
FROM aaa_session sessions
RIGHT OUTER JOIN aaa_hit hits
ON ( sessions.session_key = session_key )
ORDER BY access_addr DESC LIMIT 1) t2
ON (aaa_hit.THING = t2.THING)
Check out https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joinsfor more information on using JOINs in Hive.
查看https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins有关在 Hive 中使用 JOIN 的更多信息。
回答by Disha Umarwani
You do not have a GroupBy operation, Count is an aggregation. Only count(*) works without a GroupBy clause.
您没有 GroupBy 操作,Count 是一个聚合。只有 count(*) 可以在没有 GroupBy 子句的情况下工作。
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy