Original source: http://stackoverflow.com/questions/7600726/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
hive sql aggregate
Asked by chnet
I have two tables in Hive, t1 and t2.
> describe t1;
date_id string
> describe t2;
messageid string
createddate string
userid int
> select * from t1 limit 3;
2011-01-01 00:00:00
2011-01-02 00:00:00
2011-01-03 00:00:00
> select * from t2 limit 3;
87211389 2011-01-03 23:57:01 13864753
87211656 2011-01-03 23:57:59 13864769
87211746 2011-01-03 23:58:25 13864785
What I want is to count the distinct userids over the previous three days for a given date.
For example, for date 2011-01-03, I want to count distinct userids from 2011-01-01 to 2011-01-03.
For date 2011-01-04, I want to count distinct userids from 2011-01-02 to 2011-01-04.
I wrote the following query, but it does not return the three-day result. It returns the distinct userid count per day instead.
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
`to_date()` and `date_sub()` are date functions in Hive.
That said, the following part does not take effect.
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
EDIT: One solution can be (but it is super slow):
SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
SELECT * FROM t1 LEFT OUTER JOIN t2
WHERE
(date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
)
) t3
GROUP by to_date(t3.date_id);
UPDATE: Thanks for all the answers. They are good, but Hive is a bit different from SQL, and unfortunately they cannot be used in Hive.
My current solution is to use UNION ALL.
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
Then I do group by and count. In this way, I can get what I want.
Although it is not elegant, it is much more efficient than a cross join.
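The UNION ALL idea above can be sketched outside Hive to check the logic. The following is a minimal Python illustration, not the asker's actual job: the function name `three_day_distinct_users` and the sample data are made up for this example. Each message is emitted once per offset (0, 1, 2 days), which mirrors the three UNION ALL branches, and the result is then grouped by date and counted distinctly.

```python
from collections import defaultdict
from datetime import date, timedelta


def three_day_distinct_users(calendar_dates, messages, window_days=3):
    """Count distinct user ids per calendar date over a trailing window.

    calendar_dates: iterable of datetime.date (plays the role of t1.date_id)
    messages: iterable of (created_date, userid) pairs (plays the role of t2)
    """
    wanted = set(calendar_dates)
    users_by_date = defaultdict(set)
    for created, userid in messages:
        # One "UNION ALL branch" per offset: a message created on day d
        # contributes to the windows ending on d, d+1, ..., d+window_days-1.
        for offset in range(window_days):
            target = created + timedelta(days=offset)
            if target in wanted:  # the equality join against t1
                users_by_date[target].add(userid)
    # GROUP BY date, COUNT(DISTINCT userid)
    return {d: len(users_by_date[d]) for d in sorted(wanted)}


# Hypothetical sample data, chosen only for illustration.
calendar = [date(2011, 1, d) for d in range(1, 6)]
messages = [
    (date(2011, 1, 1), 10),
    (date(2011, 1, 2), 10),
    (date(2011, 1, 3), 20),
    (date(2011, 1, 4), 30),
]
result = three_day_distinct_users(calendar, messages)
for day, count in result.items():
    print(day.isoformat(), count)
```

Unlike the inner joins in the Hive version, this sketch also reports dates with no matching messages, with a count of 0.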
Answered by MatBailie
The following should work in standard SQL...
SELECT
to_date(t1.date_id),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) >= date_sub(to_date(t1.date_id), 2)
AND to_date(t2.createddate) < date_add(to_date(t1.date_id), 1)
GROUP BY
to_date(t1.date_id)
It will, however, be slow, because you are storing dates as strings and then using to_date() to convert them to dates. This means that indexes can't be used, and the SQL engine can't do anything clever to reduce the effort being expended.
As a result, every possible combination of rows needs to be compared. If you have 100 entries in T1 and 10,000 entries in T2, your SQL engine is processing a million combinations.
If you store these values as dates, you don't need to_date(). And if you index the dates, the SQL engine can quickly home in on the range of dates being specified.
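To make the cost argument concrete, here is a rough Python sketch under stated assumptions. It is not how any SQL engine is implemented: the two functions and the synthetic data are hypothetical. The first compares every date with every message, the second looks messages up through a dictionary keyed by day, which stands in for an index.

```python
from collections import defaultdict
from datetime import date, timedelta


def nested_loop_counts(dates, messages, window_days=3):
    """Compare every date with every message (no usable index)."""
    comparisons = 0
    users = {d: set() for d in dates}
    for d in dates:
        start = d - timedelta(days=window_days - 1)
        for created, userid in messages:
            comparisons += 1
            if start <= created <= d:
                users[d].add(userid)
    return {d: len(u) for d, u in users.items()}, comparisons


def indexed_counts(dates, messages, window_days=3):
    """Look messages up by day through a dict that stands in for an index."""
    by_day = defaultdict(set)
    for created, userid in messages:
        by_day[created].add(userid)
    lookups = 0
    counts = {}
    for d in dates:
        seen = set()
        for offset in range(window_days):
            lookups += 1
            seen |= by_day.get(d - timedelta(days=offset), set())
        counts[d] = len(seen)
    return counts, lookups


# Synthetic data: 100 dates and 10,000 messages, as in the text above.
dates = [date(2011, 1, 1) + timedelta(days=i) for i in range(100)]
messages = [(dates[i % 100], i % 250) for i in range(10_000)]

slow_counts, comparisons = nested_loop_counts(dates, messages)
fast_counts, lookups = indexed_counts(dates, messages)
print(comparisons, lookups, slow_counts == fast_counts)
```

Both functions return the same counts. The difference is only in how many row comparisons or lookups are needed to get there.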
NOTE: The format of the ON clause means that you do not need to round t2.createddate down to a daily value.
EDIT: Why your code didn't work...
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
This joins t1 to t2 with an ON clause of (to_date(t2.createddate) = to_date(t1.date_id)). As a result, every joined row has a t2.createddate that falls on the same day as t1.date_id (with a LEFT OUTER JOIN it could also be NULL where there is no match).
The WHERE clause allows a much wider range (3 days). But the ON clause of the JOIN has already restricted your data down to a single day.
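The effect described above can be reproduced with a few rows. This is a small Python sketch with made-up data and helper names (`count_distinct`, `same_day`, `in_window`), not Hive code. It applies the same-day join condition first and the 3-day filter second, then compares that with using the 3-day condition as the join condition itself.

```python
from datetime import date, timedelta

# Hypothetical sample rows: t1 holds dates, t2 holds (created_date, userid).
t1 = [date(2011, 1, 3), date(2011, 1, 4)]
t2 = [
    (date(2011, 1, 1), 1),
    (date(2011, 1, 2), 2),
    (date(2011, 1, 3), 3),
    (date(2011, 1, 4), 3),
]


def same_day(d, created):
    """The original ON clause: equality on the day."""
    return created == d


def in_window(d, created):
    """The original WHERE clause: strictly after d-3, up to and including d."""
    return d - timedelta(days=3) < created <= d


def always(d, created):
    return True


def count_distinct(join_condition, where_condition=always):
    """Count distinct users per date for rows passing both conditions."""
    counts = {}
    for d in t1:
        users = {u for c, u in t2 if join_condition(d, c) and where_condition(d, c)}
        counts[d] = len(users)
    return counts


original = count_distinct(same_day, in_window)  # join on equality, then filter
fixed = count_distinct(in_window)               # join on the 3-day range
print(original)
print(fixed)
```

With the equality join, the window filter has nothing left to widen, so each date only ever sees its own day's users.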
The example I gave above simply takes your WHERE clause and puts it in place of the old ON clause.
EDIT
Hive doesn't allow <= and >= in the ON clause? Are you really fixed into using Hive?
If you really are, what about BETWEEN?
SELECT
to_date(t1.date_id),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) BETWEEN date_sub(to_date(t1.date_id), 2) AND to_date(t1.date_id)
GROUP BY
to_date(t1.date_id)
Alternatively, refactor your table of dates to enumerate the dates you want to include...
TABLE t1 (calendar_date, inclusive_date) =
{ 2011-01-03, 2011-01-01
2011-01-03, 2011-01-02
2011-01-03, 2011-01-03
2011-01-04, 2011-01-02
2011-01-04, 2011-01-03
2011-01-04, 2011-01-04
2011-01-05, 2011-01-03
2011-01-05, 2011-01-04
2011-01-05, 2011-01-05 }
SELECT
to_date(t1.calendar_date),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) = to_date(t1.inclusive_date)
GROUP BY
to_date(t1.calendar_date)
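The enumerated date table above does not have to be written by hand. As a minimal sketch (the helper name `expand_calendar` is hypothetical, and loading the rows into Hive is left out), the (calendar_date, inclusive_date) pairs could be generated like this:

```python
from datetime import date, timedelta


def expand_calendar(calendar_dates, window_days=3):
    """Return (calendar_date, inclusive_date) pairs for a trailing window."""
    rows = []
    for d in calendar_dates:
        # Oldest day in the window first, the calendar date itself last.
        for offset in range(window_days - 1, -1, -1):
            rows.append((d, d - timedelta(days=offset)))
    return rows


rows = expand_calendar([date(2011, 1, 3), date(2011, 1, 4), date(2011, 1, 5)])
for calendar_date, inclusive_date in rows:
    print(calendar_date.isoformat(), inclusive_date.isoformat())
```

Because every pair is an exact date, the join against t2 becomes a plain equality join, which is the kind of join condition the asker reports Hive accepting.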
Answered by ???
You need a subquery.
Try something like this (I cannot test it because I don't have Hive):
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE t2.messageid in
(
select t2.messageid from t2 where
date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND
to_date(t2.createddate) <= to_date(t1.date_id)
)
GROUP by to_date(t1.date_id);
The key is that, with the subquery, the right records in t2 are selected for each date in t1.
EDIT:
To force the subquery into the FROM clause, you could try this:
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN
(select userid, createddate from t2 where
date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND
to_date(t2.createddate) <= to_date(t1.date_id)
) as t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
GROUP by to_date(t1.date_id);
But I don't know if it would work.
Answered by Russell Hart
I am assuming that t1 is used to define the 3-day period. I suspect the puzzling approach is due to Hive's shortcomings. This allows you to have an arbitrary number of 3-day periods. Try the following 2 queries:
SELECT substring(t1.date_id,1,10), count(distinct t2.userid)
FROM t1
JOIN t2
ON substring(t2.createddate,1,10) >= date_sub(substring(t1.date_id,1,10), 2)
AND substring(t2.createddate,1,10) <= substring(t1.date_id,1,10)
GROUP BY t1.date_id
--or--
SELECT substring(t1.date_id,1,10), count(distinct t2.userid)
FROM t1
JOIN t2
ON t2.createddate like concat(substring(t1.date_id, 1, 10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 1), 1, 10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 2), 1, 10), '%')
GROUP BY t1.date_id
The latter minimizes the function calls on the t2 table. I am also assuming that t1 is the smaller of the two. substring should return the same result as to_date. According to the documentation, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions, to_date returns a string data type. Support for date data types seems minimal, but I am not familiar with Hive.
Answered by Bogdan Sahlean
1. I am not familiar with Hive.
2. You could try using a subquery in the FROM clause:
SELECT T1.date_id, COUNT(DISTINCT X.userid) AS UserCount
FROM T1
LEFT OUTER JOIN
(
-- one row per (day, user), to shrink T2 before the join
SELECT TO_DATE(createddate) AS date_id, userid
FROM T2
GROUP BY TO_DATE(createddate), userid
-- window runs from date_id - 2 to date_id: three days inclusive
) X ON DATE_SUB(TO_DATE(T1.date_id), 2) <= X.date_id AND X.date_id <= TO_DATE(T1.date_id)
GROUP BY T1.date_id;