hive sql aggregate

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/7600726/

Date: 2020-09-01 12:17:37  Source: igfitidea

sql group-by hive

Asked by chnet

I have two tables in Hive, t1 and t2:

>describe t1;
>date_id    string

>describe t2;
>messageid string,
 createddate string,
 userid int

> select * from t1 limit 3;        
> 2011-01-01 00:00:00 
  2011-01-02 00:00:00 
  2011-01-03 00:00:00 

> select * from t2 limit 3;
87211389    2011-01-03 23:57:01 13864753
87211656    2011-01-03 23:57:59 13864769
87211746    2011-01-03 23:58:25 13864785

What I want is to count the distinct userid values over the three days up to and including a given date.
For example, for date 2011-01-03, I want to count distinct userid from 2011-01-01 to 2011-01-03.
For date 2011-01-04, I want to count distinct userid from 2011-01-02 to 2011-01-04.

I wrote the following query, but it does not return a three-day result. It returns the distinct userid count per day instead.

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);

`to_date()` and `date_sub()` are date functions in Hive.
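
As a quick illustration of what these functions return, here is a minimal sketch. The literal values are borrowed from the sample rows above, and a constant SELECT without a FROM clause is assumed to be available (it needs a reasonably recent Hive version).

```sql
-- to_date() keeps only the date part of a timestamp string;
-- date_sub()/date_add() shift a date by a whole number of days.
SELECT
  to_date('2011-01-03 23:57:01'),  -- 2011-01-03
  date_sub('2011-01-03', 2),       -- 2011-01-01
  date_add('2011-01-03', 1);       -- 2011-01-04
```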

That said, the following part does not take effect.

WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 

EDIT: One solution can be (but it is super slow):

SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
 SELECT * FROM t1  LEFT OUTER JOIN t2
 WHERE 
 (date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
  AND to_date(t2.createddate) <= to_date(t1.date_id)
 )
) t3 
GROUP by to_date(t3.date_id);

UPDATE: Thanks for all the answers. They are good.
But Hive is a bit different from SQL. Unfortunately, they cannot be used in Hive. My current solution is to use UNION ALL.

 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
 UNION ALL
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
 UNION ALL
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))

Then I do a GROUP BY and a count. In this way, I can get what I want.
Although it is not elegant, it is much more efficient than a cross join.
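
Put together, the UNION ALL plus GROUP BY approach could look roughly like the sketch below. It assumes the t1 and t2 tables from the question; the subquery alias u and the output column name user_count are made up for illustration. Each branch is an equi-join, which Hive accepts, and the three branches together cover the given day and the two days before it.

```sql
SELECT
  u.date_id,
  count(DISTINCT u.userid) AS user_count
FROM (
  -- messages created on the date itself
  SELECT to_date(t1.date_id) AS date_id, t2.userid
  FROM t1 JOIN t2
    ON (to_date(t1.date_id) = to_date(t2.createddate))
  UNION ALL
  -- messages created one day earlier
  SELECT to_date(t1.date_id) AS date_id, t2.userid
  FROM t1 JOIN t2
    ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
  UNION ALL
  -- messages created two days earlier
  SELECT to_date(t1.date_id) AS date_id, t2.userid
  FROM t1 JOIN t2
    ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
) u
GROUP BY u.date_id;
```

A user who posts on more than one day of the window appears in several branches, which is why the outer query needs count(DISTINCT ...) rather than a plain count.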

Answered by MatBailie

The following should work in standard SQL...

SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON  to_date(t2.createddate) >= date_sub(to_date(t1.date_id), 2)
    AND to_date(t2.createddate) <  date_add(to_date(t1.date_id), 1)
GROUP BY
  to_date(t1.date_id)

It will, however, be slow, because you are storing dates as strings and then using to_date() to convert them to dates. This means that indexes can't be used, and the SQL engine can't do anything clever to reduce the effort being expended.

As a result, every possible combination of rows needs to be compared. If you have 100 entries in T1 and 10,000 entries in T2, your SQL engine is processing a million combinations.

If you store these values as dates, you don't need to_date(). And if you index the dates, the SQL engine can quickly home in on the range of dates being specified.
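
In Hive, one way to approximate this is to compute the day value once and store it, instead of calling to_date() on every row in every query. The sketch below is only an illustration under that assumption: t2_daily and created_day are hypothetical names, not part of the original schema.

```sql
-- Hypothetical helper table: one row per (day, user), computed once.
-- Later queries can compare created_day directly, without to_date() on t2,
-- and have fewer rows to scan than the raw message table.
CREATE TABLE t2_daily AS
SELECT DISTINCT
  to_date(createddate) AS created_day,
  userid
FROM t2;
```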

NOTE: The format of the ON clause means that you do not need to round t2.createddate down to a daily value.


EDIT: Why your code didn't work...

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);

This joins t1 to t2 with an ON clause of (to_date(t2.createddate) = to_date(t1.date_id)). Whether the join is an inner JOIN or a LEFT OUTER JOIN, the values in t2.createddate MUST now either fall on the same day as t1.date_id or, for an outer join only, be NULL (no matches).

The WHERE clause allows a much wider range (3 days). But the ON clause of the JOIN has already restricted your data down to a single day.

The example I gave above simply takes your WHERE clause and puts it in place of the old ON clause.

EDIT

Hive doesn't allow <= and >= in the ON clause? Are you really tied to using Hive?

If you really are, what about BETWEEN?

SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    -- BETWEEN is inclusive at both ends, so the upper bound is the date itself
    ON to_date(t2.createddate) BETWEEN date_sub(to_date(t1.date_id), 2) AND to_date(t1.date_id)
GROUP BY
  to_date(t1.date_id)



Alternatively, refactor your table of dates to enumerate the dates you want to include...

TABLE t1 (calendar_date, inclusive_date) =
{ 2011-01-03, 2011-01-01
  2011-01-03, 2011-01-02
  2011-01-03, 2011-01-03

  2011-01-04, 2011-01-02
  2011-01-04, 2011-01-03
  2011-01-04, 2011-01-04

  2011-01-05, 2011-01-03
  2011-01-05, 2011-01-04
  2011-01-05, 2011-01-05 }

SELECT
  to_date(t1.calendar_date),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON to_date(t2.createddate) = to_date(t1.inclusive_date)
GROUP BY
  to_date(t1.calendar_date)
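
If the dates table only has one row per date, as in the question, the enumerated version could be generated rather than typed by hand. The following is a sketch assuming Hive's LATERAL VIEW and explode() are available; t1_expanded, offsets and days_back are hypothetical names chosen for illustration.

```sql
-- For each date in t1, emit three rows: the date itself and the two days before it.
CREATE TABLE t1_expanded AS
SELECT
  to_date(t1.date_id)                             AS calendar_date,
  date_sub(to_date(t1.date_id), offsets.days_back) AS inclusive_date
FROM t1
LATERAL VIEW explode(array(0, 1, 2)) offsets AS days_back;
```

The query above would then read from t1_expanded instead of t1, and the join stays a plain equality on inclusive_date, which Hive supports.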

Answered by ???

You need a subquery:

Try something like this (I cannot test it because I don't have Hive):

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE t2.messageid in 
    (
    select t2.messageid from t2 where 
    date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
    AND 
    to_date(t2.createddate) <= to_date(t1.date_id) 
   )
GROUP by to_date(t1.date_id);

The key is that, with the subquery, the right records in t2 are selected FOR EACH date in t1.

EDIT:

To force the subquery into the FROM clause, you could try this:

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN 

(select userid, createddate  from t2 where 

    date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
    AND 
    to_date(t2.createddate) <= to_date(t1.date_id) 
) as t2

ON (to_date(t2.createddate) = to_date(t1.date_id))  

GROUP by to_date(t1.date_id);

But I don't know whether it would work.

Answered by Russell Hart

I am assuming that t1 is used to define the 3-day period. I suspect the puzzling approach is due to Hive's shortcomings. This allows you to have an arbitrary number of 3-day periods. Try the following 2 queries:

SELECT substring(t1.date_id,1,10), count(distinct t2.userid) 
FROM t1 
JOIN t2 
ON substring(t2.createddate,1,10) >= date_sub(substring(t1.date_id,1,10), 2) 
AND substring(t2.createddate,1,10) <=  substring(t1.date_id,1,10) 
GROUP BY t1.date_id 

--or--

SELECT substring(t1.date_id,1,10), count(distinct t2.userid) 
FROM t1 
JOIN t2 
ON t2.createddate like concat(substring(t1.date_id, 1, 10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 1), 1, 10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 2), 1, 10), '%')
GROUP BY t1.date_id 

The latter minimizes the function calls on the t2 table. I am also assuming that t1 is the smaller of the two tables. substring should return the same result as to_date: according to the documentation, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions, to_date returns a string data type. Support for date data types seems minimal, but I am not familiar with Hive.
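
To make the substring/to_date equivalence concrete, here is a small sketch using one of the sample createddate values from the question. It assumes the strings are always in 'yyyy-MM-dd HH:mm:ss' form and that a constant SELECT without a FROM clause is available.

```sql
-- Both expressions yield the date part of the timestamp string.
SELECT
  substring('2011-01-03 23:57:01', 1, 10),  -- 2011-01-03
  to_date('2011-01-03 23:57:01');           -- 2011-01-03
```

As far as I know, newer Hive versions return a date type from to_date rather than a string, so the two results may differ in type even though they show the same value.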

Answered by Bogdan Sahlean

1. I am not familiar with Hive.

2. You could try using a subquery in the FROM clause:

-- DISTINCT is needed because the same user can appear on several days of the window
SELECT  T1.date_id, COUNT(DISTINCT X.userid) AS UserCount
FROM    T1
LEFT OUTER JOIN
(
    -- one row per (day, user)
    SELECT  TO_DATE(createddate) AS date_id, userid
    FROM    T2
    GROUP BY TO_DATE(createddate), userid
) X ON DATE_SUB(TO_DATE(T1.date_id),2) <= X.date_id AND X.date_id <= TO_DATE(T1.date_id)
GROUP BY T1.date_id;