Original source: http://stackoverflow.com/questions/16444070/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to get array/bag of elements from Hive group by operator?
Asked by Anuroop
I want to group by a given field and get, for each group, the values of another field collected together. Below is an example of what I am trying to achieve:-
Imagine a table named 'sample_table' with two columns as below:-
F1 F2
001 111
001 222
001 123
002 222
002 333
003 555
I want to write a Hive query that will give the below output:-
001 [111, 222, 123]
002 [222, 333]
003 [555]
In Pig, this can be very easily achieved by something like this:-
grouped_relation = GROUP sample_table BY F1;
Can somebody please suggest whether there is a simple way to do this in Hive? The only option I can think of is writing a User Defined Function (UDF), but that may be very time-consuming.
Answered by Daniel Koverman
The built-in aggregate function collect_set (documented here) gets you almost what you want. It would actually work on your example input:
SELECT F1, collect_set(F2)
FROM sample_table
GROUP BY F1
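To make the behaviour of that query concrete, here is a minimal Python sketch that emulates GROUP BY F1 with collect_set(F2) on the sample rows from the question. It is only an illustration of the semantics, not Hive code: the function name collect_set_by_key is made up for this example, and unlike this sketch, Hive makes no promise about the order of elements inside the returned array.

```python
from collections import defaultdict

# Sample rows (F1, F2) copied from the question's sample_table.
rows = [
    ("001", "111"), ("001", "222"), ("001", "123"),
    ("002", "222"), ("002", "333"),
    ("003", "555"),
]

def collect_set_by_key(rows):
    """Emulate GROUP BY F1 with collect_set(F2): distinct F2 values per F1.

    Hypothetical helper for illustration only; first-seen order is kept
    here, which Hive itself does not guarantee.
    """
    groups = defaultdict(dict)  # dict keys act as an insertion-ordered set
    for f1, f2 in rows:
        groups[f1][f2] = None
    return {f1: list(values) for f1, values in groups.items()}

result = collect_set_by_key(rows)
for f1, values in result.items():
    print(f1, values)
```

Since the sample data has no repeated (F1, F2) pair, every F2 value survives, which is why collect_set happens to give the desired output here.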
Unfortunately, it also removes duplicate elements, and I imagine this isn't your desired behavior. I find it odd that collect_set exists but there is no version that keeps duplicates. Someone else apparently thought the same thing. It looks like the top and second answers there will give you the UDAF you need.
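The difference only shows up once the input has a repeated pair. The Python sketch below emulates both behaviours side by side. The rows (including the duplicated ("001", "111") row) and the function names are hypothetical and chosen only for illustration. As far as I know, newer Hive releases (0.13.0 and later) also ship a built-in collect_list aggregate that keeps duplicates, so it is worth checking your version's documentation before writing a UDAF.

```python
from collections import defaultdict

# Hypothetical input: the pair ("001", "111") appears twice on purpose.
rows = [
    ("001", "111"), ("001", "111"), ("001", "222"),
    ("002", "333"),
]

def group_keep_duplicates(rows):
    """Emulate a duplicate-preserving collect: every F2 value is kept."""
    groups = defaultdict(list)
    for f1, f2 in rows:
        groups[f1].append(f2)
    return dict(groups)

def group_drop_duplicates(rows):
    """Emulate collect_set: each distinct F2 value is kept once per F1."""
    groups = defaultdict(dict)  # dict keys act as an insertion-ordered set
    for f1, f2 in rows:
        groups[f1][f2] = None
    return {f1: list(values) for f1, values in groups.items()}

with_duplicates = group_keep_duplicates(rows)
without_duplicates = group_drop_duplicates(rows)
print(with_duplicates)
print(without_duplicates)
```

If the repeated values carry meaning (for example, counts of events), the duplicate-preserving form is the one to use.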
Answered by ellaqezi
collect_set actually works as expected, since a set is by definition a collection of well-defined and distinct objects, i.e. each object occurs exactly once or not at all within a set.