Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute the content to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/16444070/

Date: 2020-09-01 15:08:18  Source: igfitidea

How to get array/bag of elements from Hive group by operator?

Tags: sql, hadoop, hive, apache-pig, bigdata

Asked by Anuroop

I want to group by a given field and get, for each group, the values of the grouped rows. Below is an example of what I am trying to achieve:

Imagine a table named 'sample_table' with two columns, as below:

F1  F2
001 111
001 222
001 123
002 222
002 333
003 555

I want to write a Hive query that gives the output below:

001 [111, 222, 123]
002 [222, 333]
003 [555]

In Pig, this can be achieved very easily with something like this:

grouped_relation = GROUP sample_table BY F1;

Can somebody please suggest a simple way to do this in Hive? All I can think of is writing a User Defined Function (UDF), but that may be a very time-consuming option.

Answered by Daniel Koverman

The built-in aggregate function collect_set (documented here) gets you almost what you want. It would actually work on your example input:

SELECT F1, collect_set(F2)
FROM sample_table
GROUP BY F1

Unfortunately, it also removes duplicate elements, and I imagine this isn't your desired behavior. I find it odd that collect_set exists but there is no version that keeps duplicates. Someone else apparently thought the same thing. It looks like the top and second answers there will give you the UDAF you need.
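
For readers on newer Hive releases, here is a minimal sketch, assuming Hive 0.13.0 or later, where the built-in aggregate collect_list is available. It collects values the same way collect_set does but keeps duplicates, so a custom UDAF should not be needed on those versions. The table and column names are the ones from the question; the alias F2_values is made up for illustration.

```sql
-- Assumes Hive 0.13.0+ (collect_list is not available in earlier releases).
-- Returns one row per F1 with an array of all its F2 values, duplicates included.
SELECT F1,
       collect_list(F2) AS F2_values  -- F2_values is an illustrative alias
FROM sample_table
GROUP BY F1;
```

The order of elements inside each array is not guaranteed, so it may differ from the order of the rows in the table.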

Answered by ellaqezi

collect_set actually works as expected, since a set is by definition a collection of well-defined and distinct objects, i.e. each object occurs exactly once or not at all within a set.
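
The set semantics can be seen on a small example. This is only a sketch: the three inline rows are hypothetical (they are not from the question's table) and deliberately repeat the value 111. It assumes Hive 0.13.0 or later, where SELECT without a FROM clause is allowed.

```sql
-- Hypothetical inline rows: F1 = '001' appears with F2 = 111 twice and 222 once.
SELECT t.F1,
       collect_set(t.F2)       AS distinct_values,  -- 111 appears only once here
       size(collect_set(t.F2)) AS distinct_count,   -- 2 distinct values
       count(*)                AS row_count         -- 3 input rows
FROM (
  SELECT '001' AS F1, 111 AS F2
  UNION ALL
  SELECT '001' AS F1, 111 AS F2
  UNION ALL
  SELECT '001' AS F1, 222 AS F2
) t
GROUP BY t.F1;
```

The gap between distinct_count and row_count is exactly the duplicates that collect_set drops.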
