scala Spark Group By Key to (Key,List) Pair

Note: this page reproduces a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/34344455/

Date: 2020-10-22 07:52:35  Source: igfitidea

scala apache-spark

Asked by manjam

I am trying to group some data by key where the value would be a list:

Sample data:

A 1
A 2
B 1
B 2

Expected result:

(A,(1,2))
(B,(1,2))

I am able to do this with the following code:

data.groupByKey().mapValues(List(_))

The problem is that when I then try to do a Map operation like the following:

groupedData.map((k,v) => (k,v(0))) 

It tells me I have the wrong number of parameters.

If I try:

groupedData.map(s => (s(0),s(1)))

It tells me that "(Any,List(Iterable(Any)) does not take parameters"

No clue what I am doing wrong. Is my grouping wrong? What would be a better way to do this?

Scala answers only please. Thanks!!

Answered by zero323

You're almost there. Just replace List(_) with _.toList:

data.groupByKey.mapValues(_.toList)
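
To see why this one change matters, here is a minimal sketch in plain Scala, with no Spark involved. The name `grouped` is a hypothetical stand-in for a single value produced by groupByKey, which I am assuming behaves like an ordinary Iterable of the grouped values.

```scala
object ListVsToList {
  // Hypothetical stand-in for one value produced by groupByKey:
  // an Iterable holding the values that share a key.
  val grouped: Iterable[Int] = Iterable(1, 2)

  // List(_) builds a new list whose single element is the whole Iterable.
  val wrapped: List[Iterable[Int]] = List(grouped)

  // _.toList copies the Iterable's elements into a list.
  val converted: List[Int] = grouped.toList

  def main(args: Array[String]): Unit = {
    println(wrapped.length) // prints 1
    println(converted)      // prints List(1, 2)
  }
}
```

This is also where the List(Iterable(Any)) in the error message comes from: with List(_), v(0) would hand back the whole Iterable rather than the first value.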

Answered by Shadowlands

When you write an anonymous inline function of the form

ARGS => OPERATION

the entire part before the arrow (=>) is taken as the argument list. So, in the case of

(k, v) => ...

the compiler takes that to mean a function that takes two arguments. In your case, however, you have a single argument which happens to be a tuple (here, a Tuple2, or a Pair; more fully, you appear to have a collection of Pair[Any,List[Any]]). There are a couple of ways to get around this. You might expect that wrapping the pair in an extra set of parentheses would mark it as the single expected argument for the function:

((x, y)) => ...

but Scala 2 does not compile this form, because a tuple cannot be destructured directly in a lambda's parameter list. Instead, you can write the anonymous function in the form of a partial function that matches on tuples:


groupedData.map { case (k, v) => (k, v(0)) }

Finally, you can simply go with a single specified argument, as per your last attempt, but - realising it is a tuple - reference the specific field(s) within the tuple that you need:

groupedData.map(s => (s._2(0),s._2(1)))  // The key is s._1, and the value list is s._2
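
For comparison, here is a small sketch of the two working forms side by side, pairing each key with the first element of its list. It uses a plain Scala List as a hypothetical stand-in for groupedData (an assumption on my part, so it runs without Spark); map takes the same kind of function on an RDD of pairs.

```scala
object TupleLambdas {
  // Hypothetical stand-in for groupedData: pairs of (key, list of values).
  val groupedData: List[(String, List[Int])] =
    List(("A", List(1, 2)), ("B", List(1, 2)))

  // Pattern-matching anonymous function: braces plus `case` destructure the tuple.
  val firstByCase: List[(String, Int)] =
    groupedData.map { case (k, v) => (k, v(0)) }

  // Single argument, with the tuple's fields read through _1 and _2.
  val firstByField: List[(String, Int)] =
    groupedData.map(s => (s._1, s._2(0)))

  def main(args: Array[String]): Unit = {
    println(firstByCase)  // prints List((A,1), (B,1))
    println(firstByField) // prints List((A,1), (B,1))
  }
}
```

Both forms give the same result; the `case` version tends to read better once the tuple has more than a couple of fields.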