Scala: Aggregating multiple columns with a custom function in Spark

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37737843/

Posted: 2020-10-22 08:21:51 · Source: igfitidea

Aggregating multiple columns with custom function in Spark

Tags: scala, apache-spark, dataframe, apache-spark-sql, orc

Asked by anthonybell

I was wondering if there is some way to specify a custom aggregation function for Spark DataFrames over multiple columns.

I have a table like this of the type (name, item, price):

john | tomato | 1.99
john | carrot | 0.45
bill | apple  | 0.99
john | banana | 1.29
bill | taco   | 2.59

I would like to aggregate the item and its cost for each person into a list like this:

john | (tomato, 1.99), (carrot, 0.45), (banana, 1.29)
bill | (apple, 0.99), (taco, 2.59)

Is this possible in dataframes? I recently learned about collect_list, but it appears to only work for one column.

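For illustration (a minimal sketch, not part of the original question; the column names food and price are assumed to mirror the sample data above), a single-column collect_list keeps only that column's values, so the pairing with the prices is lost:

import org.apache.spark.sql.functions.collect_list
import spark.implicits._  // assumes a SparkSession named `spark`; older code uses sqlContext.implicits._

val df = Seq(
  ("john", "tomato", 1.99),
  ("john", "carrot", 0.45),
  ("bill", "apple", 0.99),
  ("john", "banana", 1.29),
  ("bill", "taco", 2.59)
).toDF("name", "food", "price")

// Only the food names survive the aggregation; the prices are gone:
//   john -> [tomato, carrot, banana]
//   bill -> [apple, taco]
df.groupBy("name").agg(collect_list("food") as "foods").show(false)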

Accepted answer by David Griffin

The easiest way to do this as a DataFrame is to first collect two lists, and then use a UDF to zip the two lists together. Something like:

import org.apache.spark.sql.functions.{col, collect_list, udf}
import sqlContext.implicits._

val zipper = udf[Seq[(String, Double)], Seq[String], Seq[Double]](_.zip(_))

val df = Seq(
  ("john", "tomato", 1.99),
  ("john", "carrot", 0.45),
  ("bill", "apple", 0.99),
  ("john", "banana", 1.29),
  ("bill", "taco", 2.59)
).toDF("name", "food", "price")

val df2 = df.groupBy("name").agg(
  collect_list(col("food")) as "food",
  collect_list(col("price")) as "price" 
).withColumn("food", zipper(col("food"), col("price"))).drop("price")

df2.show(false)
// +----+---------------------------------------------+
// |name|food                                         |
// +----+---------------------------------------------+
// |john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
// |bill|[[apple,0.99], [taco,2.59]]                  |
// +----+---------------------------------------------+

Answer by Daniel Siegmann

Consider using the struct function to group the columns together before collecting them as a list:

import org.apache.spark.sql.functions.{collect_list, struct}
import sqlContext.implicits._

val df = Seq(
  ("john", "tomato", 1.99),
  ("john", "carrot", 0.45),
  ("bill", "apple", 0.99),
  ("john", "banana", 1.29),
  ("bill", "taco", 2.59)
).toDF("name", "food", "price")

df.groupBy($"name")
  .agg(collect_list(struct($"food", $"price")).as("foods"))
  .show(false)

Outputs:

+----+---------------------------------------------+
|name|foods                                        |
+----+---------------------------------------------+
|john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
|bill|[[apple,0.99], [taco,2.59]]                  |
+----+---------------------------------------------+

Answer by Yifan Guo

Maybe a better way than the zip function (since UDFs and UDAFs are very bad for performance) is to wrap the two columns into a struct.

This would probably work as well:

import org.apache.spark.sql.functions.{collect_list, struct}

df.select('name, struct('food, 'price).as("tuple"))
  .groupBy('name)
  .agg(collect_list('tuple).as("tuples"))
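As a follow-up usage sketch (not part of the original answer; it assumes the df defined in the answers above and the implicits import that enables the 'symbol and $ column syntax), the collected array of structs can be flattened back into rows with explode, and the struct fields read by name:

import org.apache.spark.sql.functions.explode

val grouped = df.select('name, struct('food, 'price).as("tuple"))
  .groupBy('name)
  .agg(collect_list('tuple).as("tuples"))

// explode emits one row per element of the "tuples" array; the struct
// fields are then addressable with dotted column names.
grouped.select('name, explode('tuples).as("tuple"))
  .select('name, $"tuple.food", $"tuple.price")
  .show()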

Answer by Psidom

Here is an option that converts the data frame to an RDD of key-value pairs and then calls groupByKey on it. The result is a list of key-value pairs where each value is a list of tuples.

df.show
+----+------+----+
|  _1|    _2|  _3|
+----+------+----+
|john|tomato|1.99|
|john|carrot|0.45|
|bill| apple|0.99|
|john|banana|1.29|
|bill|  taco|2.59|
+----+------+----+


val tuples = df.map(row => row(0) -> (row(1), row(2)))
tuples: org.apache.spark.rdd.RDD[(Any, (Any, Any))] = MapPartitionsRDD[102] at map at <console>:43

tuples.groupByKey().map{ case(x, y) => (x, y.toList) }.collect
res76: Array[(Any, List[(Any, Any)])] = Array((bill,List((apple,0.99), (taco,2.59))), (john,List((tomato,1.99), (carrot,0.45), (banana,1.29))))
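A typed variant of the same idea (just a sketch, not part of the original answer; it assumes Spark 2.x with a SparkSession named spark and the tuple-style column names _1/_2/_3 shown above) avoids the Any types in the result:

import spark.implicits._  // provides the encoder for .as[(String, String, Double)]

// Convert to a typed Dataset, then drop to an RDD of (name, (food, price)) pairs.
val typedPairs = df.as[(String, String, Double)].rdd
  .map { case (name, food, price) => name -> (food, price) }

// groupByKey yields (name, Iterable[(food, price)]); materialize each group as a List.
typedPairs.groupByKey()
  .mapValues(_.toList)
  .collect()
// Array[(String, List[(String, Double)])] with the same grouping as above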

Answer by Neha Kumari

To your point that collect_list appears to only work for one column: for collect_list to work on multiple columns, wrap the columns you want to aggregate in a struct. For example:

import org.apache.spark.sql.functions.{collect_list, struct}

val aggregatedData = df.groupBy("name").agg(collect_list(struct("item", "price")) as "foods")

aggregatedData.show(false)
+----+------------------------------------------------+
|name|foods                                           |
+----+------------------------------------------------+
|john|[[tomato, 1.99], [carrot, 0.45], [banana, 1.29]]|
|bill|[[apple, 0.99], [taco, 2.59]]                   |
+----+------------------------------------------------+