scala: How to aggregate values into a collection after groupBy?

Question

Asked by Eric Patterson

I have a DataFrame with the following schema:

[visitorId: string, trackingIds: array<string>, emailIds: array<string>]

I'm looking for a way to group (or maybe roll up?) this DataFrame by visitorId so that the trackingIds and emailIds arrays are concatenated within each group. For example, if my initial df looks like this:

+---------+------------+--------+
|visitorId|trackingIds |emailIds|
+---------+------------+--------+
|a158     |[666b]      |[12]    |
|7g21     |[c0b5]      |[45]    |
|7g21     |[c0b4]      |[87]    |
|a158     |[666b, 777c]|[]      |
+---------+------------+--------+

I would like my output df to look like this:

+---------+------------------+--------+
|visitorId|trackingIds       |emailIds|
+---------+------------------+--------+
|a158     |[666b, 666b, 777c]|[12, '']|
|7g21     |[c0b5, c0b4]      |[45, 87]|
+---------+------------------+--------+

I've tried using the groupBy and agg operators, but haven't had much luck.

Answer 1

Accepted answer by zero323

Spark >= 2.4

You can replace the flatten udf (defined in the next section) with the built-in flatten function:

import org.apache.spark.sql.functions.flatten

leaving the rest as-is.
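
For reference, a minimal end-to-end sketch of the Spark >= 2.4 variant, written in spark-shell / script style. The local SparkSession setup and the app name are assumptions for experimenting and are not part of the answer; the data is the sample from the question. Element order inside each collected array is not guaranteed after a shuffle, so no expected output is shown.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, collect_list, flatten, lit, size, when}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("flatten-sketch")
  .getOrCreate()
import spark.implicits._

// Same sample data as in the question.
val df = Seq(
  ("a158", Seq("666b"), Seq("12")),
  ("7g21", Seq("c0b5"), Seq("45")),
  ("7g21", Seq("c0b4"), Seq("87")),
  ("a158", Seq("666b", "777c"), Seq.empty[String])
).toDF("visitorId", "trackingIds", "emailIds")

// Replace empty emailIds arrays with a single empty-string placeholder.
val dfWithPlaceholders = df.withColumn(
  "emailIds",
  when(size($"emailIds") === 0, array(lit(""))).otherwise($"emailIds"))

// collect_list builds an array of arrays per visitor;
// the built-in flatten merges them into one array.
val result = dfWithPlaceholders
  .groupBy($"visitorId")
  .agg(
    flatten(collect_list($"trackingIds")).alias("trackingIds"),
    flatten(collect_list($"emailIds")).alias("emailIds"))

result.show(false)
```

If the empty-string placeholder is not wanted, the withColumn step can be dropped; flatten then simply contributes nothing for empty arrays.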

Spark >= 2.0, < 2.4

It is possible, but quite expensive. Using the data you've provided:

case class Record(
    visitorId: String, trackingIds: Array[String], emailIds: Array[String])

val df = Seq(
  Record("a158", Array("666b"), Array("12")),
  Record("7g21", Array("c0b5"), Array("45")),
  Record("7g21", Array("c0b4"), Array("87")),
  Record("a158", Array("666b",  "777c"), Array.empty[String])).toDF

and a helper function:

import org.apache.spark.sql.functions.udf

val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten)

we can fill the blanks with placeholders:

import org.apache.spark.sql.functions.{array, lit, size, when}

val dfWithPlaceholders = df.withColumn(
  "emailIds", 
  when(size($"emailIds") === 0, array(lit(""))).otherwise($"emailIds"))

then collect_list and flatten:

import org.apache.spark.sql.functions.{array, collect_list}

val emailIds = flatten(collect_list($"emailIds")).alias("emailIds")
val trackingIds = flatten(collect_list($"trackingIds")).alias("trackingIds")

dfWithPlaceholders
  .groupBy($"visitorId")
  .agg(trackingIds, emailIds)

// +---------+------------------+--------+
// |visitorId|       trackingIds|emailIds|
// +---------+------------------+--------+
// |     a158|[666b, 666b, 777c]|  [12, ]|
// |     7g21|      [c0b5, c0b4]|[45, 87]|
// +---------+------------------+--------+

With a statically typed Dataset:

dfWithPlaceholders.as[Record]
  .groupByKey(_.visitorId)
  .mapGroups { case (key, vs) => 
    vs.map(v => (v.trackingIds, v.emailIds)).toArray.unzip match {
      case (trackingIds, emailIds) => 
        Record(key, trackingIds.flatten, emailIds.flatten)
  }}

// +---------+------------------+--------+
// |visitorId|       trackingIds|emailIds|
// +---------+------------------+--------+
// |     a158|[666b, 666b, 777c]|  [12, ]|
// |     7g21|      [c0b5, c0b4]|[45, 87]|
// +---------+------------------+--------+

Spark 1.x

You can convert to an RDD and group:

import org.apache.spark.sql.Row

dfWithPlaceholders.rdd
  .map {
     case Row(id: String,
       trcks: Seq[String @unchecked],
       emails: Seq[String @unchecked]) => (id, (trcks, emails))
  }
  .groupByKey
  .map {case (key, vs) => vs.toArray.unzip match {
    case (trackingIds, emailIds) => 
      Record(key, trackingIds.flatten, emailIds.flatten)
  }}
  .toDF

// +---------+------------------+--------+
// |visitorId|       trackingIds|emailIds|
// +---------+------------------+--------+
// |     7g21|      [c0b5, c0b4]|[45, 87]|
// |     a158|[666b, 666b, 777c]|  [12, ]|
// +---------+------------------+--------+

Answer 2

Answer by Jacek Laskowski

@zero323's answer is pretty much complete, but Spark gives us even more flexibility. How about the following solution?

import org.apache.spark.sql.functions._
inventory
  .select($"*", explode($"trackingIds") as "tracking_id")
  .select($"*", explode($"emailIds") as "email_id")
  .groupBy("visitorId")
  .agg(
    collect_list("tracking_id") as "trackingIds",
    collect_list("email_id") as "emailIds")

That however leaves out all empty collections (so there's some room for improvement :))
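
A plain-Scala sketch (no Spark needed) of how two successive explodes expand a single row may help when judging this approach. The row below is hypothetical, and the for-comprehension only imitates the row expansion; it is not Spark code. It suggests two things to watch: a row with m tracking ids and n email ids turns into m × n rows, so values repeat once collected again, and a row with an empty array yields no rows at all.

```scala
// Hypothetical row: one visitor with two tracking ids and two email ids.
val row = ("v1", Seq("t1", "t2"), Seq("e1", "e2"))

// Two successive explodes behave like a nested loop over both arrays.
val exploded = for {
  t <- row._2
  e <- row._3
} yield (row._1, t, e)

// Collecting each column back per visitor repeats every value.
val trackingIds = exploded.map(_._2) // Seq("t1", "t1", "t2", "t2")
val emailIds = exploded.map(_._3)    // Seq("e1", "e2", "e1", "e2")

// An empty array produces no rows, so the tracking id on that row is lost too.
val emptyCase = for {
  t <- Seq("t1")
  e <- Seq.empty[String]
} yield (t, e) // empty
```

Flattening the collected arrays, as in the accepted answer, sidesteps both effects because each array column is aggregated independently.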

Answer 3

Answer by gourav sb

You can use a user-defined aggregate function (UDAF).

1) Create a custom UDAF as a Scala class called CustomAggregation.

package com.example.udaf // placeholder package name (`package` itself is a reserved word)

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CustomAggregation extends UserDefinedAggregateFunction {

  // Input data type schema
  def inputSchema: StructType =
    StructType(Array(StructField("col5", ArrayType(StringType))))

  // Intermediate (buffer) schema
  def bufferSchema: StructType =
    StructType(Array(StructField("col5_collapsed", ArrayType(StringType))))

  // Returned data type
  def dataType: DataType = ArrayType(StringType)

  // The same input always produces the same output
  def deterministic: Boolean = true

  // Called to initialize the buffer for each group
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Seq.empty[String]
  }

  // Called for each entry of a group; null arrays are skipped
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getSeq[String](0) ++ input.getSeq[String](0)
    }
  }

  // Merge two partial aggregates
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0)
  }

  // Called after all the entries of a group are exhausted.
  // Note: distinct removes duplicate values from the result.
  def evaluate(buffer: Row): Any = buffer.getSeq[String](0).distinct
}

2) Then use the UDAF in your code as follows:

// Define the UDAF
val customAggregation = new CustomAggregation()

// df is your DataFrame; col1, col2, col3 and col5 are placeholder column names
df
  .groupBy($"col1", $"col2", $"col3")
  .agg(customAggregation($"col5"))
  .show()
