Scala: How to name aggregate columns?

Question

Asked by Emre

I'm using Spark in Scala and my aggregated columns are anonymous. Is there a convenient way to rename multiple columns in a dataset? I thought about imposing a schema with as, but the key column is a struct (due to the groupBy operation), and I can't figure out how to define a case class with a StructType in it.

I tried defining a schema as follows:

val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),
                                                             StructField("dst", IntegerType), true)), 
                              StructField("count", LongType, true))
edge_count.as[returnSchema]

but I got a compile error:

Message: <console>:74: error: overloaded method value apply with alternatives:
  (fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
  (fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
  (fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
 cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, Boolean)
       val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),
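The alternatives listed in the error message all take a single collection of fields (Array, java.util.List, or Seq), while the snippet passes the fields as separate arguments and has a misplaced parenthesis after the "dst" field. A minimal sketch of a schema that should compile, assuming the intended layout is a nested "edge" struct plus a "count" column:

```scala
import org.apache.spark.sql.types._

// Each StructType receives one Seq of StructField values.
val returnSchema = StructType(Seq(
  StructField("edge", StructType(Seq(
    StructField("src", IntegerType, nullable = true),
    StructField("dst", IntegerType, nullable = true)
  )), nullable = true),
  StructField("count", LongType, nullable = true)
))

// Print the nested layout to confirm it matches the intended shape.
returnSchema.printTreeString()
```

Note that as[...] expects a type with an Encoder (for example a case class or a tuple), not a StructType value, so this schema cannot be passed to as directly.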

Answer 1

Accepted answer by Emre

I ended up using aliases with the select statement; e.g.,

ds.select($"key.src".as[Short], 
          $"key.dst".as[Short], 
          $"sum(count)".alias("count").as[Long])

First I had to use printSchema to determine the derived column names:

> ds.printSchema

root
 |-- key: struct (nullable = false)
 |    |-- src: short (nullable = false)
 |    |-- dst: short (nullable = false)
 |-- sum(count): long (nullable = true)

Answer 2

Answer by Sim

The best solution is to name your columns explicitly, e.g.,

df
  .groupBy('a, 'b)
  .agg(
    expr("count(*) as cnt"),
    expr("sum(x) as x"),
    expr("sum(y)").as("y")
  )

If you are using a dataset, you have to provide the type of your columns, e.g., expr("count(*) as cnt").as[Long].

You can use the DSL directly but I often find it to be more verbose than simple SQL expressions.
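For comparison, here is a sketch of the same aggregation written with the column DSL instead of SQL expression strings. The input data and the local SparkSession setup are assumptions for illustration; in spark-shell the existing session is reused.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, sum}

val spark = SparkSession.builder().master("local[*]").appName("agg-names-dsl").getOrCreate()
import spark.implicits._

// Hypothetical input: grouping columns a and b, numeric columns x and y.
val df = Seq(("k1", "p", 1L, 10L), ("k1", "p", 2L, 20L), ("k2", "q", 3L, 30L))
  .toDF("a", "b", "x", "y")

// Each aggregate is named explicitly with as(...), so no generated names appear.
val named = df
  .groupBy($"a", $"b")
  .agg(
    count(lit(1)).as("cnt"),
    sum($"x").as("x"),
    sum($"y").as("y")
  )

named.printSchema()
```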

If you want to do mass renames, put the old and new names in a Map and then foldLeft over it, renaming one column of the DataFrame at each step.
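A minimal sketch of that mass rename, assuming the aggregation was left unnamed so that Spark generated names such as sum(x). The renames map, the input data, and the session setup are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("agg-names-fold").getOrCreate()
import spark.implicits._

// Hypothetical input: grouping columns a and b, numeric columns x and y.
val df = Seq(("k1", "p", 1L, 10L), ("k1", "p", 2L, 20L), ("k2", "q", 3L, 30L))
  .toDF("a", "b", "x", "y")

// Without aliases the aggregate columns get generated names: sum(x), sum(y).
val aggregated = df.groupBy($"a", $"b").agg(sum($"x"), sum($"y"))

// Old name -> new name. withColumnRenamed silently ignores names that do not exist.
val renames = Map("sum(x)" -> "x", "sum(y)" -> "y")

// Thread the DataFrame through the map, applying one rename per entry.
val renamed = renames.foldLeft(aggregated) { case (acc, (from, to)) =>
  acc.withColumnRenamed(from, to)
}

renamed.printSchema()
```

Running printSchema on the aggregated frame first, as in the accepted answer, is a simple way to find the generated names to use as map keys.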
