Scala: How to name aggregate columns?
Original question: http://stackoverflow.com/questions/38576040/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
How to name aggregate columns?
Asked by Emre
I'm using Spark in Scala and my aggregated columns are anonymous. Is there a convenient way to rename multiple columns from a dataset? I thought about imposing a schema with as, but the key column is a struct (due to the groupBy operation), and I can't find out how to define a case class with a StructType in it.
I tried defining a schema as follows:
val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),
StructField("dst", IntegerType), true)),
StructField("count", LongType, true))
edge_count.as[returnSchema]
but I got a compile error:
Message: <console>:74: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, Boolean)
val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),
Accepted answer by Emre
I ended up using aliases with the select statement; e.g.,
ds.select($"key.src".as[Short],
          $"key.dst".as[Short],
          $"sum(count)".alias("count").as[Long])
First I had to use printSchema to determine the derived column names:
> ds.printSchema
root
 |-- key: struct (nullable = false)
 |    |-- src: short (nullable = false)
 |    |-- dst: short (nullable = false)
 |-- sum(count): long (nullable = true)
Answer by Sim
The best solution is to name your columns explicitly, e.g.,
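import org.apache.spark.sql.functions.expr
// The Symbol columns 'a and 'b also require: import spark.implicits._

df
  .groupBy('a, 'b)
  .agg(
    expr("count(*) as cnt"),   // alias given inside the SQL expression
    expr("sum(x) as x"),
    expr("sum(y)").as("y")     // alias given on the Column
  )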
If you are using a dataset, you have to provide the type of your columns, e.g., expr("count(*) as cnt").as[Long].
You can use the DSL directly but I often find it to be more verbose than simple SQL expressions.
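For comparison, the same aggregation written with the DSL might look like the sketch below. It assumes a DataFrame with columns a, b, x and y, as in the example above; the local session and the sample rows are made up purely for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, sum}

// Assumption: a local session and a tiny made-up DataFrame, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("named-aggregates").getOrCreate()
import spark.implicits._

val df = Seq(("k1", "k2", 1L, 10L), ("k1", "k2", 2L, 20L)).toDF("a", "b", "x", "y")

// Each aggregate gets its name at the point where it is defined.
val named = df
  .groupBy(col("a"), col("b"))
  .agg(
    count(lit(1)).as("cnt"),
    sum(col("x")).as("x"),
    sum(col("y")).as("y")
  )

named.printSchema()
```

Whether this reads better than the expr version is a matter of taste; the DSL form is checked by the compiler for function names, while the SQL strings are only parsed at runtime.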
If you want to do mass renames, use a Map and then foldLeft the dataframe.
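A minimal sketch of the Map and foldLeft approach follows. The renames map and the sample DataFrame are hypothetical; the derived names (sum(x), sum(y)) stand in for whatever printSchema reports for your own aggregation.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("bulk-rename").getOrCreate()
import spark.implicits._

// Made-up data whose column names mimic what an unnamed aggregation produces.
val aggregated = Seq(("k", 3L, 30L)).toDF("a", "sum(x)", "sum(y)")

// Hypothetical mapping: derived column name -> desired column name.
val renames = Map("sum(x)" -> "x", "sum(y)" -> "y")

// Thread the DataFrame through the map, renaming one column per step.
val renamed: DataFrame = renames.foldLeft(aggregated) {
  case (acc, (from, to)) => acc.withColumnRenamed(from, to)
}

renamed.printSchema()
```

withColumnRenamed leaves the DataFrame unchanged when the source column does not exist, so a typo in a key of the map will not raise an error; checking renamed.columns afterwards is a cheap safeguard.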

