Renaming column names of a DataFrame in Spark Scala

Note: this page reproduces a popular StackOverflow Q&A under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35592917/

Tags: scala, apache-spark, dataframe, apache-spark-sql

Asked by Sam

I am trying to convert all the headers / column names of a DataFrame in Spark (Scala). As of now I have come up with the following code, which only replaces a single column name.

for (i <- 0 to origCols.length - 1) {
  // withColumnRenamed returns a new DataFrame; the result is discarded
  // on every iteration here, so none of the renames actually stick.
  df.withColumnRenamed(
    df.columns(i),
    df.columns(i).toLowerCase
  )
}
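
Since each renamed DataFrame is thrown away, one way to make the renames stick is to build all the lowercased names up front and apply them in a single call. A minimal sketch, assuming df is the DataFrame above (the answers below elaborate on this approach):

// Lowercase every column name at once; toDF takes the names as varargs.
val lowered = df.toDF(df.columns.map(_.toLowerCase): _*)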

Answered by zero323

If the structure is flat:

val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
//  |-- _1: long (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- _3: string (nullable = true)
//  |-- _4: double (nullable = false)

the simplest thing you can do is to use the toDF method:

val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)

dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)

If you want to rename individual columns you can use either select with alias:

df.select($"_1".alias("x1"))

which can be easily generalized to multiple columns:

import org.apache.spark.sql.functions.col

val lookup = Map("_1" -> "foo", "_3" -> "bar")

df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

or withColumnRenamed:

df.withColumnRenamed("_1", "x1")

which can be used with foldLeft to rename multiple columns:

lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))

With nested structures (structs) one possible option is renaming by selecting a whole structure:

val nested = spark.read.json(sc.parallelize(Seq(
    """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))

nested.printSchema
// root
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: struct (nullable = true)
//  |    |    |-- bar: struct (nullable = true)
//  |    |    |    |-- first: double (nullable = true)
//  |    |    |    |-- second: double (nullable = true)
//  |-- id: long (nullable = true)

@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")

nested.select(foobarRenamed, $"id").printSchema
// root
//  |-- record: struct (nullable = false)
//  |    |-- location: struct (nullable = false)
//  |    |    |-- point: struct (nullable = false)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
//  |-- id: long (nullable = true)

Note that it may affect nullability metadata. Another possibility is to rename by casting:

nested.select($"foobar".cast(
  "struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

or:

import org.apache.spark.sql.types._

nested.select($"foobar".cast(
  StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("point", StructType(Seq(
        StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

Answered by Tagar

For those of you interested in the PySpark version (it's actually the same in Scala - see the comment below):

    merchants_df_renamed = merchants_df.toDF(
        'merchant_id', 'category', 'subcategory', 'merchant')

    merchants_df_renamed.printSchema()

Result:

root
|-- merchant_id: integer (nullable = true)
|-- category: string (nullable = true)
|-- subcategory: string (nullable = true)
|-- merchant: string (nullable = true)
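
For reference, a minimal Scala sketch of the same call; merchants_df and the column names are assumptions carried over from the PySpark snippet above:

// Scala equivalent: toDF takes the new column names as varargs.
val merchantsRenamed = merchants_df.toDF(
  "merchant_id", "category", "subcategory", "merchant")

merchantsRenamed.printSchema()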

Answered by Mylo Stone

// Returns a copy of t with every column renamed to p + originalName + s.
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
{
  t.select( t.columns.map { c => t.col(c).as( p + c + s) } : _* )
}

In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns sharing the same name and you wish to join them while still being able to disambiguate the columns in the resulting table. It sure would be nice if there were a similar way to do this in "normal" SQL.

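A hedged usage sketch of the join scenario just described; t1, t2, and their columns are hypothetical:

// Both hypothetical tables have "id" and "amount" columns; prefixing one
// side's columns before the join keeps every output name unique.
val t2Prefixed = aliasAllColumns(t2, p = "t2_")
val joined = t1.join(t2Prefixed, t1("id") === t2Prefixed("t2_id"))
// joined now exposes both "amount" and "t2_amount" without ambiguity.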

Answered by Jagadeesh Verri

Suppose the DataFrame df has 3 columns, id1, name1, and price1, and you wish to rename them to id2, name2, and price2:

val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list: _*)
df2.columns.foreach(println)

I found this approach useful in many cases.
