Renaming column names of a DataFrame in Spark Scala

Note: this page reproduces a popular StackOverflow Q&A under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35592917/

Tags: scala, apache-spark, dataframe, apache-spark-sql

Asked by Sam

I am trying to convert all the headers / column names of a DataFrame in Spark (Scala). As of now I have come up with the following code, which only replaces a single column name.

for (i <- 0 to origCols.length - 1) {
  // withColumnRenamed returns a new DataFrame; the result is discarded
  // on every iteration here, so none of the renames actually stick.
  df.withColumnRenamed(
    df.columns(i),
    df.columns(i).toLowerCase
  )
}
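
Since each renamed DataFrame is thrown away, one way to make the renames stick is to build all the lowercased names up front and apply them in a single call. A minimal sketch, assuming df is the DataFrame above (the answers below elaborate on this approach):

// Lowercase every column name at once; toDF takes the names as varargs.
val lowered = df.toDF(df.columns.map(_.toLowerCase): _*)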

Answered by zero323

If the structure is flat:

val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
//  |-- _1: long (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- _3: string (nullable = true)
//  |-- _4: double (nullable = false)

the simplest thing you can do is to use the toDF method:

val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)

dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)

If you want to rename individual columns you can use either select with alias:

df.select($"_1".alias("x1"))

which can be easily generalized to multiple columns:

import org.apache.spark.sql.functions.col

val lookup = Map("_1" -> "foo", "_3" -> "bar")

df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

or withColumnRenamed:

df.withColumnRenamed("_1", "x1")

which can be used with foldLeft to rename multiple columns:

lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))

With nested structures (structs) one possible option is renaming by selecting a whole structure:

val nested = spark.read.json(sc.parallelize(Seq(
    """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))

nested.printSchema
// root
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: struct (nullable = true)
//  |    |    |-- bar: struct (nullable = true)
//  |    |    |    |-- first: double (nullable = true)
//  |    |    |    |-- second: double (nullable = true)
//  |-- id: long (nullable = true)

@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")

nested.select(foobarRenamed, $"id").printSchema
// root
//  |-- record: struct (nullable = false)
//  |    |-- location: struct (nullable = false)
//  |    |    |-- point: struct (nullable = false)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
//  |-- id: long (nullable = true)

Note that it may affect nullability metadata. Another possibility is to rename by casting:

nested.select($"foobar".cast(
  "struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

or:

import org.apache.spark.sql.types._

nested.select($"foobar".cast(
  StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("point", StructType(Seq(
        StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

Answered by Tagar

For those of you interested in the PySpark version (it's actually the same in Scala - see the comment below):

    merchants_df_renamed = merchants_df.toDF(
        'merchant_id', 'category', 'subcategory', 'merchant')

    merchants_df_renamed.printSchema()

Result:

root
|-- merchant_id: integer (nullable = true)
|-- category: string (nullable = true)
|-- subcategory: string (nullable = true)
|-- merchant: string (nullable = true)
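
For reference, a minimal Scala sketch of the same call; merchants_df and the column names are assumptions carried over from the PySpark snippet above:

// Scala equivalent: toDF takes the new column names as varargs.
val merchantsRenamed = merchants_df.toDF(
  "merchant_id", "category", "subcategory", "merchant")

merchantsRenamed.printSchema()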

Answered by Mylo Stone

// Returns a copy of t with every column renamed to p + originalName + s.
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
{
  t.select( t.columns.map { c => t.col(c).as( p + c + s) } : _* )
}

In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns sharing the same name and you wish to join them while still being able to disambiguate the columns in the resulting table. It sure would be nice if there were a similar way to do this in "normal" SQL.

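A hedged usage sketch of the join scenario just described; t1, t2, and their columns are hypothetical:

// Both hypothetical tables have "id" and "amount" columns; prefixing one
// side's columns before the join keeps every output name unique.
val t2Prefixed = aliasAllColumns(t2, p = "t2_")
val joined = t1.join(t2Prefixed, t1("id") === t2Prefixed("t2_id"))
// joined now exposes both "amount" and "t2_amount" without ambiguity.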

Answered by Jagadeesh Verri

Suppose the DataFrame df has 3 columns, id1, name1, and price1, and you wish to rename them to id2, name2, and price2:

val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list: _*)
df2.columns.foreach(println)

I found this approach useful in many cases.
