Scala: How to change column types in Spark SQL's DataFrame?
Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/29383107/
How to change column types in Spark SQL's DataFrame?
Asked by kevinykuo
Suppose I'm doing something like:
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()
root
|-- year: string (nullable = true)
|-- make: string (nullable = true)
|-- model: string (nullable = true)
|-- comment: string (nullable = true)
|-- blank: string (nullable = true)
df.show()
year  make   model  comment               blank
2012  Tesla  S      No comment
1997  Ford   E350   Go get one now th...
but I really wanted the year as Int (and perhaps transform some other columns).
The best I could come up with is
df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]
which is a bit convoluted.
I'm coming from R, and I'm used to being able to write, e.g.
df2 <- df %>%
  mutate(year = year %>% as.integer,
         make = make %>% toupper)
I'm likely missing something, since there should be a better way to do this in Spark/Scala...
Answered by msemelman
Edit: Newest version
Since Spark 2.x you can use .withColumn; check the withColumn docs for details.
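For example, a minimal sketch in the Spark 2.x style, assuming a SparkSession named spark and the same cars.csv file from the question:
import org.apache.spark.sql.types.IntegerType

// Spark 2.x: read the CSV through the built-in source, then cast in place
val df = spark.read.option("header", "true").csv("cars.csv")
val df2 = df.withColumn("year", df("year").cast(IntegerType))
df2.printSchema()   // year: integer (nullable = true)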
Oldest answer
Since Spark version 1.4 you can apply the cast method with DataType on the column:
import org.apache.spark.sql.types.IntegerType
val df2 = df.withColumn("yearTmp", df.year.cast(IntegerType))
.drop("year")
.withColumnRenamed("yearTmp", "year")
If you are using sql expressions you can also do:
val df2 = df.selectExpr("cast(year as int) year",
  "make",
  "model",
  "comment",
  "blank")
For more info check the docs: http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame
Answered by Svend
[EDIT: March 2016: thanks for the votes! Though really, this is not the best answer; I think the solutions based on withColumn, withColumnRenamed and cast put forward by msemelman, Martin Senne and others are simpler and cleaner].
I think your approach is ok. Recall that a Spark DataFrame is an (immutable) RDD of Rows, so we're never really replacing a column, just creating a new DataFrame each time with a new schema.
Assuming you have an original df with the following schema:
scala> df.printSchema
root
|-- Year: string (nullable = true)
|-- Month: string (nullable = true)
|-- DayofMonth: string (nullable = true)
|-- DayOfWeek: string (nullable = true)
|-- DepDelay: string (nullable = true)
|-- Distance: string (nullable = true)
|-- CRSDepTime: string (nullable = true)
And some UDFs defined on one or several columns:
import org.apache.spark.sql.functions._
val toInt = udf[Int, String]( _.toInt)
val toDouble = udf[Double, String]( _.toDouble)
val toHour = udf((t: String) => "%04d".format(t.toInt).take(2).toInt )
val days_since_nearest_holidays = udf(
  (year: String, month: String, dayOfMonth: String) => year.toInt + 27 + month.toInt - 12
)
Changing column types or even building a new DataFrame from another can be written like this:
val featureDf = df
  .withColumn("departureDelay", toDouble(df("DepDelay")))
  .withColumn("departureHour", toHour(df("CRSDepTime")))
  .withColumn("dayOfWeek", toInt(df("DayOfWeek")))
  .withColumn("dayOfMonth", toInt(df("DayofMonth")))
  .withColumn("month", toInt(df("Month")))
  .withColumn("distance", toDouble(df("Distance")))
  .withColumn("nearestHoliday", days_since_nearest_holidays(
    df("Year"), df("Month"), df("DayofMonth"))
  )
  .select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth",
    "month", "distance", "nearestHoliday")
which yields:
scala> featureDf.printSchema
root
|-- departureDelay: double (nullable = true)
|-- departureHour: integer (nullable = true)
|-- dayOfWeek: integer (nullable = true)
|-- dayOfMonth: integer (nullable = true)
|-- month: integer (nullable = true)
|-- distance: double (nullable = true)
|-- nearestHoliday: integer (nullable = true)
This is pretty close to your own solution. Simply keeping the type changes and other transformations as separate udf vals makes the code more readable and re-usable.
Answered by Martin Senne
As the cast operation is available on Spark Columns (and as I personally do not favour the udfs proposed by @Svend at this point), how about:
df.select( df("year").cast(IntegerType).as("year"), ... )
to cast to the requested type? As a neat side effect, values that are not castable / "convertible" in that sense will become null.
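For instance, a small sketch of that behaviour (the two-row dataset is hypothetical, and spark.implicits._ is assumed to be in scope for toDF):
import org.apache.spark.sql.types.IntegerType

val mixed = Seq("2012", "not a year").toDF("year")
mixed.select(mixed("year").cast(IntegerType).as("year")).show()
// +----+
// |year|
// +----+
// |2012|
// |null|
// +----+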
In case you need this as a helper method, use:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DataType

object DFHelper {
  def castColumnTo(df: DataFrame, cn: String, tpe: DataType): DataFrame = {
    df.withColumn(cn, df(cn).cast(tpe))
  }
}
which is used like:
import DFHelper._
val df2 = castColumnTo( df, "year", IntegerType )
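If several columns need the same treatment, the helper composes naturally with foldLeft. A sketch, assuming the imports above and a hypothetical second column "price":
import org.apache.spark.sql.types.{DoubleType, IntegerType}

val casts = Seq("year" -> IntegerType, "price" -> DoubleType)
val df3 = casts.foldLeft(df) { case (acc, (name, tpe)) => castColumnTo(acc, name, tpe) }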
Answered by u2130573
First, if you wanna cast type, then this:
import org.apache.spark.sql
df.withColumn("year", $"year".cast(sql.types.IntegerType))
With the same column name, the column will be replaced with the new one. You don't need to add and delete steps.
Second, about Scala vs R.
This is the code most similar to R that I can come up with:
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.IntegerType

val df2 = df.select(
  df.columns.map {
    case year @ "year" => df(year).cast(IntegerType).as(year)
    case make @ "make" => functions.upper(df(make)).as(make)
    case other => df(other)
  }: _*
)
Though the code is a little longer than R's, that has nothing to do with the verbosity of the language. In R, mutate is a special function for R data frames, while in Scala you can easily write an ad-hoc one thanks to its expressive power.
In a word, it avoids specific solutions, because the language design is good enough for you to quickly and easily build your own domain language.
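To make that concrete, here is a hypothetical mutate-like helper sketched in a few lines (the name and shape are mine, not part of any library):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.IntegerType

// replace the named columns with the given expressions, leave the rest untouched
def mutate(df: DataFrame, changes: (String, Column)*): DataFrame =
  changes.foldLeft(df) { case (acc, (name, expr)) => acc.withColumn(name, expr) }

val df3 = mutate(df,
  "year" -> df("year").cast(IntegerType),
  "make" -> functions.upper(df("make")))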
Side note: df.columns is surprisingly an Array[String] instead of Array[Column]; maybe they want it to look like Python pandas's dataframe.
Answered by dnlbrky
You can use selectExpr to make it a little cleaner:
df.selectExpr("cast(year as int) as year", "upper(make) as make",
"model", "comment", "blank")
Answered by manishbelsare
Java code for modifying the datatype of the DataFrame from String to Integer
df.withColumn("col_name", df.col("col_name").cast(DataTypes.IntegerType))
It will simply cast the existing column (String datatype) to Integer.
Answered by Peter Rose
To convert the year from string to int, you can add the following option to the csv reader: "inferSchema" -> "true"; see the Databricks documentation.
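A sketch with the spark-csv source from the question (same cars.csv; with inferSchema enabled the reader samples the data and picks numeric types by itself):
val df = sqlContext.load("com.databricks.spark.csv",
  Map("path" -> "cars.csv", "header" -> "true", "inferSchema" -> "true"))
df.printSchema()   // year should now come back as an integer rather than a string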
Answered by ben jarman
So this only really works if you're having issues saving to a JDBC driver like SQL Server, but it's really helpful for errors you will run into with syntax and types.
import org.apache.spark.sql.types._
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}

val SQLServerDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:jtds:sqlserver") || url.contains("sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(5000)", java.sql.Types.VARCHAR))
    case BooleanType => Some(JdbcType("BIT(1)", java.sql.Types.BIT))
    case IntegerType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
    case LongType => Some(JdbcType("BIGINT", java.sql.Types.BIGINT))
    case DoubleType => Some(JdbcType("DOUBLE PRECISION", java.sql.Types.DOUBLE))
    case FloatType => Some(JdbcType("REAL", java.sql.Types.REAL))
    case ShortType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
    case ByteType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
    case BinaryType => Some(JdbcType("BINARY", java.sql.Types.BINARY))
    case TimestampType => Some(JdbcType("DATE", java.sql.Types.DATE))
    case DateType => Some(JdbcType("DATE", java.sql.Types.DATE))
    // case DecimalType.Fixed(precision, scale) => Some(JdbcType("NUMBER(" + precision + "," + scale + ")", java.sql.Types.NUMERIC))
    case t: DecimalType => Some(JdbcType(s"DECIMAL(${t.precision},${t.scale})", java.sql.Types.DECIMAL))
    case _ => throw new IllegalArgumentException(s"Don't know how to save ${dt.json} to JDBC")
  }
}

JdbcDialects.registerDialect(SQLServerDialect)
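Once the dialect is registered, an ordinary JDBC write picks it up. A hedged sketch; the URL, table name and credentials below are placeholders:
import java.util.Properties

val props = new Properties()
props.put("user", "username")       // placeholder credentials
props.put("password", "password")

df.write
  .mode("append")
  .jdbc("jdbc:sqlserver://host:1433;databaseName=mydb", "dbo.cars", props)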
Answered by user8106134
Generate a simple dataset containing five values and convert int to string type:
import org.apache.spark.sql.functions.col

val df = spark.range(5).select( col("id").cast("string") )
Answered by sauraI3h
For the answers suggesting to use cast: FYI, the cast method in Spark 1.4.1 is broken.
For example, a dataframe with a string column having the value "8182175552014127960", when cast to bigint, has the value "8182175552014128100":
df.show
+-------------------+
| a|
+-------------------+
|8182175552014127960|
+-------------------+
df.selectExpr("cast(a as bigint) a").show
+-------------------+
| a|
+-------------------+
|8182175552014128100|
+-------------------+
We had to face a lot of issues before finding this bug because we had bigint columns in production.
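Until you can upgrade past that version, one workaround (a sketch, assuming every value in the column is a valid long) is to bypass cast with an explicit udf, in the spirit of the earlier answers:
import org.apache.spark.sql.functions.udf

val toLong = udf[Long, String](_.toLong)   // parses the string directly, no precision loss
df.select(toLong(df("a")).as("a")).show()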

