How to specify schema for CSV file without using Scala case class?
Warning: this page is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40653813/
Asked by Ishan Kumar
I am loading a CSV file into a DataFrame as below.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("dataframes").setMaster("local")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = spark.
  read.
  format("org.apache.spark.csv").
  option("header", true).
  csv("/home/cloudera/Book1.csv")
scala> df.printSchema()
root
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- age: string (nullable = true)
How do I change the age column to be of type Int?
Answered by Jacek Laskowski
Given val spark = SparkSession.builder().getOrCreate(), I guess you're using Spark 2.x.
First of all, please note that Spark 2.x has native support for the CSV format and as such does not require specifying the format by its long name, i.e. org.apache.spark.csv, but just csv.
spark.read.format("csv")...
Since you use the csv operator, the CSV format is implied and so you can skip/remove format("csv").
// note that I removed format("csv")
spark.read.option("header", true).csv("/home/cloudera/Book1.csv")
With that you have plenty of options, but I strongly recommend using a case class for...just the schema. See the last solution if you're curious how to do it in Spark 2.0.
cast operator
You could use the cast operator.
scala> Seq("1").toDF("str").withColumn("num", 'str cast "int").printSchema
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
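Applied to the question's DataFrame, a minimal sketch (df and the age column are the ones from the question; dfTyped is just an illustrative name):
import org.apache.spark.sql.types.IntegerType
// cast the string-typed age column to integer, keeping the same column name
val dfTyped = df.withColumn("age", df("age").cast(IntegerType))
scala> dfTyped.printSchema
root
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- age: integer (nullable = true)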
Using StructType
You can also use your own hand-crafted schema with StructType and StructField as follows:
import org.apache.spark.sql.types._
val schema = StructType(
  StructField("str", StringType, true) ::
  StructField("num", IntegerType, true) :: Nil)
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
val q = spark.
  read.
  option("header", true).
  schema(schema).
  csv("numbers.csv")
scala> q.printSchema
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
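Adapted to the file from the question, a sketch assuming the three columns shown in the question's printSchema output:
import org.apache.spark.sql.types._
// hand-crafted schema for Book1.csv: name, address, age
val bookSchema = StructType(
  StructField("name", StringType, true) ::
  StructField("address", StringType, true) ::
  StructField("age", IntegerType, true) :: Nil)
val books = spark.read.option("header", true).schema(bookSchema).csv("/home/cloudera/Book1.csv")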
Schema DSL
What I found quite interesting lately was the so-called Schema DSL. The above schema built using StructType and StructField can be re-written as follows:
import org.apache.spark.sql.types._
import spark.implicits._ // provides the $-notation used below

val schema = StructType(
  $"str".string ::
  $"num".int :: Nil)
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
// or even
val schema = new StructType().
  add($"str".string).
  add($"num".int)
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
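And a sketch of the same DSL applied to the question's columns (column names assumed from the question's printSchema output):
val bookSchema = new StructType().
  add($"name".string).
  add($"address".string).
  add($"age".int)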
Encoders
Encoders are so easy to use that it's hard to believe you would not want them, even if only to build a schema without dealing with StructType, StructField and DataType.
// Define a business object that describes your dataset
case class MyRecord(str: String, num: Int)
// Use Encoders object to create a schema off the business object
import org.apache.spark.sql.Encoders
val schema = Encoders.product[MyRecord].schema
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = false)
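A sketch of wiring such a derived schema into the CSV reader, with a hypothetical Book case class mirroring the question's three columns:
// business object describing a row of Book1.csv
case class Book(name: String, address: String, age: Int)
import org.apache.spark.sql.Encoders

val books = spark.
  read.
  option("header", true).
  schema(Encoders.product[Book].schema). // schema derived from the case class
  csv("/home/cloudera/Book1.csv")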
Answered by vdep
There is an inferSchema option to automatically recognize the type of each column:
val df = spark.read
  .format("org.apache.spark.csv")
  .option("header", true)
  .option("inferSchema", true) // <-- HERE
  .csv("/home/cloudera/Book1.csv")
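Note that inferSchema makes an extra pass over the data to sample the column types; assuming the age column holds only numbers, you should then see something like:
scala> df.printSchema
root
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- age: integer (nullable = true)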
spark-csv was originally an external library by Databricks, but it has been included in core Spark from version 2.0 onwards. You can refer to the documentation on the library's GitHub page to find the available options.
Answered by Shiv4nsh
What you can do in this case is use a UDF:
Step 1: Make a UDF that converts String to Int.
import org.apache.spark.sql.functions.udf

// note: unlike cast, value.toInt throws a NumberFormatException on non-numeric input
val stringToIntUDF = udf((value: String) => value.toInt)
Step 2: Apply this UDF to the column that you want to convert!
val updatedDF = df.withColumn("age", stringToIntUDF(df("age")))
updatedDF.printSchema
This should give you your desired result!
If you just want to infer the schema from the CSV file, then @vdep's solution seems to be doing the right thing!
val df = spark.read
  .format("org.apache.spark.csv")
  .option("header", true)
  .option("inferSchema", "true") // <-- HERE
  .csv("/home/cloudera/Book1.csv")