How to specify schema for CSV file without using Scala case class?
Warning: this page is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40653813/
Asked by Ishan Kumar
I am loading a CSV file into a DataFrame as below.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("dataframes").setMaster("local")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = spark.
  read.
  format("org.apache.spark.csv").
  option("header", true).
  csv("/home/cloudera/Book1.csv")
scala> df.printSchema()
root
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- age: string (nullable = true)
How do I change the age column to be of type Int?
Answered by Jacek Laskowski
Given val spark = SparkSession.builder().getOrCreate(), I guess you're using Spark 2.x.
First of all, please note that Spark 2.x has native support for the CSV format and as such does not require specifying the format by its long name, i.e. org.apache.spark.csv, but just csv.
spark.read.format("csv")...
Since you use the csv operator, the CSV format is implied and so you can skip/remove format("csv").
// note that I removed format("csv")
spark.read.option("header", true).csv("/home/cloudera/Book1.csv")
With that you have plenty of options, but I strongly recommend using a case class for...just the schema. See the last solution if you're curious how to do it in Spark 2.0.
cast operator
You could use the cast operator.
scala> Seq("1").toDF("str").withColumn("num", 'str cast "int").printSchema
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
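Applied to the question's DataFrame, a minimal sketch (df and the age column are the ones from the question; dfTyped is just an illustrative name):
import org.apache.spark.sql.types.IntegerType
// cast the string-typed age column to integer, keeping the same column name
val dfTyped = df.withColumn("age", df("age").cast(IntegerType))
scala> dfTyped.printSchema
root
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- age: integer (nullable = true)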
Using StructType
You can also use your own hand-crafted schema with StructType and StructField as follows:
import org.apache.spark.sql.types._
val schema = StructType(
  StructField("str", StringType, true) ::
  StructField("num", IntegerType, true) :: Nil)
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
val q = spark.
  read.
  option("header", true).
  schema(schema).
  csv("numbers.csv")
scala> q.printSchema
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
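Adapted to the file from the question, a sketch assuming the three columns shown in the question's printSchema output:
import org.apache.spark.sql.types._
// hand-crafted schema for Book1.csv: name, address, age
val bookSchema = StructType(
  StructField("name", StringType, true) ::
  StructField("address", StringType, true) ::
  StructField("age", IntegerType, true) :: Nil)
val books = spark.read.option("header", true).schema(bookSchema).csv("/home/cloudera/Book1.csv")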
Schema DSL
What I found quite interesting lately was the so-called Schema DSL. The above schema built using StructType and StructField can be re-written as follows:
import org.apache.spark.sql.types._
import spark.implicits._ // provides the $-notation used below

val schema = StructType(
  $"str".string ::
  $"num".int :: Nil)
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
// or even
val schema = new StructType().
  add($"str".string).
  add($"num".int)
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
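And a sketch of the same DSL applied to the question's columns (column names assumed from the question's printSchema output):
val bookSchema = new StructType().
  add($"name".string).
  add($"address".string).
  add($"age".int)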
Encoders
Encoders are so easy to use that it's hard to believe you would not want them, even if only to build a schema without dealing with StructType, StructField and DataType.
// Define a business object that describes your dataset
case class MyRecord(str: String, num: Int)
// Use Encoders object to create a schema off the business object
import org.apache.spark.sql.Encoders
val schema = Encoders.product[MyRecord].schema
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = false)
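A sketch of wiring such a derived schema into the CSV reader, with a hypothetical Book case class mirroring the question's three columns:
// business object describing a row of Book1.csv
case class Book(name: String, address: String, age: Int)
import org.apache.spark.sql.Encoders

val books = spark.
  read.
  option("header", true).
  schema(Encoders.product[Book].schema). // schema derived from the case class
  csv("/home/cloudera/Book1.csv")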
Answered by vdep
There is an inferSchema option to automatically recognize the type of each column:
val df = spark.read
  .format("org.apache.spark.csv")
  .option("header", true)
  .option("inferSchema", true) // <-- HERE
  .csv("/home/cloudera/Book1.csv")
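Note that inferSchema makes an extra pass over the data to sample the column types; assuming the age column holds only numbers, you should then see something like:
scala> df.printSchema
root
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- age: integer (nullable = true)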
spark-csv was originally an external library by Databricks, but it has been included in core Spark from version 2.0 onwards. You can refer to the documentation on the library's GitHub page to find the available options.
Answered by Shiv4nsh
What you can do in this case is use a UDF:
Step 1: Make a UDF that converts String to Int.
import org.apache.spark.sql.functions.udf

// note: unlike cast, value.toInt throws a NumberFormatException on non-numeric input
val stringToIntUDF = udf((value: String) => value.toInt)
Step 2: Apply this UDF to the column that you want to convert!
val updatedDF = df.withColumn("age", stringToIntUDF(df("age")))
updatedDF.printSchema
This should give you your desired result!
If you just want to infer the schema from the CSV file, then @vdep's solution seems to be doing the right thing!
val df = spark.read
  .format("org.apache.spark.csv")
  .option("header", true)
  .option("inferSchema", "true") // <-- HERE
  .csv("/home/cloudera/Book1.csv")