Scala Spark-SQL: How to read a TSV or CSV file into a dataframe and apply a custom schema?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/43508054/
Spark-SQL : How to read a TSV or CSV file into dataframe and apply a custom schema?
Asked by stackoverflowuser2010
I'm using Spark 2.0 while working with tab-separated value (TSV) and comma-separated value (CSV) files. I want to load the data into Spark-SQL dataframes, where I would like to control the schema completely when the files are read. I don't want Spark to guess the schema from the data in the file.
How would I load TSV or CSV files into Spark SQL Dataframes and apply a schema to them?
Answered by stackoverflowuser2010
Below is a complete Spark 2.0 example of loading a tab-separated value (TSV) file and applying a schema.
I'm using the Iris data set in TSV format from UAH.edu as an example. Here are the first few rows from that file:
Type PW PL SW SL
0 2 14 33 50
1 24 56 31 67
1 23 51 31 69
0 2 10 36 46
1 20 52 30 65
To enforce a schema, you can programmatically build it using one of two methods:
A. Create the schema with StructType:
import org.apache.spark.sql.types._

val irisSchema = StructType(Array(
  StructField("Type",        IntegerType, true),  // third argument: nullable
  StructField("PetalWidth",  IntegerType, true),
  StructField("PetalLength", IntegerType, true),
  StructField("SepalWidth",  IntegerType, true),
  StructField("SepalLength", IntegerType, true)
))
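You can sanity-check the schema you just built with printTreeString(), a standard method on StructType (this quick check is my addition, not part of the original answer):

irisSchema.printTreeString()
// root
//  |-- Type: integer (nullable = true)
//  |-- PetalWidth: integer (nullable = true)
//  |-- PetalLength: integer (nullable = true)
//  |-- SepalWidth: integer (nullable = true)
//  |-- SepalLength: integer (nullable = true)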
B. Alternatively, create the schema with a case class and Encoders (this approach is less verbose):
import org.apache.spark.sql.Encoders

case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
                      SepalWidth: Int, SepalLength: Int)

val irisSchema = Encoders.product[IrisSchema].schema
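As an aside that is not in the original answer: on Spark 2.3 and later, DataFrameReader.schema() also accepts a DDL-formatted string, so the schema can be written inline. A minimal sketch, assuming a SparkSession named spark as in the complete example below:

// Hedged sketch, assuming Spark 2.3+: schema() also accepts a DDL string.
val irisDf = spark.read.
  option("header", "true").
  option("delimiter", "\t").
  schema("Type INT, PetalWidth INT, PetalLength INT, SepalWidth INT, SepalLength INT").
  csv("iris.tsv")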
Once you have created your schema, you can use spark.read to read in the TSV file. Note that you can also read comma-separated value (CSV) files, or any other delimited files, as long as you set the option("delimiter", d) option correctly. Further, if your data file has a header line, be sure to set option("header", "true").
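Because the schema is supplied rather than inferred, some input rows may fail to conform to it. The CSV reader's mode option controls how such rows are treated; the sketch below (my addition, not covered in the original answer) assumes the irisSchema built above:

// Sketch: how the CSV parser treats rows that don't match the supplied schema.
//   "PERMISSIVE"    (default) puts null in fields it cannot parse
//   "DROPMALFORMED" silently drops malformed rows
//   "FAILFAST"      throws an exception on the first malformed row
val strictDf = spark.read.format("csv").
  option("header", "true").
  option("delimiter", "\t").
  option("mode", "FAILFAST").
  schema(irisSchema).
  load("iris.tsv")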
Below is the complete final code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoders

val spark = SparkSession.builder().getOrCreate()

case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
                      SepalWidth: Int, SepalLength: Int)

val irisSchema = Encoders.product[IrisSchema].schema

val irisDf = spark.read.format("csv"). // Use format "csv" whether the file is TSV or CSV.
  option("header", "true").            // Does the file have a header line?
  option("delimiter", "\t").           // Set the delimiter to tab (or comma for CSV).
  schema(irisSchema).                  // The schema that was built above.
  load("iris.tsv")

irisDf.show(5)
And here is the output:
scala> irisDf.show(5)
+----+----------+-----------+----------+-----------+
|Type|PetalWidth|PetalLength|SepalWidth|SepalLength|
+----+----------+-----------+----------+-----------+
| 0| 2| 14| 33| 50|
| 1| 24| 56| 31| 67|
| 1| 23| 51| 31| 69|
| 0| 2| 10| 36| 46|
| 1| 20| 52| 30| 65|
+----+----------+-----------+----------+-----------+
only showing top 5 rows
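Since the schema in method B came from a case class, you can also go one step further and read the file as a strongly typed Dataset[IrisSchema] instead of a plain DataFrame. A minimal sketch (my addition, not part of the original answer), assuming the same spark session, case class, and irisSchema from the complete example above:

import spark.implicits._  // brings the Encoder for IrisSchema into scope

val irisDs = spark.read.format("csv").
  option("header", "true").
  option("delimiter", "\t").
  schema(irisSchema).
  load("iris.tsv").
  as[IrisSchema]          // Dataset[IrisSchema] instead of DataFrame

irisDs.filter(_.PetalLength > 50).show(5)  // field access is checked at compile time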

