Scala Spark-SQL: How to read a TSV or CSV file into a dataframe and apply a custom schema?

Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not the translator): StackOverflow, original question: http://stackoverflow.com/questions/43508054/

Date: 2020-10-22 09:11:43 · Source: igfitidea

Spark-SQL : How to read a TSV or CSV file into dataframe and apply a custom schema?

Tags: scala, apache-spark, apache-spark-sql, spark-dataframe

Asked by stackoverflowuser2010

I'm using Spark 2.0 while working with tab-separated value (TSV) and comma-separated value (CSV) files. I want to load the data into Spark-SQL dataframes, where I would like to control the schema completely when the files are read. I don't want Spark to guess the schema from the data in the file.

How would I load TSV or CSV files into Spark SQL Dataframes and apply a schema to them?

Answered by stackoverflowuser2010

Below is a complete Spark 2.0 example of loading a tab-separated value (TSV) file and applying a schema.

I'm using the Iris data set in TSV format from UAH.edu as an example. Here are the first few rows from that file:

Type    PW      PL      SW      SL
0       2       14      33      50
1       24      56      31      67
1       23      51      31      69
0       2       10      36      46
1       20      52      30      65

To enforce a schema, you can programmatically build it using one of two methods:

A. Create the schema with StructType:

import org.apache.spark.sql.types._

val irisSchema = StructType(Array(
    StructField("Type",         IntegerType, true),
    StructField("PetalWidth",   IntegerType, true),
    StructField("PetalLength",  IntegerType, true),
    StructField("SepalWidth",   IntegerType, true),
    StructField("SepalLength",  IntegerType, true)
    ))

B. Alternatively, create the schema with a case class and Encoders (this approach is less verbose):

import org.apache.spark.sql.Encoders

case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int, 
                      SepalWidth: Int, SepalLength: Int)

val irisSchema = Encoders.product[IrisSchema].schema
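To see exactly what Encoders derived, the schema can be printed; a quick sketch (note that Spark typically marks primitive Int fields as non-nullable here, unlike the explicit nullable = true flags in approach A):

```scala
import org.apache.spark.sql.Encoders

case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
                      SepalWidth: Int, SepalLength: Int)

val irisSchema = Encoders.product[IrisSchema].schema

// Print the derived schema tree, one line per field with its type and nullability.
irisSchema.printTreeString()

// The field names match the case-class parameter names.
println(irisSchema.fieldNames.mkString(", "))
```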

Once you have created your schema, you can use spark.read to read in the TSV file. Note that you can also read comma-separated value (CSV) files, or any delimited files, as long as you set the option("delimiter", d) option correctly. Further, if you have a data file that has a header line, be sure to set option("header", "true").

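As an illustration of the delimiter option, the same schema could read a comma-separated version of the file by changing only the delimiter. This is a sketch; iris.csv is a hypothetical CSV copy of the data, not a file from the original answer:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().getOrCreate()

case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
                      SepalWidth: Int, SepalLength: Int)

// Only the delimiter changes between the TSV and CSV reads.
val irisCsvDf = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", ",")                      // comma instead of "\t"
  .schema(Encoders.product[IrisSchema].schema)
  .load("iris.csv")                              // hypothetical CSV copy of iris.tsv
```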
Below is the complete final code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoders

val spark = SparkSession.builder().getOrCreate()

case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
                      SepalWidth: Int, SepalLength: Int)

val irisSchema = Encoders.product[IrisSchema].schema

val irisDf = spark.read.format("csv").     // Use "csv" regardless of TSV or CSV.
                option("header", "true").  // Does the file have a header line?
                option("delimiter", "\t"). // Set delimiter to tab or comma.
                schema(irisSchema).        // Schema that was built above.
                load("iris.tsv")

irisDf.show(5)

And here is the output:

scala> irisDf.show(5)
+----+----------+-----------+----------+-----------+
|Type|PetalWidth|PetalLength|SepalWidth|SepalLength|
+----+----------+-----------+----------+-----------+
|   0|         2|         14|        33|         50|
|   1|        24|         56|        31|         67|
|   1|        23|         51|        31|         69|
|   0|         2|         10|        36|         46|
|   1|        20|         52|        30|         65|
+----+----------+-----------+----------+-----------+
only showing top 5 rows
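Because the schema came from a case class, the same class can also turn the dataframe into a typed Dataset[IrisSchema]. The following is a sketch building on the code above, not part of the original answer:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
                      SepalWidth: Int, SepalLength: Int)

// Read as before, then view each row through the case class.
val irisDs = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .schema(Encoders.product[IrisSchema].schema)
  .load("iris.tsv")
  .as[IrisSchema]

// Typed operations are checked at compile time against the field names.
irisDs.filter(_.Type == 1).show(3)
```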