How to construct a Dataframe from an Excel (xls, xlsx) file in Scala Spark?

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/44196741/
Asked by ktheitroadalo
I have a large Excel (xlsx and xls) file with multiple sheets, and I need to convert it to an RDD or Dataframe so that it can be joined to another dataframe later. I was thinking of using Apache POI to save it as a CSV and then read the CSV into a dataframe. But if there is any library or API that can help with this process, it would be easy. Any help is highly appreciated.
Answered by Ramesh Maharjan
The solution to your problem is to use the Spark Excel dependency in your project.

Spark Excel has flexible options to play with.

I have tested the following code to read from excel and convert it to a dataframe, and it just works perfectly:
import org.apache.spark.sql.DataFrame

def readExcel(file: String): DataFrame = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", file)
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("addColorColumns", "false") // lowercase "false", consistent with the other boolean options
  .load()

val data = readExcel("path to your excel file")
data.show(false)
You can give sheetname as an option if your excel workbook has multiple sheets:
.option("sheetName", "Sheet2")
I hope it's helpful.
Answered by Ram Ghadiyaram
Here are read and write examples for reading from and writing into excel with a full set of options...

Source: spark-excel from crealytics

Scala API Spark 2.0+:
Create a DataFrame from an Excel file
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.crealytics.spark.excel")
.option("sheetName", "Daily") // Required
.option("useHeader", "true") // Required
.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
.option("inferSchema", "false") // Optional, default: false
.option("addColorColumns", "true") // Optional, default: false
.option("startColumn", 0) // Optional, default: 0
.option("endColumn", 99) // Optional, default: Int.MaxValue
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.load("Worktime.xlsx")
Write a DataFrame to an Excel file
df.write
.format("com.crealytics.spark.excel")
.option("sheetName", "Daily")
.option("useHeader", "true")
.option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
.option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
.mode("overwrite")
.save("Worktime2.xlsx")
Note: Instead of sheet1 or sheet2 you can use the sheet names as well; in the example given above, Daily is the sheet name.
- If you want to use it from the spark shell...
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:0.9.8
- Dependencies need to be added (in the case of maven etc...):
groupId: com.crealytics
artifactId: spark-excel_2.11
version: 0.9.8
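For an sbt build, the equivalent dependency would look like this (a sketch; adjust the artifact suffix and version to match your Scala and Spark versions):

// build.sbt
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.9.8"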
Tip: This is a very useful approach, particularly for writing maven test cases, where you can place excel sheets with sample data in the src/main/resources folder and access them in your unit test cases (scala/java), creating DataFrame[s] out of the excel sheets... see the sketch below.
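A minimal sketch of such a test, assuming ScalaTest and a hypothetical sample.xlsx placed under src/main/resources:

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical test; the file name and sheet name are placeholders.
class ExcelReadSpec extends AnyFunSuite {
  test("creates a DataFrame from a bundled excel sheet") {
    val spark = SparkSession.builder().master("local[*]").appName("excel-test").getOrCreate()
    val path = getClass.getResource("/sample.xlsx").getPath
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("sheetName", "Sheet1")
      .option("useHeader", "true")
      .load(path)
    assert(df.columns.nonEmpty)
    spark.stop()
  }
}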
- Another option you could consider is spark-hadoopoffice-ds
A Spark datasource for the HadoopOffice library. This Spark datasource assumes at least Spark 2.0.1. However, the HadoopOffice library can also be used directly from Spark 1.x. Currently this datasource supports the following formats of the HadoopOffice library:

Excel Datasource format: org.zuinnote.spark.office.Excel, for loading and saving of old Excel (.xls) and new Excel (.xlsx). This datasource is available on Spark-packages.org and on Maven Central.
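A minimal read sketch with this datasource (the lowercase format string below follows the HadoopOffice examples; verify the exact spelling and any extra options, such as locale or decryption settings, against the project wiki for your version):

// Sketch: load an Excel file via the HadoopOffice Spark datasource.
val hadoopOfficeDf = sqlContext.read
  .format("org.zuinnote.spark.office.excel")
  .load("Worktime.xlsx")
hadoopOfficeDf.show(false)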
Answered by Jörn Franke
Alternatively, you can use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki), which also supports encrypted Excel documents and linked workbooks, among other features. Of course, Spark is also supported.
Answered by svk 041994
I have used the com.crealytics.spark.excel-0.11 version jar and created it in Spark-Java; it would be the same in Scala too, just change JavaSparkContext to SparkContext.
Dataset<Row> tempTable = new SQLContext(javaSparkContext).read()
    .format("com.crealytics.spark.excel")
    .option("sheetName", "sheet1")
    .option("useHeader", "false") // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("inferSchema", "false") // Optional, default: false
    .option("addColorColumns", "false") // Optional, default: false
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .schema(schema) // schema is a predefined StructType
    .load("hdfs://localhost:8020/user/tester/my.xlsx");
Answered by Sakthivel Nachimuthu
Hope this helps.
val df_excel = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("addColorColumns", "false")
  .load(file_path)

display(df_excel) // display() is available in Databricks notebooks; elsewhere use df_excel.show()
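Since the original question wants to join the Excel data with another dataframe later, a closing sketch (the key column "id" and the parquet source are placeholders):

// Sketch: join the Excel-backed DataFrame with any other DataFrame.
val other = spark.read.parquet("path/to/other/data")
val joined = df_excel.join(other, Seq("id"), "inner")
joined.show(false)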

