How to construct a Dataframe from an Excel (xls, xlsx) file in Scala Spark?

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/44196741/
Asked by ktheitroadalo
I have a large Excel (xlsx and xls) file with multiple sheets, and I need to convert it to an RDD or Dataframe so that it can be joined to another dataframe later. I was thinking of using Apache POI to save it as a CSV and then read the CSV into a dataframe. But if there is any library or API that can help with this process, it would be easy. Any help is highly appreciated.
Answered by Ramesh Maharjan
The solution to your problem is to use the Spark Excel dependency in your project.

Spark Excel has flexible options to play with.

I have tested the following code to read from excel and convert it to a dataframe, and it just works perfectly:
import org.apache.spark.sql.DataFrame

def readExcel(file: String): DataFrame = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", file)
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("addColorColumns", "false") // lowercase "false", consistent with the other boolean options
  .load()

val data = readExcel("path to your excel file")
data.show(false)
You can give sheetname as an option if your excel workbook has multiple sheets:
.option("sheetName", "Sheet2")
I hope it's helpful.
Answered by Ram Ghadiyaram
Here are read and write examples for reading from and writing into excel with a full set of options...

Source: spark-excel from crealytics

Scala API Spark 2.0+:
Create a DataFrame from an Excel file
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.crealytics.spark.excel")
.option("sheetName", "Daily") // Required
.option("useHeader", "true") // Required
.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
.option("inferSchema", "false") // Optional, default: false
.option("addColorColumns", "true") // Optional, default: false
.option("startColumn", 0) // Optional, default: 0
.option("endColumn", 99) // Optional, default: Int.MaxValue
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.load("Worktime.xlsx")
Write a DataFrame to an Excel file
df.write
.format("com.crealytics.spark.excel")
.option("sheetName", "Daily")
.option("useHeader", "true")
.option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
.option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
.mode("overwrite")
.save("Worktime2.xlsx")
Note: Instead of sheet1 or sheet2 you can use the sheet names as well; in the example given above, Daily is the sheet name.
- If you want to use it from the spark shell...
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:0.9.8
- Dependencies need to be added (in the case of maven etc...):
groupId: com.crealytics
artifactId: spark-excel_2.11
version: 0.9.8
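For an sbt build, the equivalent dependency would look like this (a sketch; adjust the artifact suffix and version to match your Scala and Spark versions):

// build.sbt
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.9.8"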
Tip: This is a very useful approach, particularly for writing maven test cases, where you can place excel sheets with sample data in the src/main/resources folder and access them in your unit test cases (scala/java), creating DataFrame[s] out of the excel sheets... see the sketch below.
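A minimal sketch of such a test, assuming ScalaTest and a hypothetical sample.xlsx placed under src/main/resources:

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical test; the file name and sheet name are placeholders.
class ExcelReadSpec extends AnyFunSuite {
  test("creates a DataFrame from a bundled excel sheet") {
    val spark = SparkSession.builder().master("local[*]").appName("excel-test").getOrCreate()
    val path = getClass.getResource("/sample.xlsx").getPath
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("sheetName", "Sheet1")
      .option("useHeader", "true")
      .load(path)
    assert(df.columns.nonEmpty)
    spark.stop()
  }
}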
- Another option you could consider is spark-hadoopoffice-ds
A Spark datasource for the HadoopOffice library. This Spark datasource assumes at least Spark 2.0.1. However, the HadoopOffice library can also be used directly from Spark 1.x. Currently this datasource supports the following formats of the HadoopOffice library:

Excel Datasource format: org.zuinnote.spark.office.Excel, for loading and saving of old Excel (.xls) and new Excel (.xlsx). This datasource is available on Spark-packages.org and on Maven Central.
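A minimal read sketch with this datasource (the lowercase format string below follows the HadoopOffice examples; verify the exact spelling and any extra options, such as locale or decryption settings, against the project wiki for your version):

// Sketch: load an Excel file via the HadoopOffice Spark datasource.
val hadoopOfficeDf = sqlContext.read
  .format("org.zuinnote.spark.office.excel")
  .load("Worktime.xlsx")
hadoopOfficeDf.show(false)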
Answered by Jörn Franke
Alternatively, you can use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki), which also supports encrypted Excel documents and linked workbooks, among other features. Of course, Spark is also supported.
Answered by svk 041994
I have used the com.crealytics.spark.excel-0.11 version jar and created it in Spark-Java; it would be the same in Scala too, just change JavaSparkContext to SparkContext.
Dataset<Row> tempTable = new SQLContext(javaSparkContext).read()
    .format("com.crealytics.spark.excel")
    .option("sheetName", "sheet1")
    .option("useHeader", "false") // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("inferSchema", "false") // Optional, default: false
    .option("addColorColumns", "false") // Optional, default: false
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .schema(schema) // schema is a predefined StructType
    .load("hdfs://localhost:8020/user/tester/my.xlsx");
Answered by Sakthivel Nachimuthu
Hope this helps.
val df_excel = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("addColorColumns", "false")
  .load(file_path)

display(df_excel) // display() is available in Databricks notebooks; elsewhere use df_excel.show()
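Since the original question wants to join the Excel data with another dataframe later, a closing sketch (the key column "id" and the parquet source are placeholders):

// Sketch: join the Excel-backed DataFrame with any other DataFrame.
val other = spark.read.parquet("path/to/other/data")
val joined = df_excel.join(other, Seq("id"), "inner")
joined.show(false)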

