Reading an Avro File in Spark (Scala)
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/45360359/
Reading Avro File in Spark
Asked by Gayatri
I have read an Avro file into a Spark RDD and need to convert it into a SQL DataFrame. How do I do that?
This is what I have done so far.
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
val path = "hdfs://dds-nameservice/user/ghagh/"
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
When I run:
avroRDD.take(1)
I get back:
res1: Array[(org.apache.avro.mapred.AvroWrapper[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)] = Array(({"column1": "value1", "column2": "value2", "column3": value3,...
How do I convert this to a Spark SQL DataFrame?
I am using Spark 1.6.
Can anyone tell me if there is an easy solution for this?
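For reference, each element of the pair RDD loaded above wraps a GenericRecord; the record itself is reached through the wrapper's datum() method. A small sketch of inspecting the first record (the field name "column1" is a placeholder for whatever the file's schema actually defines):

```scala
// Each element is (AvroWrapper[GenericRecord], NullWritable); the Avro
// record is obtained with datum(). "column1" stands in for a real field.
val firstRecord = avroRDD.first()._1.datum()
println(firstRecord.getSchema)       // the Avro schema of the file
println(firstRecord.get("column1"))  // access a single field by name
```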
Answered by Alper t. Turker
For a DataFrame I'd go with the Avro data source directly:
Include spark-avro in the packages list. For the latest version use:
com.databricks:spark-avro_2.11:3.2.0
Load the file:
val df = spark.read
  .format("com.databricks.spark.avro")
  .load(path)
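One caveat: the asker is on Spark 1.6, where there is no SparkSession; the entry point for the data source API is sqlContext, and the spark-avro artifact version must match the Spark major version (the 2.0.x line of spark-avro is the one built against Spark 1.x, while 3.x targets Spark 2.x). A minimal sketch under those assumptions, reusing the HDFS path from the question:

```scala
// Spark 1.6: read through sqlContext instead of spark (SparkSession is
// a Spark 2.x API). spark-avro 2.0.1 is assumed here for the 1.x line.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("hdfs://dds-nameservice/user/ghagh/")

df.printSchema()
df.show(5)
```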
Answered by Manoj Kumar Dhakad
If your project uses Maven, add the latest dependency below to pom.xml:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>4.0.0</version>
</dependency>
After that you can read the Avro file as below:
val df = spark.read
  .format("com.databricks.spark.avro")
  .load("C:/Users/alice/inputs/sample_data.avro")
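If the RDD already loaded with hadoopFile in the question must be converted directly (without re-reading through the data source), one sketch is to pull each GenericRecord out of its wrapper and build Rows against an explicit schema. The field names and string types below are assumptions; the file's real Avro schema drives what you would actually use:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumed column names; replace with the fields of the real Avro schema.
val fields = Seq("column1", "column2", "column3")
val schema = StructType(fields.map(StructField(_, StringType, nullable = true)))

// AvroWrapper.datum() yields the GenericRecord; convert each field to a
// string (null-safe) and pack the values into a Row.
val rowRDD = avroRDD.map { case (wrapper, _) =>
  val record: GenericRecord = wrapper.datum()
  Row.fromSeq(fields.map(f => Option(record.get(f)).map(_.toString).orNull))
}

// Spark 1.6: sqlContext; on Spark 2.x use spark.createDataFrame instead.
val df2 = sqlContext.createDataFrame(rowRDD, schema)
df2.show()
```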

