
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/30792494/

Date: 2020-10-22 07:15:09  Source: igfitidea

Read ORC files directly from Spark shell

Tags: scala, hadoop, apache-spark, hive, pyspark

Asked by mslick3

I am having issues reading an ORC file directly from the Spark shell. Note: I am running Hadoop 1.2 and Spark 1.2, using the pyspark shell, but I can also use spark-shell (which runs Scala).


I have been following this resource: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html.


from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)

inputRead = sc.hadoopFile("hdfs://user@server:/file_path",
classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],
classOf[outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat])

I get an error, generally complaining about wrong syntax. At one point the code seemed to work when I passed only the first of the three arguments to hadoopFile, but when I tried to use


inputRead.first()

the output was RDD[Nothing, Nothing]. I don't know whether this is because the inputRead variable was not created as an RDD, or whether it was not created at all.


I appreciate any help!

我感谢任何帮助!
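For context, the `RDD[Nothing, Nothing]` output is what Scala infers when `sc.hadoopFile` is called with only a path and no key/value type information. A sketch of the fully-typed call in spark-shell (Scala) is below; it assumes the mapred-based ORC classes from Hive are on the classpath and a running cluster, so it is untested here:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat, OrcStruct}

// Supplying the key, value, and InputFormat type parameters lets Scala
// infer RDD[(NullWritable, OrcStruct)] instead of RDD[(Nothing, Nothing)].
val inputRead = sc.hadoopFile[NullWritable, OrcStruct, OrcInputFormat](
  "hdfs://user@server:/file_path")
```

In pyspark, `sc.hadoopFile` takes the class names as strings rather than the Scala `classOf[...]` syntax, which is why the code in the question raises a syntax error.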

Answered by Sudheer Palyam

In Spark 1.5, I'm able to load my ORC file as:


val orcfile = "hdfs:///ORC_FILE_PATH"
val df = sqlContext.read.format("orc").load(orcfile)
df.show

Answered by Suman M

You can try this code; it works for me.


val LoadOrc = spark.read.option("inferSchema", true).orc("filepath")
LoadOrc.show()

Answered by UserszrKs

You can also pass multiple paths to read from:


val df = sqlContext.read.format("orc").load(
  "hdfs://localhost:8020/user/aks/input1/*",
  "hdfs://localhost:8020/aks/input2/*/part-r-*.orc")