Original question: http://stackoverflow.com/questions/45756554/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): Stack Overflow
How to use S3 with Apache Spark 2.2 in the Spark shell
Asked by Shafique Jamal
I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.
I have consulted the following resources:
Parsing files from Amazon S3 with Apache Spark
How to access s3a:// files from Apache Spark?
I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults I have the following (note I replaced access-key and secret-key):
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key
spark.hadoop.fs.s3a.secret.key=secret-key
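The same S3A settings can also be applied from inside the shell on the SparkContext's Hadoop configuration instead of conf/spark-defaults; a minimal sketch, assuming the same placeholder access-key and secret-key:

// Equivalent to the spark.hadoop.* entries in conf/spark-defaults above
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.set("fs.s3a.access.key", "access-key")
hc.set("fs.s3a.secret.key", "secret-key")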
I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:
bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar
In the shell, here is how I try to load data from the S3 bucket:
val p = spark.read.textFile("s3a://sparkcookbook/person")
And here is the error that results:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
When I instead try to start the Spark shell as follows:
bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1
Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:
:: problems summary ::
:::: ERRORS
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
And here is the second:
val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)
Could someone suggest how to get this working? Thanks.
Answer by himanshuIIITian
If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar. The Spark 2.2.0 prebuilt binaries bundle Hadoop 2.7.x client classes, while hadoop-aws-2.8.1 is compiled against Hadoop 2.8 internals (such as GlobalStorageStatistics and a changed MutableCounterLong constructor), which is exactly what the NoClassDefFoundError and IllegalAccessError above are complaining about.
$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
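If you are unsure which Hadoop line your Spark distribution bundles, one way to check (a quick diagnostic, assuming the prebuilt Spark 2.2.0 download) is to ask from inside the Spark shell; it should report a 2.7.x version:

// Prints the Hadoop version that the classes on the Spark classpath belong to
org.apache.hadoop.util.VersionInfo.getVersion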
After that, you will be able to load data from the S3 bucket in the shell.
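With matching jars on the classpath, repeating the read from the question is a quick way to verify the setup (assuming your credentials can read the bucket):

val p = spark.read.textFile("s3a://sparkcookbook/person")
p.show(5)  // print the first few lines to confirm the S3A filesystem works

Alternatively, --packages with the matching hadoop-aws version should pull in the correct aws-java-sdk transitively, since hadoop-aws-2.7.3 declares aws-java-sdk-1.7.4 as a dependency:

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3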

