Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/45756554/

How to use s3 with Apache spark 2.2 in the Spark shell

scala, apache-spark, amazon-s3

Asked by Shafique Jamal

I'm trying to load data from an Amazon AWS S3 bucket while in the Spark shell.

I have consulted the following resources:

Parsing files from Amazon S3 with Apache Spark

How to access s3a:// files from Apache Spark?

Hortonworks Spark 1.6 and S3

Cloudera

Custom s3 endpoints

I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults I have the following (note: access-key and secret-key are placeholders for my actual credentials):

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key 
spark.hadoop.fs.s3a.secret.key=secret-key
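
For reference, the same S3A properties can also be set at runtime on the shell's Hadoop configuration instead of in conf/spark-defaults; a minimal sketch, using the same placeholder credentials:

// Inside the Spark shell: configure the S3A connector on the live session
sc.hadoopConfiguration.set("fs.s3a.access.key", "access-key")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "secret-key")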

I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:

bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar

In the shell, here is how I try to load data from the S3 bucket:

val p = spark.read.textFile("s3a://sparkcookbook/person")

And here is the error that results:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)

When I instead try to start the Spark shell as follows:

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1

Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:

:: problems summary ::
:::: ERRORS
    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

And here is the second:

val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)

Could someone suggest how to get this working? Thanks.

Answered by himanshuIIITian

If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar. The standard Spark 2.2.0 binaries are built against Hadoop 2.7.x, so the hadoop-aws module (and the AWS SDK version it was compiled against) must match the Hadoop classes bundled with Spark; mixing hadoop-aws 2.8.1 with Hadoop 2.7 classes produces exactly the IllegalAccessError shown above.

$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
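
Alternatively, a sketch assuming access to Maven Central: passing the matching hadoop-aws version via --packages pulls in its aws-java-sdk 1.7.4 dependency transitively, so the jars do not need to be downloaded by hand:

$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3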

After that, you will be able to load data from the S3 bucket in the shell.
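
For example, a quick check, assuming the sparkcookbook bucket from the question is still publicly readable:

// Read the text data and print a few lines to confirm S3 access works
val p = spark.read.textFile("s3a://sparkcookbook/person")
p.take(3).foreach(println)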