Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/45756554/

How to use s3 with Apache spark 2.2 in the Spark shell

scala, apache-spark, amazon-s3

Asked by Shafique Jamal

I'm trying to load data from an Amazon AWS S3 bucket while in the Spark shell.

I have consulted the following resources:

Parsing files from Amazon S3 with Apache Spark

How to access s3a:// files from Apache Spark?

Hortonworks Spark 1.6 and S3

Cloudera

Custom s3 endpoints

I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults I have the following (note: access-key and secret-key are placeholders for my actual credentials):

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key 
spark.hadoop.fs.s3a.secret.key=secret-key
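
For reference, the same S3A properties can also be set at runtime on the shell's Hadoop configuration instead of in conf/spark-defaults; a minimal sketch, using the same placeholder credentials:

// Inside the Spark shell: configure the S3A connector on the live session
sc.hadoopConfiguration.set("fs.s3a.access.key", "access-key")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "secret-key")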

I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:

bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar

In the shell, here is how I try to load data from the S3 bucket:

val p = spark.read.textFile("s3a://sparkcookbook/person")

And here is the error that results:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)

When I instead try to start the Spark shell as follows:

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1

Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:

:: problems summary ::
:::: ERRORS
    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

And here is the second:

val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)

Could someone suggest how to get this working? Thanks.

Answered by himanshuIIITian

If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar. The standard Spark 2.2.0 binaries are built against Hadoop 2.7.x, so the hadoop-aws module (and the AWS SDK version it was compiled against) must match the Hadoop classes bundled with Spark; mixing hadoop-aws 2.8.1 with Hadoop 2.7 classes produces exactly the IllegalAccessError shown above.

$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
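
Alternatively, a sketch assuming access to Maven Central: passing the matching hadoop-aws version via --packages pulls in its aws-java-sdk 1.7.4 dependency transitively, so the jars do not need to be downloaded by hand:

$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3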

After that, you will be able to load data from the S3 bucket in the shell.
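
For example, a quick check, assuming the sparkcookbook bucket from the question is still publicly readable:

// Read the text data and print a few lines to confirm S3 access works
val p = spark.read.textFile("s3a://sparkcookbook/person")
p.take(3).foreach(println)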