scala 无法将 Spark SQL DataFrame 写入 S3
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/39273530/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Not able to write Spark SQL DataFrame to S3
提问by Akki
I have installed Spark 2.0 on EC2, and I am using Spark SQL with Scala to retrieve records from DB2 and write them to S3, passing the access keys to the Spark context. The following is my code:
我已经在 EC2 上安装了 Spark 2.0,正在通过 Scala 使用 Spark SQL 从 DB2 检索记录,并希望写入 S3;我已将访问密钥传递给 Spark 上下文。以下是我的代码:
val df = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> ,
    "user" -> username,
    "password" -> password,
    "dbtable" -> tablename,
    "driver" -> "com.ibm.db2.jcc.DB2Driver"))
  .option("query", "SELECT * from tablename limit 10")
  .load()

df.write.save("s3n://data-analytics/spark-db2/data.csv")
And it is throwing the following exception:
它抛出以下异常:
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>1E77C38FA2DB34DA</RequestId><HostId>V4O9sdlbHwfXNFtoQ+Y1XYiPvIL2nTs2PIye5JBqiskMW60yDhHhnBoCHPDxLnTPFuzyKGh1gvM=</HostId></Error>
Caused by: org.jets3t.service.S3ServiceException: Service Error Message.
at org.jets3t.service.S3Service.putObject(S3Service.java:2358)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeEmptyFile(Jets3tNativeFileSystemStore.java:162)
What exactly is the problem here, given that I am also passing the access keys to the SparkContext? Is there any other way to write to S3?
既然我也将访问密钥传递给了 SparkContext,这里到底出了什么问题?还有其他写入 S3 的方法吗?
回答by Tony Fraser
After you get your keys, this is how to write out to S3 from Scala/Spark 2 using s3n.
获得密钥后,可以这样在 Scala/Spark 2 中通过 s3n 写出到 S3。
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "[access key]")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "[secret key]")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
df.write
.mode("overwrite")
.parquet("s3n://bucket/folder/parquet/myFile")
This is how to do it with s3a, which is preferred.
这是使用 s3a 的方法,这是首选。
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "[access key]")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "[secret key]")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
df.write
.mode("overwrite")
.parquet("s3a://bucket/folder/parquet/myFile")
See this post to understand the differences between s3, s3n, and s3a.
请参阅此帖子以了解 s3、s3n 和 s3a 之间的差异。
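Since the question is trying to produce data.csv, the same s3a settings also cover CSV output. A minimal sketch, reusing the bucket from the question with a hypothetical output prefix:
// Assumes the fs.s3a.* settings above have already been applied.
// Spark writes a directory of part files rather than a single data.csv.
df.write
  .mode("overwrite")
  .option("header", "true")
  .csv("s3a://data-analytics/spark-db2/data-csv")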
回答by Kristian
When you create an EC2 instance or an EMR cluster on AWS, you have the option during the creation process to attach an IAM role to that instance or cluster.
在 AWS 上创建 EC2 实例或 EMR 集群时,您可以在创建过程中选择将 IAM 角色附加到该实例或集群。
By default, an EC2 instance is not allowed to connect to S3. You'd need to make a role, and attach it to the instance first.
默认情况下,不允许 EC2 实例连接到 S3。您需要创建一个角色,然后先将其附加到实例。
The purpose of attaching an IAM role is that the role can be granted permissions to use various other AWS services without storing credentials on the instance itself. Given there was an access denied error, I assume that the instance doesn't have an IAM role attached with the permissions required to write to S3.
附加 IAM 角色的意义在于,可以授予该角色使用各种其他 AWS 服务的权限,而无需在实例上存储凭证。鉴于出现了访问被拒绝的错误,我猜测该实例没有附加具备写入 S3 所需权限的 IAM 角色。
Here's how you create a new IAM role:
以下是创建新 IAM 角色的方法:
- Navigate to the AWS Identity and Access Management (IAM) page.
- click on Roles, create a new one.
- Search for S3 in the search bar, and then select S3FullAccess (... or something that looks like that, I can't remember it off the top of my head)
- Add whatever other services you want that role to have, too.
- Save it.
- 导航到 AWS Identity and Access Management (IAM) 页面。
- 单击角色,创建一个新角色。
- 在搜索栏中搜索 S3,然后选择 S3FullAccess(...或类似的东西,我想不起来了)
- 添加您希望该角色拥有的任何其他服务。
- 保存。
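The console steps above can also be scripted. A rough sketch using the AWS SDK for Java (v1) from Scala, assuming the aws-java-sdk-iam dependency is on the classpath and using a hypothetical role name; the AWS-managed policy referred to above is called AmazonS3FullAccess:
import com.amazonaws.services.identitymanagement.AmazonIdentityManagementClientBuilder
import com.amazonaws.services.identitymanagement.model.{AttachRolePolicyRequest, CreateRoleRequest}

val iam = AmazonIdentityManagementClientBuilder.defaultClient()

// Trust policy that lets EC2 instances assume the role.
val trustPolicy =
  """{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}"""

// Create the role (the role name is hypothetical).
iam.createRole(new CreateRoleRequest()
  .withRoleName("spark-s3-writer")
  .withAssumeRolePolicyDocument(trustPolicy))

// Attach the AWS-managed S3 full-access policy to the role.
iam.attachRolePolicy(new AttachRolePolicyRequest()
  .withRoleName("spark-s3-writer")
  .withPolicyArn("arn:aws:iam::aws:policy/AmazonS3FullAccess"))

// Note: the console wires up a matching EC2 instance profile automatically;
// when scripting, you would also create one and add the role to it.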
For a regular old single EC2 instance, click create a new instance:
对于常规旧的单个 EC2 实例,单击创建新实例:
- and in the page of the instance creation steps, where you choose the VPC, and subnet, there is a selectbox for IAM role, click that and choose your newly created role.
- continue and create your instance as you did before. Now that instance has the permissions to write to S3. voila!
- 在实例创建步骤的页面中,您可以在其中选择 VPC 和子网,有一个 IAM 角色选择框,单击它并选择您新创建的角色。
- 继续并像以前一样创建您的实例。现在该实例具有写入 S3 的权限。瞧!
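Once the instance has such a role attached, the Spark job itself no longer needs any access keys. A minimal sketch, assuming a reasonably recent hadoop-aws build is on the classpath (older versions may require setting fs.s3a.aws.credentials.provider explicitly) and a hypothetical output prefix:
// No fs.s3a.access.key / fs.s3a.secret.key needed here: s3a can fall back to
// the EC2 instance-profile credentials supplied by the attached IAM role.
df.write
  .mode("overwrite")
  .parquet("s3a://data-analytics/spark-db2/output")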
For an EMR cluster:
对于 EMR 集群:
- create your EMR cluster, and then navigate to the GUI page where you see your new cluster's details. Find the area on the right that says EMR Role, and then go find that role in your IAM area, and edit it by adding the S3 full permissions.
- Save your changes.
- 创建您的 EMR 集群,然后导航到 GUI 页面,您可以在其中查看新集群的详细信息。找到右侧显示EMR Role 的区域,然后在您的 IAM 区域中找到该角色,并通过添加 S3 完整权限对其进行编辑。
- 保存您的更改。
回答by hitttt
You may try this
你可以试试这个
df.write.mode("append").format("csv").save("path/to/s3/bucket");

