Spark: read csv file from s3 using scala
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/32470705/
Asked by Edamame
I am writing a Spark job and trying to read a text file using Scala. The following works fine on my local machine.
import scala.io.Source

// myHashMap is assumed to be a mutable map defined elsewhere,
// e.g. a java.util.HashMap[String, Double].
val myFile = "myLocalPath/myFile.csv"
for (line <- Source.fromFile(myFile).getLines()) {
  val data = line.split(",")
  myHashMap.put(data(0), data(1).toDouble)
}
Then I tried to make it work on AWS. I did the following, but it didn't seem to read the entire file properly. What is the proper way to read such a text file from S3? Thanks a lot!
import java.io.{BufferedReader, InputStreamReader}

import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.GetObjectRequest

val credentials = new BasicAWSCredentials("myKey", "mySecretKey")
val s3Client = new AmazonS3Client(credentials)
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myFile.csv"))
val reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()))

// Note: in Scala an assignment evaluates to Unit, so the Java idiom
// `while ((line = reader.readLine()) != null)` does not behave as intended.
var line = reader.readLine()
while (line != null) {
  val data = line.split(",")
  myHashMap.put(data(0), data(1).toDouble)
  println(line)
  line = reader.readLine()
}
Accepted answer by Edamame
I think I got it to work as below:
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"))
val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
for (line <- myData) {
  val data = line.split(",")
  myMap.put(data(0), data(1).toDouble)
}
println(" my map : " + myMap.toString())
Answered by Sarath Avanavu
This can be achieved even without importing the Amazon S3 libraries, by using SparkContext's textFile. Use the code below:
val s3Login = "s3://AccessKey:Securitykey@Externalbucket"
val filePath = s3Login + "/Myfolder/mycsv.csv"

for (line <- sc.textFile(filePath).collect()) {
  val data = line.split(",")
  val value1 = data(0)
  val value2 = data(1).toDouble
}
In the above code, sc.textFile reads the file into an RDD of lines, and collect() brings those lines back to the driver. Inside the loop each line is split on "," into the array data, whose elements you can then access by index.
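Note that a secret key containing a / will break the URL form above. A commonly used alternative (a sketch, assuming the s3n:// connector rather than anything from the original answer) is to pass the credentials through the Hadoop configuration instead:

// Minimal sketch: supply credentials via the Hadoop configuration.
// These property names are for the s3n:// connector; the s3a://
// connector uses fs.s3a.access.key / fs.s3a.secret.key instead.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "AccessKey")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "Securitykey")

val lines = sc.textFile("s3n://Externalbucket/Myfolder/mycsv.csv")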
Answered by Glennie Helles Sindholt
Read in the csv-file with sc.textFile("s3://myBucket/myFile.csv"). That will give you an RDD[String]. Get that into a map:
val data = sc.textFile("s3://myBucket/myFile.csv")

val myHashMap = data.collect
  .map(line => {
    val substrings = line.split(",")
    (substrings(0), substrings(1).toDouble)
  })
  .toMap
You can then use sc.broadcast to broadcast your map, so that it is readily available on all your worker nodes.
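A minimal sketch of that broadcast step, where someRdd is an assumed placeholder RDD[String] of keys:

// Broadcast the map once; every executor receives a read-only copy.
val broadcastMap = sc.broadcast(myHashMap)

// Look values up inside a transformation that runs on the workers.
val values = someRdd.map(key => broadcastMap.value.getOrElse(key, 0.0))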
(Note that you can of course also use the Databricks "spark-csv" package to read in the csv-file if you prefer.)
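For reference, a sketch of that spark-csv route, assuming Spark 1.x with the com.databricks:spark-csv package on the classpath:

// Read the csv file into a DataFrame with the spark-csv data source.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")      // the sample file has no header row
  .option("inferSchema", "true")  // infer column types instead of strings
  .load("s3://myBucket/myFile.csv")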

