Spark: read csv file from s3 using scala
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/32470705/
Asked by Edamame
I am writing a Spark job and trying to read a text file using Scala. The following works fine on my local machine.
import scala.io.Source

// myHashMap is assumed to be a mutable map defined elsewhere,
// e.g. a java.util.HashMap[String, Double].
val myFile = "myLocalPath/myFile.csv"
for (line <- Source.fromFile(myFile).getLines()) {
  val data = line.split(",")
  myHashMap.put(data(0), data(1).toDouble)
}
Then I tried to make it work on AWS. I did the following, but it didn't seem to read the entire file properly. What is the proper way to read such a text file from S3? Thanks a lot!
import java.io.{BufferedReader, InputStreamReader}

import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.GetObjectRequest

val credentials = new BasicAWSCredentials("myKey", "mySecretKey")
val s3Client = new AmazonS3Client(credentials)
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myFile.csv"))
val reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()))

// Note: in Scala an assignment evaluates to Unit, so the Java idiom
// `while ((line = reader.readLine()) != null)` does not behave as intended.
var line = reader.readLine()
while (line != null) {
  val data = line.split(",")
  myHashMap.put(data(0), data(1).toDouble)
  println(line)
  line = reader.readLine()
}
Accepted answer by Edamame
I think I got it to work as below:
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"))
val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
for (line <- myData) {
  val data = line.split(",")
  myMap.put(data(0), data(1).toDouble)
}
println(" my map : " + myMap.toString())
Answered by Sarath Avanavu
This can be achieved even without importing the Amazon S3 libraries, by using SparkContext's textFile. Use the code below:
val s3Login = "s3://AccessKey:Securitykey@Externalbucket"
val filePath = s3Login + "/Myfolder/mycsv.csv"

for (line <- sc.textFile(filePath).collect()) {
  val data = line.split(",")
  val value1 = data(0)
  val value2 = data(1).toDouble
}
In the above code, sc.textFile reads the file into an RDD of lines, and collect() brings those lines back to the driver. Inside the loop each line is split on "," into the array data, whose elements you can then access by index.
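Note that a secret key containing a / will break the URL form above. A commonly used alternative (a sketch, assuming the s3n:// connector rather than anything from the original answer) is to pass the credentials through the Hadoop configuration instead:

// Minimal sketch: supply credentials via the Hadoop configuration.
// These property names are for the s3n:// connector; the s3a://
// connector uses fs.s3a.access.key / fs.s3a.secret.key instead.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "AccessKey")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "Securitykey")

val lines = sc.textFile("s3n://Externalbucket/Myfolder/mycsv.csv")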
Answered by Glennie Helles Sindholt
Read in the csv-file with sc.textFile("s3://myBucket/myFile.csv"). That will give you an RDD[String]. Get that into a map:
val data = sc.textFile("s3://myBucket/myFile.csv")

val myHashMap = data.collect
  .map(line => {
    val substrings = line.split(",")
    (substrings(0), substrings(1).toDouble)
  })
  .toMap
You can then use sc.broadcast to broadcast your map, so that it is readily available on all your worker nodes.
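A minimal sketch of that broadcast step, where someRdd is an assumed placeholder RDD[String] of keys:

// Broadcast the map once; every executor receives a read-only copy.
val broadcastMap = sc.broadcast(myHashMap)

// Look values up inside a transformation that runs on the workers.
val values = someRdd.map(key => broadcastMap.value.getOrElse(key, 0.0))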
(Note that you can of course also use the Databricks "spark-csv" package to read in the csv-file if you prefer.)
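For reference, a sketch of that spark-csv route, assuming Spark 1.x with the com.databricks:spark-csv package on the classpath:

// Read the csv file into a DataFrame with the spark-csv data source.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")      // the sample file has no header row
  .option("inferSchema", "true")  // infer column types instead of strings
  .load("s3://myBucket/myFile.csv")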

