How to read parquet data from S3 to a Spark dataframe in Python?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/44629156/


How to read parquet data from S3 to spark dataframe Python?

python apache-spark amazon-s3 pyspark

Asked by Viv

I am new to Spark and I am not able to find this... I have a lot of parquet files uploaded into S3 at location:


s3://a-dps/d-l/sco/alpha/20160930/parquet/

The total size of this folder is 20+ GB. How can I chunk and read this into a dataframe? How can I load all these files into a dataframe?


The memory allocated to the Spark cluster is 6 GB.


    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    import pandas
    # SparkConf().set("spark.jars.packages","org.apache.hadoop:hadoop-aws:3.0.0-alpha3")
    sc = SparkContext.getOrCreate()

    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", 'A')
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", 's')

    sqlContext = SQLContext(sc)
    df2 = sqlContext.read.parquet("s3://sm/data/scor/alpha/2016/parquet/*")

Error:

    Py4JJavaError: An error occurred while calling o33.parquet.
    : java.io.IOException: No FileSystem for scheme: s3
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun.apply(DataSource.scala:372)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun.apply(DataSource.scala:370)
        at scala.collection.TraversableLike$$anonfun$flatMap.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:344)

 

Answered by eliasah

The file schema (s3) that you are using is not correct. You'll need to use the s3n schema or s3a (for bigger S3 objects):


// use sqlContext instead for spark <2 
val df = spark.read 
              .load("s3n://bucket-name/object-path")
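
The same read in PySpark (a minimal sketch; for Spark < 2 use sqlContext, as the comment above notes) would be:

    # PySpark sketch: note the s3a:// scheme in place of s3://
    # ("bucket-name/object-path" is the same placeholder as in the snippet above)
    df = spark.read.parquet("s3a://bucket-name/object-path")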

I suggest that you read more about the Hadoop-AWS module: Integration with Amazon Web Services Overview.

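If the hadoop-aws module is not already on the classpath, one way to pull it in is as a package dependency. A sketch follows; the version shown is an assumption and must match the Hadoop version of your Spark build, and the setting only takes effect if applied before any SparkContext is created (the command-line equivalent is pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3).

    # Sketch: adding the Hadoop-AWS module as a package dependency.
    # The version here is an assumption; use the one matching your Hadoop distribution.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("s3-parquet") \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3") \
        .getOrCreate()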

Answered by Artem Ignatiev

You have to use SparkSession instead of sqlContext since Spark 2.0:


spark = SparkSession.builder \
            .master("local") \
            .appName("app name") \
            .config("spark.some.config.option", True).getOrCreate()

df = spark.read.parquet("s3://path/to/parquet/file.parquet")
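
If the "No FileSystem for scheme: s3" error from the question still appears with this code, the scheme advice from the first answer applies here as well: read the path with s3a:// and, when credentials are not picked up from the environment, set the s3a keys on the Hadoop configuration. A minimal sketch (the key values and the path are placeholders taken from the question):

    # Sketch: s3a credentials set on the Hadoop configuration (placeholder values).
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", "A")
    hadoop_conf.set("fs.s3a.secret.key", "s")

    # Read with the s3a:// scheme instead of s3://
    df = spark.read.parquet("s3a://a-dps/d-l/sco/alpha/20160930/parquet/")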