How do I read a parquet in PySpark written from Spark?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/42991198/
Asked by Ross Lewis
I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:
partitionedDF.select("noStopWords","lowerText","prediction").write.save("swift2d://xxxx.keystone/commentClusters.parquet")
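For reference, in Spark 2.x .write.save() with no explicit format defaults to parquet (via spark.sql.sources.default), so this does produce a parquet file. A sketch of the same write from PySpark with the format made explicit:

# same write, format spelled out (PySpark sketch)
partitionedDF.select("noStopWords", "lowerText", "prediction") \
    .write.format("parquet") \
    .save("swift2d://xxxx.keystone/commentClusters.parquet")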
I then go to my Python notebook to read in the data:
df = spark.read.load("swift2d://xxxx.keystone/commentClusters.parquet")
and I get the following error:
AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'
I have looked at the spark documentation and I don't think I should be required to specify a schema. Has anyone run into something like this? Should I be doing something else when I save/load? The data is landing in Object Storage.
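For reference, if the schema did have to be supplied manually, as the error message suggests, a minimal sketch would look like the following; the column types here are assumptions based on the three columns written above:

from pyspark.sql.types import StructType, StructField, ArrayType, StringType, IntegerType

# assumed types for the three saved columns (adjust to the real data)
schema = StructType([
    StructField("noStopWords", ArrayType(StringType())),
    StructField("lowerText", StringType()),
    StructField("prediction", IntegerType()),
])
df = spark.read.schema(schema).parquet("swift2d://xxxx.keystone/commentClusters.parquet")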
edit: I'm using Spark 2.0 in both the read and the write.
edit2: This was done in a project in Data Science Experience.
Answered by Jeril
I read the parquet file in the following way:
from pyspark.sql import SparkSession

# initialise the SparkSession (and grab its SparkContext)
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()
sc = spark.sparkContext

# use an SQLContext to read the parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
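Note that on Spark 2.x the SQLContext detour is not needed; the SparkSession reads parquet directly. A sketch, assuming the same path:

# read directly through the session (Spark 2.x); no SQLContext required
df = spark.read.parquet('path-to-file/commentClusters.parquet')
df.printSchema()  # confirm the schema picked up from the parquet metadata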
Answered by himanshuIIITian
You can use the parquet format of SparkSession to read parquet files, like this:
df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")
That said, there is no difference between the parquet and load functions. It might be that load is unable to infer the schema of the data in the file (e.g., some data type that load cannot identify, or one specific to parquet).
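To make the equivalence concrete: parquet is just load with the format fixed. A bare load falls back to the default source (parquet, unless spark.sql.sources.default was changed), so the two reads below should behave identically:

# equivalent reads; both go through the parquet data source
df1 = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")
df2 = spark.read.format("parquet").load("swift2d://xxxx.keystone/commentClusters.parquet")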