How do I read a parquet in PySpark written from Spark?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/42991198/
Asked by Ross Lewis
I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:
partitionedDF.select("noStopWords","lowerText","prediction").write.save("swift2d://xxxx.keystone/commentClusters.parquet")
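For reference, in Spark 2.x .write.save() with no explicit format defaults to parquet (via spark.sql.sources.default), so this does produce a parquet file. A sketch of the same write from PySpark with the format made explicit:

# same write, format spelled out (PySpark sketch)
partitionedDF.select("noStopWords", "lowerText", "prediction") \
    .write.format("parquet") \
    .save("swift2d://xxxx.keystone/commentClusters.parquet")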
I then go to my Python notebook to read in the data:
df = spark.read.load("swift2d://xxxx.keystone/commentClusters.parquet")
and I get the following error:
AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'
I have looked at the spark documentation and I don't think I should be required to specify a schema. Has anyone run into something like this? Should I be doing something else when I save/load? The data is landing in Object Storage.
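For reference, if the schema did have to be supplied manually, as the error message suggests, a minimal sketch would look like the following; the column types here are assumptions based on the three columns written above:

from pyspark.sql.types import StructType, StructField, ArrayType, StringType, IntegerType

# assumed types for the three saved columns (adjust to the real data)
schema = StructType([
    StructField("noStopWords", ArrayType(StringType())),
    StructField("lowerText", StringType()),
    StructField("prediction", IntegerType()),
])
df = spark.read.schema(schema).parquet("swift2d://xxxx.keystone/commentClusters.parquet")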
edit: I'm using Spark 2.0 in both the read and the write.
edit2: This was done in a project in Data Science Experience.
Answered by Jeril
I read the parquet file in the following way:
from pyspark.sql import SparkSession

# initialise the SparkSession (and grab its SparkContext)
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()
sc = spark.sparkContext

# use an SQLContext to read the parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
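Note that on Spark 2.x the SQLContext detour is not needed; the SparkSession reads parquet directly. A sketch, assuming the same path:

# read directly through the session (Spark 2.x); no SQLContext required
df = spark.read.parquet('path-to-file/commentClusters.parquet')
df.printSchema()  # confirm the schema picked up from the parquet metadata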
Answered by himanshuIIITian
You can use the parquet format of SparkSession to read parquet files, like this:
df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")
That said, there is no difference between the parquet and load functions. It might be that load is unable to infer the schema of the data in the file (e.g., some data type that load cannot identify, or one specific to parquet).
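To make the equivalence concrete: parquet is just load with the format fixed. A bare load falls back to the default source (parquet, unless spark.sql.sources.default was changed), so the two reads below should behave identically:

# equivalent reads; both go through the parquet data source
df1 = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")
df2 = spark.read.format("parquet").load("swift2d://xxxx.keystone/commentClusters.parquet")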