Python: How can I write a parquet file using Spark (pyspark)?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/42022890/


How can I write a parquet file using Spark (pyspark)?

Tags: python, pyspark, spark-dataframe

Asked by ebertbm

I'm pretty new to Spark and I've been trying to convert a DataFrame to a parquet file in Spark, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'


from pyspark import SparkContext
sc = SparkContext("local", "Protob Conversion to Parquet ")

# sc.textFile returns an RDD, not a DataFrame
df = sc.textFile("/temp/proto_temp.csv")

# AttributeError: 'RDD' object has no attribute 'write' is raised here
df.write.parquet("/output/proto.parquet")

Do you know how to make this work?


The Spark version that I'm using is Spark 2.0.1, built for Hadoop 2.7.3.


Answer by ebertbm

The error was due to the fact that the textFile method of SparkContext returns an RDD, and what I needed was a DataFrame.


SparkSession has a SQLContext under the hood, so I needed to use the DataFrameReader to read the CSV file correctly before converting it to a parquet file.


from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
df = spark.read.csv("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.show()

df.write.parquet("output/proto.parquet")

Answer by Powers

You can also write out Parquet files from Spark with Koalas. This library is great for folks who prefer Pandas syntax.


Here's the Koalas code:


import databricks.koalas as ks

# read the CSV into a Koalas DataFrame, then write it back out as Parquet
df = ks.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')

Read this blog post if you'd like more details.
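For what it's worth, Koalas was later folded into Spark itself as pyspark.pandas starting with Spark 3.2, so on newer versions the same two lines need only a different import. A minimal sketch, assuming a Spark 3.2+ runtime:

# on Spark 3.2+, the Koalas API ships with Spark as pyspark.pandas (assumes such a runtime)
import pyspark.pandas as ps

df = ps.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')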
