Python: How can I write a parquet file using Spark (pyspark)?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/42022890/


How can I write a parquet file using Spark (pyspark)?

Tags: python, pyspark, spark-dataframe

Asked by ebertbm

I'm pretty new to Spark and I've been trying to convert a DataFrame to a parquet file in Spark, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'


from pyspark import SparkContext
sc = SparkContext("local", "Protob Conversion to Parquet ")

# sc.textFile returns an RDD, not a DataFrame
df = sc.textFile("/temp/proto_temp.csv")

# AttributeError: 'RDD' object has no attribute 'write' is raised here
df.write.parquet("/output/proto.parquet")

Do you know how to make this work?


The Spark version that I'm using is Spark 2.0.1, built for Hadoop 2.7.3.


Answer by ebertbm

The error was due to the fact that the textFile method of SparkContext returns an RDD, and what I needed was a DataFrame.


SparkSession has a SQLContext under the hood, so I needed to use the DataFrameReader to read the CSV file correctly before converting it to a parquet file.


from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
df = spark.read.csv("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.show()

df.write.parquet("output/proto.parquet")

Answer by Powers

You can also write out Parquet files from Spark with Koalas. This library is great for folks who prefer Pandas syntax.


Here's the Koalas code:


import databricks.koalas as ks

# read the CSV into a Koalas DataFrame, then write it back out as Parquet
df = ks.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')

Read this blog post if you'd like more details.
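For what it's worth, Koalas was later folded into Spark itself as pyspark.pandas starting with Spark 3.2, so on newer versions the same two lines need only a different import. A minimal sketch, assuming a Spark 3.2+ runtime:

# on Spark 3.2+, the Koalas API ships with Spark as pyspark.pandas (assumes such a runtime)
import pyspark.pandas as ps

df = ps.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')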
