How to export a table dataframe in PySpark to csv?

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/31385363/


Tags: python, apache-spark, dataframe, apache-spark-sql, export-to-csv

Asked by PyRsquared

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns. How do I export the DataFrame "table" to a csv file?

Thanks!

Accepted answer by zero323

If the data frame fits in driver memory and you want to save to the local file system, you can convert the Spark DataFrame to a local Pandas DataFrame using the toPandas method and then simply use to_csv:

df.toPandas().to_csv('mycsv.csv')
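
One caveat: to_csv writes the Pandas row index as an extra first column by default; if you want only the DataFrame's own columns, pass index=False:

df.toPandas().to_csv('mycsv.csv', index=False)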

Otherwise you can use spark-csv:

  • Spark 1.3

    df.save('mycsv.csv', 'com.databricks.spark.csv')
    
  • Spark 1.4+

    df.write.format('com.databricks.spark.csv').save('mycsv.csv')
    
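Note that spark-csv is an external package and has to be on the classpath at runtime; a minimal sketch of writing with a header row, assuming the package has been loaded (the version below is an assumption, match it to your build):

# launch with the package, e.g.: pyspark --packages com.databricks:spark-csv_2.10:1.5.0
# (the package version is an assumption -- pick the one matching your Scala/Spark build)
df.write.format('com.databricks.spark.csv').options(header='true').save('mycsv.csv')
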
In Spark 2.0+ you can use the csv data source directly:

df.write.csv('mycsv.csv')
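
Note that the path names a directory of part files rather than a single file, and writer options go in as keyword arguments; a small sketch with illustrative option values:

# header=True writes a header row; mode='overwrite' replaces any existing output
df.write.csv('mycsv.csv', header=True, mode='overwrite')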

Answer by jbochi

If you cannot use spark-csv, you can do the following:

df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")

If you need to handle strings with linebreaks or commas, that will not work. Use this instead:

import csv
import io

def row2csv(row):
    # csv.writer quotes fields containing commas or linebreaks, so the output stays valid CSV
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([str(s) for s in row])
    buffer.seek(0)
    return buffer.read().strip()

df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")
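
Keep in mind that saveAsTextFile writes a directory, not a single file; with coalesce(1) all rows land in one part file inside it. A minimal sketch of copying that part file out on a local filesystem (paths are illustrative):

import shutil

# with coalesce(1) the data sits in a single part file inside the output directory
shutil.copy("file.csv/part-00000", "merged.csv")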

Answer by Matei Florescu

How about this (if you don't want a one-liner)?

# collect() pulls the whole table to the driver, so this only suits small results
with open("mytable.tsv", "w") as f:  # output path is illustrative
    for row in df.collect():
        d = row.asDict()
        s = "%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"])
        f.write(s)

Here f is the file handle opened above. The separator is a TAB char, but it's easy to change to whatever you want.

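If you would rather not hand-format the separator, the same collect-and-write loop can go through the csv module, which also takes care of quoting; a sketch under the same assumptions (placeholder path and columns):

import csv

with open("mytable.csv", "w", newline="") as f:  # output path is illustrative
    writer = csv.writer(f, delimiter="\t")
    for row in df.collect():
        writer.writerow(row)  # a Row is iterable, so its fields become the CSV columns
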
Answer by Shafiq

For Apache Spark 2+, in order to save the dataframe into a single csv file, use the following command:

query.repartition(1).write.csv("cc_out.csv", sep='|')

Here 1 indicates that I need only one partition of csv output. You can change it according to your requirements.

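To sanity-check the output you can read it back with the matching separator; a quick sketch, where spark is assumed to be an existing SparkSession:

# read the pipe-separated output back and inspect a few rows
checked = spark.read.csv("cc_out.csv", sep='|')
checked.show(5)
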
Answer by Gazal Patel

You need to repartition the DataFrame into a single partition, then define the format, path and other parameters for the file in Unix file system format, and here you go:

df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')

Read more about the repartition function. Read more about the save function.

However, repartition is a costly function and toPandas() is the worst. Try using .coalesce(1) instead of .repartition(1) in the previous syntax for better performance.

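In other words, the last write above becomes (same illustrative path and header option):

df.coalesce(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')
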
Read more on repartition vs coalesce functions.
