How to export a table dataframe in PySpark to csv?

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/31385363/


Tags: python, apache-spark, dataframe, apache-spark-sql, export-to-csv

Asked by PyRsquared

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns. How do I export the DataFrame "table" to a csv file?

Thanks!

Accepted answer by zero323

If the data frame fits in driver memory and you want to save to the local file system, you can convert the Spark DataFrame to a local Pandas DataFrame using the toPandas method and then simply use to_csv:

df.toPandas().to_csv('mycsv.csv')
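
One caveat: to_csv writes the Pandas row index as an extra first column by default; if you want only the DataFrame's own columns, pass index=False:

df.toPandas().to_csv('mycsv.csv', index=False)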

Otherwise you can use spark-csv:

  • Spark 1.3

    df.save('mycsv.csv', 'com.databricks.spark.csv')
    
  • Spark 1.4+

    df.write.format('com.databricks.spark.csv').save('mycsv.csv')
    
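Note that spark-csv is an external package and has to be on the classpath at runtime; a minimal sketch of writing with a header row, assuming the package has been loaded (the version below is an assumption, match it to your build):

# launch with the package, e.g.: pyspark --packages com.databricks:spark-csv_2.10:1.5.0
# (the package version is an assumption -- pick the one matching your Scala/Spark build)
df.write.format('com.databricks.spark.csv').options(header='true').save('mycsv.csv')
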
In Spark 2.0+ you can use the csv data source directly:

df.write.csv('mycsv.csv')
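
Note that the path names a directory of part files rather than a single file, and writer options go in as keyword arguments; a small sketch with illustrative option values:

# header=True writes a header row; mode='overwrite' replaces any existing output
df.write.csv('mycsv.csv', header=True, mode='overwrite')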

Answer by jbochi

If you cannot use spark-csv, you can do the following:

df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")

If you need to handle strings with linebreaks or commas, that will not work. Use this instead:

import csv
import io

def row2csv(row):
    # csv.writer quotes fields containing commas or linebreaks, so the output stays valid CSV
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([str(s) for s in row])
    buffer.seek(0)
    return buffer.read().strip()

df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")
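
Keep in mind that saveAsTextFile writes a directory, not a single file; with coalesce(1) all rows land in one part file inside it. A minimal sketch of copying that part file out on a local filesystem (paths are illustrative):

import shutil

# with coalesce(1) the data sits in a single part file inside the output directory
shutil.copy("file.csv/part-00000", "merged.csv")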

Answer by Matei Florescu

How about this (if you don't want a one-liner)?

# collect() pulls the whole table to the driver, so this only suits small results
with open("mytable.tsv", "w") as f:  # output path is illustrative
    for row in df.collect():
        d = row.asDict()
        s = "%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"])
        f.write(s)

Here f is the file handle opened above. The separator is a TAB char, but it's easy to change to whatever you want.

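If you would rather not hand-format the separator, the same collect-and-write loop can go through the csv module, which also takes care of quoting; a sketch under the same assumptions (placeholder path and columns):

import csv

with open("mytable.csv", "w", newline="") as f:  # output path is illustrative
    writer = csv.writer(f, delimiter="\t")
    for row in df.collect():
        writer.writerow(row)  # a Row is iterable, so its fields become the CSV columns
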
Answer by Shafiq

For Apache Spark 2+, in order to save the dataframe into a single csv file, use the following command:

query.repartition(1).write.csv("cc_out.csv", sep='|')

Here 1 indicates that I need only one partition of csv output. You can change it according to your requirements.

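To sanity-check the output you can read it back with the matching separator; a quick sketch, where spark is assumed to be an existing SparkSession:

# read the pipe-separated output back and inspect a few rows
checked = spark.read.csv("cc_out.csv", sep='|')
checked.show(5)
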
Answer by Gazal Patel

You need to repartition the DataFrame into a single partition, then define the format, path and other parameters for the file in Unix file system format, and here you go:

df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')

Read more about the repartition function. Read more about the save function.

However, repartition is a costly function and toPandas() is the worst. Try using .coalesce(1) instead of .repartition(1) in the previous syntax for better performance.

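In other words, the last write above becomes (same illustrative path and header option):

df.coalesce(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')
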
Read more on repartition vs coalesce functions.
