
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35861099/

Posted: 2020-08-19 17:05:05 | Source: igfitidea

overwriting a spark output using pyspark

Tags: python, apache-spark, pyspark

Asked by Devesh

I am trying to overwrite a Spark dataframe using the following option in PySpark but I am not successful


spark_df.write.format('com.databricks.spark.csv').option("header", "true",mode='overwrite').save(self.output_file_path)

the mode=overwrite command is not successful


Answer

Try:


spark_df.write.format('com.databricks.spark.csv') \
  .mode('overwrite').option("header", "true").save(self.output_file_path)
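The reason the original call fails is that option() accepts exactly one key/value pair, so mode='overwrite' passed into it is an unexpected keyword argument rather than a save mode. A minimal sketch of this, using a hypothetical stand-in class (not the real pyspark DataFrameWriter):

```python
# Hypothetical stand-in for pyspark's DataFrameWriter (not the real class),
# sketched to show why the original call raised an error: option() accepts
# exactly one key/value pair, so mode= is an unexpected keyword argument.
class FakeWriter:
    def option(self, key, value):
        return self  # chainable, like the real API

    def mode(self, save_mode):  # 'overwrite', 'append', 'ignore', 'error'
        return self

w = FakeWriter()
try:
    w.option("header", "true", mode="overwrite")  # mimics the failing call
except TypeError as err:
    print("option() rejects mode=:", err)

w.mode("overwrite").option("header", "true")  # the working chain
```

The save mode has its own mode() method on the writer, which is why chaining .mode('overwrite') before save() is the fix.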

Answered by Davos

Spark 1.4 and above has a built-in csv function for the DataFrameWriter


https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter


e.g.


spark_df.write.csv(path=self.output_file_path, header="true", mode="overwrite", sep="\t")

Which is syntactic sugar for


spark_df.write.format("csv").mode("overwrite").options(header="true",sep="\t").save(path=self.output_file_path)

I think what is confusing is finding where exactly the options are available for each format in the docs.


These write-related methods belong to the DataFrameWriter class: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter


The csv method has these options available, which are also available when using format("csv"): https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.csv


The way you need to supply parameters also depends on whether the method takes a single (key, value) pair or keyword args. This is fairly standard for how Python generally works, using (*args, **kwargs); it just differs from the Scala syntax.


For example, the option(key, value) method takes one option at a time, like option("header", "true"), while the .options(**options) method takes a bunch of keyword assignments, e.g. .options(header="true", sep="\t")

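The two calling conventions can be sketched with a tiny hypothetical builder class (WriterSketch is an illustration, not the real pyspark class):

```python
# Hypothetical minimal sketch (not the real pyspark DataFrameWriter) of the
# two calling conventions: option() takes one key/value pair per call, while
# options() collects many keyword assignments at once.
class WriterSketch:
    def __init__(self):
        self._opts = {}

    def option(self, key, value):
        self._opts[key] = value
        return self  # chainable

    def options(self, **opts):
        self._opts.update(opts)
        return self

w = WriterSketch()
w.option("header", "true").options(sep="\t")
print(w._opts)  # {'header': 'true', 'sep': '\t'}
```

Both styles accumulate into the same set of options on the writer, which is why mixing option() and options() in one chain works in practice.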