Python: overwriting a Spark output using PySpark
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/35861099/
overwriting a spark output using pyspark
Asked by Devesh
I am trying to overwrite a Spark dataframe using the following option in PySpark, but I am not successful:
spark_df.write.format('com.databricks.spark.csv').option("header", "true",mode='overwrite').save(self.output_file_path)
The mode='overwrite' option is not successful.
Answered by
Try:
spark_df.write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(self.output_file_path)
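Why the original call fails: in the real PySpark API, option(key, value) accepts exactly one key/value pair, so passing mode='overwrite' as an extra keyword argument raises a TypeError; the save mode must be set through the separate .mode() method, which returns the writer so calls can be chained. A minimal mock (not the real pyspark class, illustration only) sketching that builder-style behavior:

```python
class MockWriter:
    """Minimal mock of PySpark's builder-style DataFrameWriter (illustration only)."""

    def __init__(self):
        self._options = {}
        self._mode = "error"  # Spark's default save mode

    def option(self, key, value):
        # Like the real API, option() accepts exactly one key/value pair,
        # so option("header", "true", mode="overwrite") raises TypeError.
        self._options[key] = value
        return self  # returning self enables method chaining

    def mode(self, save_mode):
        self._mode = save_mode
        return self


writer = MockWriter()
writer.mode("overwrite").option("header", "true")
print(writer._mode)  # overwrite

try:
    MockWriter().option("header", "true", mode="overwrite")
except TypeError:
    print("TypeError")  # the question's call fails for the same reason
```

Chaining works because each setter returns the writer itself; stuffing mode into option() cannot work, since the method signature has no such parameter.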
Answered by Davos
Spark 1.4 and above has a built-in csv function for the DataFrameWriter:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter
e.g.
spark_df.write.csv(path=self.output_file_path, header="true", mode="overwrite", sep="\t")
Which is syntactic sugar for
spark_df.write.format("csv").mode("overwrite").options(header="true", sep="\t").save(path=self.output_file_path)
I think what is confusing is finding where exactly the options are available for each format in the docs.
These write-related methods belong to the DataFrameWriter class:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter
The csv method has these options available, also available when using format("csv"):
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.csv
The way you need to supply parameters also depends on whether the method takes a single (key, value) pair or keyword arguments. It's fairly standard for the way Python works generally, using (*args, **kwargs); it just differs from the Scala syntax.
For example, the option(key, value) method takes one option at a time as a key/value pair, like option("header", "true"), and the .options(**options) method takes a bunch of keyword assignments, e.g. .options(header="true", sep="\t").
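The two calling conventions above can be sketched with plain Python functions (hypothetical stand-ins, not the real pyspark methods) to show how a single key/value signature differs from a **kwargs signature:

```python
def option(key, value):
    """Takes a single key/value pair, mirroring the option(key, value) style."""
    return {key: value}


def options(**opts):
    """Takes keyword assignments via **kwargs, mirroring the options(...) style."""
    return dict(opts)


print(option("header", "true"))          # {'header': 'true'}
print(options(header="true", sep="\t"))  # {'header': 'true', 'sep': '\t'}
```

Both produce the same kind of mapping; the difference is purely in the call syntax, which is why the Scala-flavored option() and the Pythonic options() can coexist on the same writer.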