Python: saving a dataframe to a JSON file on the local drive in pyspark

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must license it the same way and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31077165/

saving a dataframe to JSON file on local drive in pyspark

python, json, apache-spark, pyspark

Asked by Jared

I have a dataframe that I am trying to save as a JSON file using pyspark 1.4, but it doesn't seem to be working. When I give it the path to the directory it returns an error stating it already exists. My assumption based on the documentation was that it would save a JSON file in the path that you give it.

df.write.json("C:\Users\username")

Specifying a directory with a name doesn't produce any file and gives an error of "java.io.IOException: Mkdirs failed to create file:/C:Users/username/test/_temporary/....etc". It does, however, create a directory named test which contains several sub-directories with blank crc files.

df.write.json("C:\Users\username\test")

And adding a file extension of JSON produces the same error:

df.write.json("C:\Users\username\test.JSON")

Accepted answer by Wesley Bowman

Could you not just use

df.toJSON()

as shown here? If not, then first transform it into a pandas DataFrame and then write it out as JSON.

pandas_df = df.toPandas()  # collects the whole DataFrame into driver memory
pandas_df.to_json(r"C:\Users\username\test.JSON")  # raw string so \U and \t are not treated as escapes
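For reference, toJSON itself does not write a file: it returns an RDD of JSON strings, one string per row, which is why you either go through pandas (as above) or collect the RDD yourself (as in the next answer). A quick illustration, assuming a small example DataFrame:

rows = df.toJSON().collect()  # e.g. ['{"name":"Alice","age":1}', ...]
print(rows[0])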

Answered by Brobin

I would avoid using write.json, since it causes problems on Windows. Using Python's own file writing should skip creating the temp directories that are giving you issues.

# df.toJSON() returns an RDD of JSON strings, so collect the rows and join
# them before writing (only safe when the data fits in driver memory)
with open(r"C:\Users\username\test.json", "w+") as output_file:
    output_file.write("\n".join(df.toJSON().collect()))

Answered by Shreyak

When working with large data, converting a pyspark dataframe to pandas is not advisable. You can use the command below to save the JSON file in an output directory, where df is a pyspark.sql.dataframe.DataFrame. A part file will be generated inside the output directory by the cluster.

# coalesce(1) merges the partitions so the cluster writes a single part file
df.coalesce(1).write.format('json').save('/your_path/output_directory')
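If the output directory already exists (the error from the original question), the writer's save mode can be set to overwrite, and the single part file can then be copied to a stable filename. A minimal sketch, assuming the same hypothetical paths as above:

import glob
import shutil

# mode('overwrite') replaces the existing directory instead of raising an error
df.coalesce(1).write.mode('overwrite').format('json').save('/your_path/output_directory')

# Spark names its output part-*; copy the single part file to a stable name
part_file = glob.glob('/your_path/output_directory/part-*')[0]
shutil.copy(part_file, '/your_path/output.json')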