Python: saving a dataframe to a JSON file on the local drive in pyspark

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must license it the same way and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31077165/

saving a dataframe to JSON file on local drive in pyspark

python, json, apache-spark, pyspark

Asked by Jared

I have a dataframe that I am trying to save as a JSON file using pyspark 1.4, but it doesn't seem to be working. When I give it the path to the directory it returns an error stating it already exists. My assumption based on the documentation was that it would save a JSON file in the path that you give it.

df.write.json("C:\Users\username")

Specifying a directory with a name doesn't produce any file and gives an error of "java.io.IOException: Mkdirs failed to create file:/C:Users/username/test/_temporary/....etc". It does, however, create a directory named test which contains several sub-directories with blank crc files.

df.write.json("C:\Users\username\test")

And adding a file extension of JSON produces the same error:

df.write.json("C:\Users\username\test.JSON")

Accepted answer by Wesley Bowman

Could you not just use

df.toJSON()

as shown here? If not, then first transform it into a pandas DataFrame and then write it out as JSON.

pandas_df = df.toPandas()  # collects the whole DataFrame into driver memory
pandas_df.to_json(r"C:\Users\username\test.JSON")  # raw string so \U and \t are not treated as escapes
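For reference, toJSON itself does not write a file: it returns an RDD of JSON strings, one string per row, which is why you either go through pandas (as above) or collect the RDD yourself (as in the next answer). A quick illustration, assuming a small example DataFrame:

rows = df.toJSON().collect()  # e.g. ['{"name":"Alice","age":1}', ...]
print(rows[0])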

Answered by Brobin

I would avoid using write.json, since it causes problems on Windows. Using Python's own file writing should skip creating the temp directories that are giving you issues.

# df.toJSON() returns an RDD of JSON strings, so collect the rows and join
# them before writing (only safe when the data fits in driver memory)
with open(r"C:\Users\username\test.json", "w+") as output_file:
    output_file.write("\n".join(df.toJSON().collect()))

Answered by Shreyak

When working with large data, converting a pyspark dataframe to pandas is not advisable. You can use the command below to save the JSON file in an output directory, where df is a pyspark.sql.dataframe.DataFrame. A part file will be generated inside the output directory by the cluster.

# coalesce(1) merges the partitions so the cluster writes a single part file
df.coalesce(1).write.format('json').save('/your_path/output_directory')
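If the output directory already exists (the error from the original question), the writer's save mode can be set to overwrite, and the single part file can then be copied to a stable filename. A minimal sketch, assuming the same hypothetical paths as above:

import glob
import shutil

# mode('overwrite') replaces the existing directory instead of raising an error
df.coalesce(1).write.mode('overwrite').format('json').save('/your_path/output_directory')

# Spark names its output part-*; copy the single part file to a stable name
part_file = glob.glob('/your_path/output_directory/part-*')[0]
shutil.copy(part_file, '/your_path/output.json')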