How to convert DataFrame to Json?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31473215/

Date: 2020-09-03 17:58:57 · Source: igfitidea

Tags: json, scala, apache-spark, apache-spark-sql

Asked by ashish.garg

I have a huge JSON file; a small part of it follows:

{
    "socialNews": [{
        "adminTagIds": "",
        "fileIds": "",
        "departmentTagIds": "",
        ........
        ........
        "comments": [{
            "commentId": "",
            "newsId": "",
            "entityId": "",
            ....
            ....
        }]
    }]
    .....
    }

I have applied a lateral view explode on socialNews as follows:

// Spark 1.x: load the JSON file into a DataFrame (jsonFile is the pre-1.4 reader API)
val rdd = sqlContext.jsonFile("file:///home/ashish/test")
// Register as a temp table so it can be queried with SQL
rdd.registerTempTable("social")
// Flatten the socialNews array into one row per element
val result = sqlContext.sql("select * from social LATERAL VIEW explode(socialNews) social AS comment")

Now I want to convert this result (DataFrame) back to JSON and save it to a file, but I am not able to find any Scala API to do the conversion. Is there a standard library for this, or some other way to figure it out?

Answered by Nikita

// Read JSON into a DataFrame, then write it back out as JSON (one record per line)
val result: DataFrame = sqlContext.read.json(path)
result.write.json("/yourPath")

The method write is in the class DataFrameWriter and should be accessible to you on DataFrame objects. Just make sure that your rdd is of type DataFrame and not of the deprecated type SchemaRDD. You can explicitly provide a type annotation, val data: DataFrame, or convert to a DataFrame with toDF().
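
For instance, if you start from an RDD rather than a DataFrame, here is a minimal sketch of the toDF() route (assuming a Spark 1.x sqlContext in scope and a hypothetical Comment case class):

import sqlContext.implicits._ // brings toDF() into scope

// Hypothetical record type standing in for one exploded comment
case class Comment(commentId: String, newsId: String, entityId: String)

val comments = sc.parallelize(Seq(
  Comment("c1", "n1", "e1"),
  Comment("c2", "n1", "e2")
))

// Convert the RDD to a DataFrame, then write it back out as JSON (one record per line)
comments.toDF().write.json("/tmp/commentsAsJson")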

Answered by MrChristine

If you have a DataFrame, there is an API to convert it back to an RDD[String] that contains the JSON records.

// Build a small example DataFrame
val df = Seq((2012, 8, "Batman", 9.8), (2012, 8, "Hero", 8.7), (2012, 7, "Robot", 5.5), (2011, 7, "Git", 2.0)).toDF("year", "month", "title", "rating")
// toJSON converts each row to a JSON string, e.g. {"year":2012,"month":8,"title":"Batman","rating":9.8}
df.toJSON.saveAsTextFile("/tmp/jsonRecords")
df.toJSON.take(2).foreach(println)

This should be available from Spark 1.4 onward. Call the API on the result DataFrame you created.

The available APIs are listed in the Spark documentation.

Answered by abhijitcaps

// Round-trip: toJSON serializes each row to a JSON string, and read.json
// parses the strings back into a DataFrame (re-inferring the schema)
sqlContext.read.json(dataFrame.toJSON)

Answered by Chetan Tamballa

If you still can't figure out a way to convert a DataFrame into JSON, you can use the to_json or toJSON built-in Spark functions.

Let me know if you have a sample DataFrame and a target JSON format to convert to.
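
As a minimal sketch of the to_json route (assuming Spark 2.1+, where to_json was added to org.apache.spark.sql.functions, and a hypothetical two-column DataFrame), wrap all columns in a struct and serialize each row to a single JSON string column:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder.appName("toJsonSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical input DataFrame
val df = Seq((1, "hello"), (2, "world")).toDF("id", "msg")

// Serialize a struct of all columns into one JSON string per row
df.select(to_json(struct(df.columns.map(col): _*)).alias("json")).show(false)
// {"id":1,"msg":"hello"}
// {"id":2,"msg":"world"}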

Answered by Ganesh

When you run your Spark job with --master local --deploy-mode client, then df.write.json("path/to/file/data.json") works.

If you run on a cluster (--master yarn --deploy-mode cluster), a better approach is to write the data to AWS S3 or Azure Blob storage and read it from there.

df.write.json("s3://bucket/path/to/file/data.json") works.
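
As a sketch (assuming the Hadoop AWS connector and S3 credentials are configured, and a hypothetical bucket name), the cluster-friendly write could look like:

// s3a is the Hadoop filesystem scheme commonly used from Spark
df.write
  .mode("overwrite") // replace any previous output at this path
  .json("s3a://my-bucket/exports/data-json")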