Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/53198931/

Date: 2020-09-14 06:08:30  Source: igfitidea

Using pandas and json_normalize to flatten nested JSON API response

Tags: python, json, pandas, surveymonkey

Asked by user2752159

I have a deeply nested JSON that I am trying to turn into a Pandas Dataframe using json_normalize.

A generic sample of the JSON data I'm working with looks like this (I've added context on what I'm trying to do at the bottom of the post):

{
    "per_page": 2,
    "total": 1,
    "data": [{
            "total_time": 0,
            "collection_mode": "default",
            "href": "https://api.surveymonkey.com/v3/responses/5007154325",
            "custom_variables": {
                "custvar_1": "one",
                "custvar_2": "two"
            },
            "custom_value": "custom identifier for the response",
            "edit_url": "https://www.surveymonkey.com/r/",
            "analyze_url": "https://www.surveymonkey.com/analyze/browse/",
            "ip_address": "",
            "pages": [
                {
                    "id": "103332310",
                    "questions": [{
                            "answers": [{
                                    "choice_id": "3057839051"
                                }
                            ],
                            "id": "319352786"
                        }
                    ]
                },
                {
                    "id": "44783164",
                    "questions": [{
                            "id": "153745381",
                            "answers": [{
                                    "text": "some_name"
                                }
                            ]
                        }
                    ]
                },
                {
                    "id": "44783183",
                    "questions": [{
                            "id": "153745436",
                            "answers": [{
                                    "col_id": "1087201352",
                                    "choice_id": "1087201369",
                                    "row_id": "1087201362"
                                }, {
                                    "col_id": "1087201353",
                                    "choice_id": "1087201373",
                                    "row_id": "1087201362"
                                }
                            ]
                        }
                    ]
                }
            ],
            "date_modified": "1970-01-17T19:07:34+00:00",
            "response_status": "completed",
            "id": "5007154325",
            "collector_id": "50253586",
            "recipient_id": "0",
            "date_created": "1970-01-17T19:07:34+00:00",
            "survey_id": "105723396"
        }
    ],
    "page": 1,
    "links": {
        "self": "https://api.surveymonkey.com/v3/surveys/123456/responses/bulk?page=1&per_page=2"
    }
}

I'd like to end up with a dataframe that contains the question_id, page_id, response_id, and response data like this:

    choice_id      col_id      row_id       text   question_id       page_id      response_id
0  3057839051         NaN         NaN        NaN     319352786     103332310       5007154325
1         NaN         NaN         NaN  some_name     153745381      44783164       5007154325
2  1087201369  1087201352  1087201362        NaN     153745436      44783183       5007154325
3  1087201373  1087201353  1087201362        NaN     153745436      44783183       5007154325

I can get close by running the following code (Python 3.6):

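from pandas import json_normalize  # pandas >= 1.0; on older releases: from pandas.io.json import json_normalize
df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions'], meta='id', record_prefix='question_')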
print(df)

Which returns:

                                    question_answers question_id          id
0                      [{'choice_id': '3057839051'}]   319352786  5007154325
1                            [{'text': 'some_name'}]   153745381  5007154325
2  [{'col_id': '1087201352', 'choice_id': '108720...   153745436  5007154325

But if I run json_normalize one level deeper and try to keep the 'question_id' data from the result above, the questions.id column comes back with the page_id values, not the true question_id values:

answers_df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions', 'answers'], meta=['id', ['questions', 'id'], ['pages', 'id']])
print(answers_df)

Returns:

    choice_id      col_id      row_id       text          id questions.id   pages.id
0  3057839051         NaN         NaN        NaN  5007154325    103332310  103332310
1         NaN         NaN         NaN  some_name  5007154325     44783164   44783164
2  1087201369  1087201352  1087201362        NaN  5007154325     44783183   44783183
3  1087201373  1087201353  1087201362        NaN  5007154325     44783183   44783183

A complicating factor may be that all of the above (question_id, page_id, response_id) are stored under the same key, 'id', in the JSON data.

I'm sure this is possible, but I can't get there. Any examples of how to do this?

Additional context: I'm trying to create a dataframe of SurveyMonkey API response output.

My long-term goal is to re-create the "all responses" Excel sheet that their export service provides.

I plan to do this by getting the response dataframe set up (above), and then using .apply() to match responses with their survey structure API output.

I've found the SurveyMonkey API pretty lackluster at providing useful output, but I'm new to Pandas so it's probably on me.

Answered by y.luis

You need to modify the meta parameter of your last attempt, and, if you want the columns named exactly the way you showed, you can rename them with rename:

answers_df = json_normalize(data=so_survey_responses['data'],
                        record_path=['pages', 'questions', 'answers'],
                        meta=['id', ['pages', 'questions', 'id'], ['pages', 'id']])\
.rename(index=str,
        columns={'id': 'response_id', 'pages.questions.id': 'question_id', 'pages.id': 'page_id'})
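
For anyone who wants to try this without calling the API, here is a minimal self-contained sketch of the same call. It assumes pandas 1.0 or later (where json_normalize is available as pd.json_normalize) and uses a trimmed-down copy of the sample response from the question; the variable name so_survey_responses is taken from the question, everything else is the sample data. The point to notice is that each meta path is written out in full from the top of a response record, so the question id is ['pages', 'questions', 'id'] rather than ['questions', 'id'].

```python
import pandas as pd

# Trimmed-down copy of the sample response from the question
# (one response, three pages); unrelated fields are left out.
so_survey_responses = {
    "data": [{
        "id": "5007154325",
        "pages": [
            {"id": "103332310",
             "questions": [{"id": "319352786",
                            "answers": [{"choice_id": "3057839051"}]}]},
            {"id": "44783164",
             "questions": [{"id": "153745381",
                            "answers": [{"text": "some_name"}]}]},
            {"id": "44783183",
             "questions": [{"id": "153745436",
                            "answers": [
                                {"col_id": "1087201352", "choice_id": "1087201369", "row_id": "1087201362"},
                                {"col_id": "1087201353", "choice_id": "1087201373", "row_id": "1087201362"},
                            ]}]},
        ],
    }]
}

# One row per answer; the three 'id' fields are pulled in as metadata
# and then renamed so they can be told apart.
answers_df = pd.json_normalize(
    data=so_survey_responses["data"],
    record_path=["pages", "questions", "answers"],
    meta=["id", ["pages", "questions", "id"], ["pages", "id"]],
).rename(columns={
    "id": "response_id",
    "pages.questions.id": "question_id",
    "pages.id": "page_id",
})

print(answers_df)
```

The column order of the answer fields may differ from the table in the question, since it follows the order in which keys first appear in the data.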

Answered by Abhinav Sood

There is no way to do this in a completely generic way using json_normalize(). You can use the record_path and meta arguments to indicate how you want the JSON to be processed.

However, you can use the flatten package to flatten your deeply nested JSON and then convert that to a Pandas dataframe. The package's page has example usage showing how to flatten a deeply nested JSON and convert it to a Pandas dataframe.
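
To illustrate the general idea of flattening first and building the dataframe afterwards, here is a rough hand-written sketch. It is not the code of the package mentioned above; flatten_record is a hypothetical helper written only for this example, and the underscore-joined key naming is an assumption of the sketch.

```python
import pandas as pd


def flatten_record(value, prefix="", sep="_"):
    """Recursively flatten nested dicts and lists into a single-level dict.

    Hypothetical helper for illustration only: dict keys and list
    positions are joined with `sep` to form the flattened key.
    """
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten_record(child, f"{prefix}{key}{sep}", sep))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten_record(child, f"{prefix}{index}{sep}", sep))
    else:
        # Leaf value: strip the trailing separator from the accumulated key.
        flat[prefix[: len(prefix) - len(sep)]] = value
    return flat


# A small piece of the sample response from the question.
record = {
    "id": "5007154325",
    "custom_variables": {"custvar_1": "one"},
    "pages": [
        {"id": "103332310",
         "questions": [{"id": "319352786",
                        "answers": [{"choice_id": "3057839051"}]}]},
    ],
}

flat = flatten_record(record)
wide_df = pd.DataFrame([flat])  # one wide row per response
print(wide_df.T)
```

Note that this produces one wide row per response, with list positions encoded in the column names, rather than the one-row-per-answer layout asked for in the question, so some reshaping would still be needed afterwards.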