Using pandas and json_normalize to flatten nested JSON API response
Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/53198931/
Asked by user2752159
I have a deeply nested JSON that I am trying to turn into a Pandas Dataframe using json_normalize.
A generic sample of the JSON data I'm working with looks like this (I've added context on what I'm trying to do at the bottom of the post):
{
    "per_page": 2,
    "total": 1,
    "data": [{
        "total_time": 0,
        "collection_mode": "default",
        "href": "https://api.surveymonkey.com/v3/responses/5007154325",
        "custom_variables": {
            "custvar_1": "one",
            "custvar_2": "two"
        },
        "custom_value": "custom identifier for the response",
        "edit_url": "https://www.surveymonkey.com/r/",
        "analyze_url": "https://www.surveymonkey.com/analyze/browse/",
        "ip_address": "",
        "pages": [
            {
                "id": "103332310",
                "questions": [{
                    "answers": [{
                        "choice_id": "3057839051"
                    }],
                    "id": "319352786"
                }]
            },
            {
                "id": "44783164",
                "questions": [{
                    "id": "153745381",
                    "answers": [{
                        "text": "some_name"
                    }]
                }]
            },
            {
                "id": "44783183",
                "questions": [{
                    "id": "153745436",
                    "answers": [{
                        "col_id": "1087201352",
                        "choice_id": "1087201369",
                        "row_id": "1087201362"
                    }, {
                        "col_id": "1087201353",
                        "choice_id": "1087201373",
                        "row_id": "1087201362"
                    }]
                }]
            }
        ],
        "date_modified": "1970-01-17T19:07:34+00:00",
        "response_status": "completed",
        "id": "5007154325",
        "collector_id": "50253586",
        "recipient_id": "0",
        "date_created": "1970-01-17T19:07:34+00:00",
        "survey_id": "105723396"
    }],
    "page": 1,
    "links": {
        "self": "https://api.surveymonkey.com/v3/surveys/123456/responses/bulk?page=1&per_page=2"
    }
}
I'd like to end up with a dataframe that contains the question_id, page_id, response_id, and response data like this:
    choice_id      col_id      row_id       text  question_id    page_id  response_id
0  3057839051         NaN         NaN        NaN    319352786  103332310   5007154325
1         NaN         NaN         NaN  some_name    153745381   44783164   5007154325
2  1087201369  1087201352  1087201362        NaN    153745436   44783183   5007154325
3  1087201373  1087201353  1087201362        NaN    153745436   44783183   5007154325
I can get close by running the following code (Python 3.6):
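from pandas import json_normalize  # pandas >= 1.0; on older versions: from pandas.io.json import json_normalize

df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions'], meta='id', record_prefix='question_')
print(df)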
Which returns:
                                    question_answers  question_id          id
0                      [{'choice_id': '3057839051'}]    319352786  5007154325
1                            [{'text': 'some_name'}]    153745381  5007154325
2  [{'col_id': '1087201352', 'choice_id': '108720...    153745436  5007154325
But if I try to run json_normalize at a deeper nest and keep the 'question_id' data from the above result, I can only get the page_id values to return, not true question_id values:
answers_df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions', 'answers'], meta=['id', ['questions', 'id'], ['pages', 'id']])
print(answers_df)
Returns:
    choice_id      col_id      row_id       text          id  questions.id   pages.id
0  3057839051         NaN         NaN        NaN  5007154325     103332310  103332310
1         NaN         NaN         NaN  some_name  5007154325      44783164   44783164
2  1087201369  1087201352  1087201362        NaN  5007154325      44783183   44783183
3  1087201373  1087201353  1087201362        NaN  5007154325      44783183   44783183
A complicating factor may be that all the above (question_id, page_id, response_id) are 'id:' in the JSON data.
I'm sure this is possible, but I can't get there. Any examples of how to do this?
Additional context: I'm trying to create a dataframe of SurveyMonkey API response output.
My long-term goal is to re-create the "all responses" Excel sheet that their export service provides.
I plan to do this by getting the response dataframe set up (above), and then using .apply() to match responses with their survey structure API output.
I've found the SurveyMonkey API pretty lackluster at providing useful output, but I'm new to Pandas so it's probably on me.
Answered by y.luis
You need to modify the meta parameter of your last attempt: the path to the question id has to include 'pages', i.e. ['pages', 'questions', 'id']. If you also want the columns named exactly as in your desired output, you can do that with rename:
answers_df = json_normalize(data=so_survey_responses['data'],
                            record_path=['pages', 'questions', 'answers'],
                            meta=['id', ['pages', 'questions', 'id'], ['pages', 'id']])\
    .rename(index=str,
            columns={'id': 'response_id', 'pages.questions.id': 'question_id', 'pages.id': 'page_id'})
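To try this without the full API response, here is a minimal, self-contained sketch. It assumes pandas 1.0 or later (where json_normalize is available as pd.json_normalize) and uses a trimmed stand-in for the question's payload with only the id fields and answers kept. As far as I can tell, each nested meta path is resolved from the top of every record in data, which is why the question id has to be spelled ['pages', 'questions', 'id'] rather than ['questions', 'id'].

```python
import pandas as pd

# Trimmed stand-in for the question's payload: one response, two pages.
so_survey_responses = {
    "data": [{
        "id": "5007154325",
        "pages": [
            {"id": "103332310",
             "questions": [{"id": "319352786",
                            "answers": [{"choice_id": "3057839051"}]}]},
            {"id": "44783183",
             "questions": [{"id": "153745436",
                            "answers": [
                                {"col_id": "1087201352", "choice_id": "1087201369",
                                 "row_id": "1087201362"},
                                {"col_id": "1087201353", "choice_id": "1087201373",
                                 "row_id": "1087201362"},
                            ]}]},
        ],
    }]
}

# One row per answer; the meta paths pull the ids of the enclosing
# response, question and page into each row.
answers_df = pd.json_normalize(
    data=so_survey_responses["data"],
    record_path=["pages", "questions", "answers"],
    meta=["id", ["pages", "questions", "id"], ["pages", "id"]],
).rename(columns={
    "id": "response_id",
    "pages.questions.id": "question_id",
    "pages.id": "page_id",
})

print(answers_df[["question_id", "page_id", "response_id"]])
```

The question_id and page_id columns should now differ from each other, unlike in the output shown in the question.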
Answered by Abhinav Sood
There is no way to do this in a completely generic way using json_normalize(). You can use the record_path and meta arguments to indicate how you want the JSON to be processed.
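A small sketch of what those two arguments change, assuming pandas 1.0 or later. The record below is a hypothetical toy example, not part of the SurveyMonkey payload.

```python
import pandas as pd

# Hypothetical toy record: one nested dict and one list of dicts.
records = [{"id": "r1", "meta": {"source": "web"}, "items": [{"x": 1}, {"x": 2}]}]

# Without record_path: nested dicts become dotted columns,
# but the list under "items" is left as a single cell.
flat = pd.json_normalize(records)
print(sorted(flat.columns))

# With record_path and meta: one row per element of "items",
# with the chosen parent fields repeated on every row.
rows = pd.json_normalize(records, record_path="items", meta=["id", ["meta", "source"]])
print(rows)
```

The first call shows the limitation: lists are not expanded unless you say which list to walk, so the caller has to know the structure in advance.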
However, you can use the flatten package to flatten your deeply nested JSON and then convert that to a Pandas dataframe. The package's page has example usage showing how to flatten a deeply nested JSON and convert it to a Pandas dataframe.
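The link to the package did not survive on this page, so the sketch below does not use its API. It is a hand-rolled stand-in that illustrates the general idea only: flatten_record is a hypothetical helper, and the underscore-joined key naming with list indexes is my own choice. It assumes pandas is installed.

```python
import pandas as pd


def flatten_record(obj, parent_key="", sep="_"):
    """Recursively flatten nested dicts and lists into a single-level dict.

    Dict keys and list indexes are joined with `sep` to form the new keys.
    Empty dicts and empty lists contribute no keys.
    """
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            items.update(flatten_record(value, new_key, sep))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{index}" if parent_key else str(index)
            items.update(flatten_record(value, new_key, sep))
    else:
        items[parent_key] = obj
    return items


# Trimmed stand-in for one entry of the question's "data" list.
response = {
    "id": "5007154325",
    "pages": [
        {"id": "103332310",
         "questions": [{"id": "319352786",
                        "answers": [{"choice_id": "3057839051"}]}]},
    ],
}

# One wide row per response: every leaf value gets its own column.
wide_df = pd.DataFrame([flatten_record(response)])
print(wide_df.columns.tolist())
```

Note the trade-off: this produces one wide row per response with positional column names, not the one-row-per-answer layout the question asks for, so it suits cases where the structure is not known ahead of time.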