Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/37706351/

Time: 2020-09-14 01:21:20  Source: igfitidea

Nested json to csv - generic approach

python json csv pandas

Asked by An economist

I am very new to Python and I am struggling with converting a nested json file into CSV. To do so, I started by loading the json and then transformed it in a way that prints out nice output with json_normalize; then, using the pandas package, I output the normalised parts into CSV.

My example json:

[{
 "_id": {
   "id": "123"
 },
 "device": {
   "browser": "Safari",
   "category": "d",
   "os": "Mac"
 },
 "exID": {
   "$oid": "123"
 },
 "extreme": false,
 "geo": {
   "city": "London",
   "country": "United Kingdom",
   "countryCode": "UK",
   "ip": "00.000.000.0"
 },
 "viewed": {
   "$date": "2011-02-12"
 },
 "attributes": [{
   "name": "gender",
   "numeric": 0,
   "value": 0
 }, {
   "name": "email",
   "value": false
 }],
 "change": [{
   "id": {
     "$id": "1231"
   },
   "seen": [{
     "$date": "2011-02-12"
   }]
 }]
}, {
 "_id": {
   "id": "456"
 },
 "device": {
   "browser": "Chrome 47",
   "category": "d",
   "os": "Windows"
 },
 "exID": {
   "$oid": "345"
 },
 "extreme": false,
 "geo": {
   "city": "Berlin",
   "country": "Germany",
   "countryCode": "DE",
   "ip": "00.000.000.0"
 },
 "viewed": {
   "$date": "2011-05-12"
 },
 "attributes": [{
   "name": "gender",
   "numeric": 1,
   "value": 1
 }, {
   "name": "email",
   "value": true
 }],
 "change": [{
   "id": {
     "$id": "1231"
   },
   "seen": [{
     "$date": "2011-02-12"
   }]
 }]
}]

With the following code (here I exclude the nested parts):

import json

from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize


def loading_file():
    # File path
    file_path = ''  # file path here

    # Loading json file (the with-block closes the file afterwards)
    with open(file_path) as json_data:
        data = json.load(json_data)
    return data


# Storing available keys
def data_keys(data):
    keys = set()
    for i in data:
        keys.update(i.keys())

    # Excluding nested arrays from keys - hard coded -> IMPROVE
    new_keys = [x for x in keys if
                x != 'attributes' and
                x != 'change']

    return new_keys


# Excluding nested arrays from json dictionary
def new_data(data, keys):
    new_data = []
    for record in data:
        x = {k: v for (k, v) in record.items() if k in keys}
        new_data.append(x)
    return new_data


def csv_out(data):
    data.to_csv('out.csv', encoding='utf-8')


def main():
    data_file = loading_file()
    keys = data_keys(data_file)
    table = new_data(data_file, keys)
    csv_out(json_normalize(table))


main()

My current output looks something like this:

| _id.id | device.browser | device.category | device.os |  ... | viewed.$date |
|--------|----------------|-----------------|-----------|------|--------------|
| 123    | Safari         | d               | Mac       | ...  | 2011-02-12   |
| 456    | Chrome 47      | d               | Windows   | ...  | 2011-05-12   |

My problem is that I would like to include the nested arrays in the CSV, so I have to flatten them. I cannot figure out how to make it generic, so that I do not use the dictionary keys (numeric, id, name) and values while creating the table. I have to make it generalisable because the number of keys in attributes and change may vary. Therefore, I would like to have output like this:

| _id.id | device.browser | ... | attributes_gender_numeric | attributes_gender_value | attributes_email_value | change_id | change_seen |
|--------|----------------|-----|---------------------------|-------------------------|------------------------|-----------|-------------|
| 123    | Safari         | ... | 0                         | 0                       | false                  | 1231      | 2011-02-12  |
| 456    | Chrome 47      | ... | 1                         | 1                       | true                   | 1231      | 2011-02-12  |

Thank you in advance! Any tips on how to improve my code and make it more efficient are very welcome.

Answered by An economist

Thanks to the great blog post by Amir Ziai (linked in the original answer), I managed to output my data in the form of a flat table, with the following function:

from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize


# Function that recursively extracts values out of the object into a flattened dictionary
def flatten_json(data):
    flat = []  # list of flat dictionaries

    def flatten(y):
        out = {}

        def flatten2(x, name=''):
            if isinstance(x, dict):
                for a in x:
                    if a == "name":
                        # Special case: use the "name" entry as the column label
                        # for the sibling "value" entry (assumes "value" exists)
                        flatten2(x["value"], name + x[a] + '_')
                    else:
                        flatten2(x[a], name + a + '_')
            elif isinstance(x, list):
                for a in x:
                    flatten2(a, name + '_')
            else:
                # Leaf value: drop the trailing underscore from the key
                out[name[:-1]] = x

        flatten2(y)
        return out

    # Loop needed to flatten multiple objects
    for record in data:
        flat.append(flatten(record))

    return json_normalize(flat)

I am aware that it is not perfectly generalisable, due to the name-value if statement. However, if this special case for creating the name-value dictionaries is deleted, the code can be used with other embedded arrays.
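As a rough illustration of the remark above, here is a minimal sketch of a flattener without the name-value special case. It is a variation, not the answer's code: the names `flatten_record` and `records_to_csv` are hypothetical, it uses only the standard library instead of pandas, and it adds the list index to each key (an assumption of this sketch) so that several items in one array do not overwrite each other.

```python
import csv
import io


def flatten_record(value, prefix=""):
    """Flatten nested dicts/lists into one dict with underscore-joined keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten_record(child, f"{prefix}{key}_"))
    elif isinstance(value, list):
        # Assumption: the list index becomes part of the key to keep keys unique
        for index, child in enumerate(value):
            flat.update(flatten_record(child, f"{prefix}{index}_"))
    else:
        # Leaf value: drop the trailing underscore from the key
        flat[prefix[:-1]] = value
    return flat


def records_to_csv(records):
    """Return CSV text whose header is the union of all flattened keys."""
    rows = [flatten_record(record) for record in records]
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # restval fills columns that a given record does not have
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = [
    {"_id": {"id": "123"},
     "attributes": [{"name": "gender", "value": 0}, {"name": "email", "value": False}]},
    {"_id": {"id": "456"},
     "attributes": [{"name": "gender", "value": 1}]},
]
csv_text = records_to_csv(sample)
print(csv_text)
```

Records with fewer array items simply get empty cells for the missing columns, so the header is the same for every row.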

Answered by Marcus Vinicius Melo

A couple of weeks ago I had a task to turn a json with nested keys and values into a csv file. For this task it was necessary to handle the nested keys properly and concatenate them so they could be used as unique headers for the values. The result was the code below (also linked in the original answer).

def get_flat_json(json_data, header_string, header, row):
    """Parse json files with nested key-values into flat lists using nested column labeling"""
    for root_key, root_value in json_data.items():
        if isinstance(root_value, dict):
            get_flat_json(root_value, header_string + '_' + str(root_key), header, row)
        elif isinstance(root_value, list):
            for value_index in range(len(root_value)):
                for nested_key, nested_value in root_value[value_index].items():
                    header[0].append((header_string +
                                      '_' + str(root_key) +
                                      '_' + str(nested_key) +
                                      '_' + str(value_index)).strip('_'))
                    if nested_value is None:
                        nested_value = ''
                    row[0].append(str(nested_value))
        else:
            if root_value is None:
                root_value = ''
            header[0].append((header_string + '_' + str(root_key)).strip('_'))
            row[0].append(root_value)
    return header, row

This is a more generalized approach based on An economist's answer to this question.
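The answer does not show how the function is called, so the following is only a sketch of one plausible usage. It assumes `header` and `row` are each passed as a list containing one empty list (the function appends to `header[0]` and `row[0]`), and that one record is flattened per call. The function body is copied from the answer so that the block runs on its own; the sample record and the use of `csv.writer` are illustrative choices.

```python
import csv
import io


def get_flat_json(json_data, header_string, header, row):
    """Parse json files with nested key-values into flat lists using nested column labeling"""
    for root_key, root_value in json_data.items():
        if isinstance(root_value, dict):
            get_flat_json(root_value, header_string + '_' + str(root_key), header, row)
        elif isinstance(root_value, list):
            for value_index in range(len(root_value)):
                for nested_key, nested_value in root_value[value_index].items():
                    header[0].append((header_string +
                                      '_' + str(root_key) +
                                      '_' + str(nested_key) +
                                      '_' + str(value_index)).strip('_'))
                    if nested_value is None:
                        nested_value = ''
                    row[0].append(str(nested_value))
        else:
            if root_value is None:
                root_value = ''
            header[0].append((header_string + '_' + str(root_key)).strip('_'))
            row[0].append(root_value)
    return header, row


# Illustrative record (not the full sample from the question)
record = {
    "device": {"browser": "Safari"},
    "extreme": False,
    "attributes": [{"name": "gender", "value": 0}],
}

# Fresh single-element lists for each record, since the function mutates them
header, row = get_flat_json(record, "", [[]], [[]])

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header[0])
writer.writerow(row[0])
print(buffer.getvalue())
```

Two things to keep in mind: list elements are expected to be dictionaries, and because of `strip('_')` a key with a leading underscore such as `_id` loses that underscore in the generated header.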

Answered by user565447

Use pandas (run "pip install pandas" in the console); it takes 2 lines of code:

import pandas

# Named df rather than json to avoid shadowing the standard json module
df = pandas.read_json('2.json')
df.to_csv('1.csv')
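For nested input like the sample in the question, a possible variation is to pass the parsed records through `pandas.json_normalize`, which expands nested dictionaries into dotted column names and can unpack one nested array via `record_path`. This is a sketch assuming pandas 1.0 or later; the shortened inline data and the variable names are illustrative, not part of the answer.

```python
import json

import pandas as pd

# Shortened inline version of the question's sample, so no file is needed
raw = """
[{"_id": {"id": "123"}, "device": {"browser": "Safari", "os": "Mac"},
  "attributes": [{"name": "gender", "value": 0}]},
 {"_id": {"id": "456"}, "device": {"browser": "Chrome 47", "os": "Windows"},
  "attributes": [{"name": "gender", "value": 1}]}]
"""
records = json.loads(raw)

# Nested dicts become dotted columns; the "attributes" list stays in one column
top_level = pd.json_normalize(records)

# One row per element of "attributes", tagged with the parent record's _id.id
attributes = pd.json_normalize(records, record_path="attributes", meta=[["_id", "id"]])

# Without a path argument, to_csv returns the CSV text instead of writing a file
csv_text = top_level.drop(columns=["attributes"]).to_csv(index=False)
print(csv_text)
print(attributes)
```

The two frames can then be written to separate CSV files or joined on `_id.id`, depending on whether one row per record or one row per attribute is wanted.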