Python 扁平化多级 JSON

Question

提问by Bostjan

I am trying to convert JSON to CSV file, that I can use for further analysis. Issue with my structure is that I have quite some nested dict/lists when I convert my JSON file.

我正在尝试将 JSON 转换为 CSV 文件，以便进一步分析。我的结构的问题是我在转换 JSON 文件时有很多嵌套的字典/列表。

I tried to use pandas json_normalize(), but it only flattens first level.

我尝试使用 pandas json_normalize()，但它只会压平第一级。

import json
import pandas as pd
from pandas.io.json import json_normalize
from cs import CloudStack

api_key = xxxx
secret = xxxx
endpoint = xxxx

cs = CloudStack(endpoint=endpoint,
                key=api_key,
                secret=secret)

virtual_machines = cs.virtMach()

test = json_normalize(virtual_machines["virtualmachine"])

test.to_csv("test.csv", sep="|", index=False)

Any idea how to flatter whole JSON file, so I can create single line input to CSV file for single (in this case virtual machine) entry? I have tried couple of solutions posted here, but my result was always only first level was flattened.

知道如何修饰整个 JSON 文件，以便我可以为单个（在这种情况下为虚拟机）条目创建 CSV 文件的单行输入？我试过这里发布的几个解决方案，但我的结果总是只有第一级被压平。

This is sample JSON (in this case, I still get "securitygroup" and "nic" output as JSON format:

这是示例 JSON（在这种情况下，我仍然以 JSON 格式获得“securitygroup”和“nic”输出：

{
    "count": 13,
    "virtualmachine": [
        {
            "id": "1082e2ed-ff66-40b1-a41b-26061afd4a0b",
            "name": "test-2",
            "displayname": "test-2",
            "securitygroup": [
                {
                    "id": "9e649fbc-3e64-4395-9629-5e1215b34e58",
                    "name": "test",
                    "tags": []
                }
            ],
            "nic": [
                {
                    "id": "79568b14-b377-4d4f-b024-87dc22492b8e",
                    "networkid": "05c0e278-7ab4-4a6d-aa9c-3158620b6471"
                },
                {
                    "id": "3d7f2818-1f19-46e7-aa98-956526c5b1ad",
                    "networkid": "b4648cfd-0795-43fc-9e50-6ee9ddefc5bd"
                    "traffictype": "Guest"
                }
            ],
            "hypervisor": "KVM",
            "affinitygroup": [],
            "isdynamicallyscalable": false
        }
    ]
}

Thank you and best regards, Bostjan

谢谢你，最好的问候，Bostjan

Answer 1

回答by Bostjan

Thanks to gyx-hh, this has been resolved:

感谢 gyx-hh，这已经解决了：

I used following function (details can be found here):

我使用了以下功能（详细信息可以在这里找到）：

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

This unfortunately completely flattens whole JSON, meaning that if you have multi-level JSON (many nested dictionaries), it might flatten everything into single line with tons of columns.

不幸的是，这完全扁平化了整个 JSON，这意味着如果您有多级 JSON（许多嵌套字典），它可能会将所有内容扁平化为包含大量列的单行。

What I used in the end was json_normalize()and specified structure that I required. Nice example of how to do it that way can be found here.

我最终使用的是json_normalize()并指定了我需要的结构。可以在此处找到有关如何以这种方式执行此操作的好示例。

Hopefully this hepls someone and again thank to gyx-hh for solution.

希望这对某人有所帮助，并再次感谢 gyx-hh 的解决方案。

Best regards

此致

Answer 2

回答by Azat Ibrakov

IMO accepted answerdoesn't properly handle JSON array.

IMO接受的答案不能正确处理 JSON 数组。

If JSON object has array as value then it should be flattened to array of objects like

如果 JSON 对象具有数组作为值，那么它应该被展平为对象数组，例如

{'a': [1, 2]} -> [{'a': 1}, {'a': 2}]

instead of adding index to key.

而不是向键添加索引。

And nested objects should be flattened by concatenating keys (e.g. with dot as separator) like

嵌套对象应该通过连接键（例如用点作为分隔符）来展平，例如

{'a': {'b': 1}} -> {'a.b': 1}

(and this is done correctly in accepted one).

（这在接受的一个中正确完成）。

With all these requirements I've ended up with following (developed and used in CPython3.5.3):

有了所有这些要求，我最终得到了以下内容（在CPython3.5.3 中开发和使用）：

from functools import (partial,
                       singledispatch)
from itertools import chain
from typing import (Dict,
                    List,
                    TypeVar)

Serializable = TypeVar('Serializable', None, int, bool, float, str, 
                       dict, list, tuple)
Array = List[Serializable]
Object = Dict[str, Serializable]


def flatten(object_: Object,
            *,
            path_separator: str = '.') -> Array[Object]:
    """
    Flattens given JSON object into list of objects with non-nested values.

    >>> flatten({'a': 1})
    [{'a': 1}]
    >>> flatten({'a': [1, 2]})
    [{'a': 1}, {'a': 2}]
    >>> flatten({'a': {'b': None}})
    [{'a.b': None}]
    """
    keys = set(object_)
    result = [dict(object_)]
    while keys:
        key = keys.pop()
        new_result = []
        for index, record in enumerate(result):
            try:
                value = record[key]
            except KeyError:
                new_result.append(record)
            else:
                if isinstance(value, dict):
                    del record[key]
                    new_value = flatten_nested_objects(
                            value,
                            prefix=key + path_separator,
                            path_separator=path_separator)
                    keys.update(new_value.keys())
                    new_result.append({**new_value, **record})
                elif isinstance(value, list):
                    del record[key]
                    new_records = [
                        flatten_nested_objects(sub_value,
                                               prefix=key + path_separator,
                                               path_separator=path_separator)
                        for sub_value in value]
                    keys.update(chain.from_iterable(map(dict.keys,
                                                        new_records)))
                    new_result.extend({**new_record, **record}
                                      for new_record in new_records)
                else:
                    new_result.append(record)
        result = new_result
    return result


@singledispatch
def flatten_nested_objects(object_: Serializable,
                           *,
                           prefix: str = '',
                           path_separator: str) -> Object:
    return {prefix[:-len(path_separator)]: object_}


@flatten_nested_objects.register(dict)
def _(object_: Object,
      *,
      prefix: str = '',
      path_separator: str) -> Object:
    result = dict(object_)
    for key in list(result):
        result.update(flatten_nested_objects(result.pop(key),
                                             prefix=(prefix + key
                                                     + path_separator),
                                             path_separator=path_separator))
    return result


@flatten_nested_objects.register(list)
def _(object_: Array,
      *,
      prefix: str = '',
      path_separator: str) -> Object:
    return {prefix[:-len(path_separator)]: list(map(partial(
            flatten_nested_objects,
            path_separator=path_separator),
            object_))}

Answer 3

回答by ak4zh

I use this simple function to normalize and flatten data to json. It accepts list, dict, tuple and flattens it to a json.

我使用这个简单的函数将数据规范化和扁平化为 json。它接受列表、字典、元组并将其展平为 json。

def normalize_data_to_json(raw_data: [list, dict, tuple], parent=""):
    from datetime import datetime
    from decimal import Decimal

    result = {}
    # key name normalise to snake case (single underscore)
    parent = parent.lower().replace(" ", "_") if isinstance(parent, str) else parent
    if isinstance(parent, str) and parent.startswith("__"):
        # if parent has no parent remove double underscore and treat as int if digit else as str
        # treating as int is better if passed data is a list so you output is index based dict
        parent = int(parent.lstrip("_")) if parent.lstrip("_").isdigit() else parent.lstrip("_")

    # handle str, int, float, and decimal.
    # you can easily add more data types as er your data
    if type(raw_data) in [str, int, float, Decimal]:
        result[parent] = float(raw_data) if isinstance(raw_data, Decimal) else raw_data

    # normalise datetime object
    elif isinstance(raw_data, datetime):
        result[parent] = raw_data.strftime("%Y-%m-%d %H:%M:%S")

    # normalise dict and all nested dicts.
    # all nests are joined with double underscore to identify parent key name with it's children
    elif isinstance(raw_data, dict):
        for k, v in raw_data.items():
            k = f'{parent}__{k}' if parent else k
            result.update(normalize_data_to_json(v, parent=k))

    # normalise list and tuple
    elif type(raw_data) in [list, tuple]:
        for i, sub_item in enumerate(raw_data, start=1):
            result.update(normalize_data_to_json(sub_item, f"{parent}__{i}"))

    # any data which did not matched above data types, normalise them using it's __str__
    else:
        result[parent] = str(raw_data)

    return result

Answer 4

回答by Paul Whipp

In case anyone else finds themselves here and is looking for a solution better suited to subsequent programmatic treatment:

如果其他人发现自己在这里并正在寻找更适合后续程序化处理的解决方案：

Flattening the lists creates the need to process the headings for list lengths etc. I wanted a solution where if there are 2 lists of e.g. 2 elements then there would be four rows generated yielding each valid potential data row (see below for actual examples):

展平列表需要处理列表长度等的标题。我想要一个解决方案，如果有 2 个列表，例如 2 个元素，那么将生成四行，产生每个有效的潜在数据行（有关实际示例，请参见下文）：

class MapFlattener:

    def __init__(self):
        self.headings = []
        self.rows = []

    def add_rows(self, headings, rows):
        self.headings = [*self.headings, *headings]
        if self.rows:
            new_rows = []
            for base_row in self.rows:
                for row in rows:
                    new_rows.append([*base_row, *row])
            self.rows = new_rows
        else:
            self.rows = rows

    def __call__(self, mapping):
        for heading, value in mapping.items():
            if isinstance(value, Mapping):
                sub_headings, sub_rows = MapFlattener()(value)
                sub_headings = [f'{heading}:{sub_heading}' for sub_heading in sub_headings]
                self.add_rows(sub_headings, sub_rows)
                continue

            if isinstance(value, list):
                self.add_rows([heading], [[e] for e in value])
                continue

            self.add_rows([heading], [[value]])

        return self.headings, self.rows


def map_flatten(mapping):
    return MapFlattener()(mapping)

This creates output more in line with relational data:

这将创建更符合关系数据的输出：

In [22]: map_flatten({'l': [1,2]})                                                                                                          
Out[22]: (['l'], [[1], [2]])

In [23]: map_flatten({'l': [1,2], 'n': 7})                                                                                                  
Out[23]: (['l', 'n'], [[1, 7], [2, 7]])

In [24]: map_flatten({'l': [1,2], 'n': 7, 'o': {'a': 1, 'b': 2}})                                                                           
Out[24]: (['l', 'n', 'o:a', 'o:b'], [[1, 7, 1, 2], [2, 7, 1, 2]])

This is particularly useful if you are using the csv in spreadsheets etc. and need to process the flattened data.

如果您在电子表格等中使用 csv 并且需要处理扁平化的数据，这将特别有用。

Answer 5

回答by TrickiDicki

Outputting in jsonpath format:

以jsonpath格式输出：

def convert(f):
    out = {}
    def flatten(x, name=None):
        if type(x) is dict:
            for a in x:
                val = '.'.join((name, a)) if name else a
                flatten(x[a], val)
        elif type(x) is list:
            for (i, a) in enumerate(x):
                flatten(a, name + f'[{str(i)}]')
        else:
            out[name] = x if x else ""
    flatten(f)
    return out

Answer 6

回答by Utkarsh Shukla

Just pass your dictionary here:

只需在此处传递您的字典：

def getKeyValuePair(dic,master_dic = {},master_key = None):
    keys = list(dic.keys())
    for key in keys:
        if type(dic[key]) == dict:
                getKeyValuePair(dic[key],master_dic = master_dic,master_key = key)
        else:
            if master_key == None:
                master_dic[key] = dic[key]
            else:
                master_dic[str(master_key)+'_'+str(key)] = dic[key]

   return master_dic

Python 扁平化多级 JSON

提问by Bostjan

回答by Bostjan

回答by Azat Ibrakov

回答by ak4zh

回答by Paul Whipp

回答by TrickiDicki

回答by Utkarsh Shukla

相关推荐

最近更新

标签

Python 扁平化多级 JSON

提问by Bostjan

回答by Bostjan

回答by Azat Ibrakov

回答by ak4zh

回答by Paul Whipp

回答by TrickiDicki

回答by Utkarsh Shukla

相关推荐

Python 在熊猫中删除列的最佳方法是什么

Python tensorflow 的 tf.nn.max_pool 中的“SAME”和“VALID”填充有什么区别？

Python错误：需要以下参数

如何在 Google Colab 笔记本中的“.py”文件中运行 Python 脚本？

相关推荐

最近更新

标签