Note: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/47765243/

Pandas - expand nested json array within column in dataframe

Tags: python, json, pandas

Asked by Eric D.

I have JSON data (coming from MongoDB) containing thousands of records (that is, an array/list of JSON objects), where each object has a structure like the one below:

{
   "id":1,
   "first_name":"Mead",
   "last_name":"Lantaph",
   "email":"[email protected]",
   "gender":"Male",
   "ip_address":"231.126.209.31",
   "nested_array_to_expand":[
      {
         "property":"Quaxo",
         "json_obj":{
            "prop1":"Chevrolet",
            "prop2":"Mercy Streets"
         }
      },
      {
         "property":"Blogpad",
         "json_obj":{
            "prop1":"Hyundai",
            "prop2":"Flashback"
         }
      },
      {
         "property":"Yabox",
         "json_obj":{
            "prop1":"Nissan",
            "prop2":"Welcome Mr. Marshall (Bienvenido Mister Marshall)"
         }
      }
   ]
}

When loaded into a dataframe, the "nested_array_to_expand" column is a string containing the JSON (I do use "json_normalize" during loading). The expected result is a dataframe with 3 rows (given the above example) and new columns for the nested objects, as shown below:

index   email first_name gender  id      ip_address last_name  \
0  [email protected]       Mead   Male   1  231.126.209.31   Lantaph   
1  [email protected]       Mead   Male   1  231.126.209.31   Lantaph   
2  [email protected]       Mead   Male   1  231.126.209.31   Lantaph   

  test.name                                      test.obj.ahah test.obj.buzz  
0     Quaxo                                      Mercy Streets     Chevrolet  
1   Blogpad                                          Flashback       Hyundai  
2     Yabox  Welcome Mr. Marshall (Bienvenido Mister Marshall)        Nissan  

I was able to get that result with the function below, but it is extremely slow (around 2s for 1k records), so I would like to either improve the existing code or find a completely different approach to get this result.

import pandas as pd
from pandas import json_normalize

def expand_field(field, df, parent_id='id'):
    all_sub = pd.DataFrame()
    # we need an id per row to be able to merge the dataframes back together;
    # if there is no id column, create one from the row index
    if parent_id not in df:
        df[parent_id] = df.index

    # go through all rows and build a new dataframe from the nested values
    for i, row in df.iterrows():
        try:
            sub = json_normalize(row[field])
            sub = sub.add_prefix(field + '.')
            sub['parent_id'] = row[parent_id]
            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
            all_sub = pd.concat([all_sub, sub])
        except Exception:
            print('crash')
    df = pd.merge(df, all_sub, left_on=parent_id, right_on='parent_id', how='left')
    # remove the helper column and the original nested column
    del df["parent_id"]
    del df[field]
    # return the expanded dataframe
    return df

Many thanks for your help.

===== EDIT in response to a comment =====

The data loaded from MongoDB is an array of objects. I load it with the following code:

import json
from pandas import json_normalize
data = json.loads(my_json_string)
df = json_normalize(data)

The output gives me a dataframe in which df["nested_array_to_expand"] has dtype object (string):

0    [{'property': 'Quaxo', 'json_obj': {'prop1': '...
Name: nested_array_to_expand, dtype: object
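
Since dtype object only says that the column holds arbitrary Python objects, it can help to inspect a single cell before choosing an approach. The sketch below assumes a shortened, hypothetical one-record sample shaped like the data above; with json.loads followed by json_normalize, the cell would normally be a list of dicts rather than a string.

```python
import json
import pandas as pd

# Hypothetical, shortened sample mirroring the structure in the question.
my_json_string = json.dumps([{
    "id": 1,
    "first_name": "Mead",
    "nested_array_to_expand": [
        {"property": "Quaxo",
         "json_obj": {"prop1": "Chevrolet", "prop2": "Mercy Streets"}},
        {"property": "Blogpad",
         "json_obj": {"prop1": "Hyundai", "prop2": "Flashback"}},
    ],
}])

data = json.loads(my_json_string)
df = pd.json_normalize(data)

# Look at what is actually stored in the first cell of the nested column.
cell = df["nested_array_to_expand"].iloc[0]
cell_type = type(cell).__name__
print(cell_type, len(cell))
```

If the cell turns out to be a real string instead, it would need to be parsed (for example with json.loads) before it can be expanded.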

Answered by Romain

I think an interesting approach is to use pandas.json_normalize.
I use it to expand the nested JSON. There may be a better way, but you should definitely consider using this feature. Afterwards you just have to rename the columns as you want.

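import io
import json

import pandas as pd
from pandas import json_normalize

# Load the JSON string (json_str holds one record from the question) into a dict
json_dict = json.load(io.StringIO(json_str))

# Scalars are broadcast across the list items; the nested dicts become columns
df = pd.concat([pd.DataFrame(json_dict),
                json_normalize(json_dict['nested_array_to_expand'])],
               axis=1).drop(columns='nested_array_to_expand')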
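
The snippet above handles a single record. For a whole list of records, as in the question, one possible variant is to let json_normalize do the expansion in a single call through its record_path and meta parameters. This is only a sketch: the second record and the choice of meta fields are made up for illustration, and the names records and flat are hypothetical.

```python
import pandas as pd

# Two records shaped like the question's data; the second one is hypothetical.
records = [
    {"id": 1, "first_name": "Mead", "last_name": "Lantaph",
     "nested_array_to_expand": [
         {"property": "Quaxo",
          "json_obj": {"prop1": "Chevrolet", "prop2": "Mercy Streets"}},
         {"property": "Blogpad",
          "json_obj": {"prop1": "Hyundai", "prop2": "Flashback"}},
         {"property": "Yabox",
          "json_obj": {"prop1": "Nissan", "prop2": "Welcome Mr. Marshall"}},
     ]},
    {"id": 2, "first_name": "Ann", "last_name": "Example",
     "nested_array_to_expand": [
         {"property": "Zoomzone",
          "json_obj": {"prop1": "Ford", "prop2": "Some Film"}},
     ]},
]

# record_path names the list to expand into rows; meta lists the parent
# fields to repeat on every row; record_prefix labels the expanded columns.
flat = pd.json_normalize(
    records,
    record_path="nested_array_to_expand",
    meta=["id", "first_name", "last_name"],
    record_prefix="nested_array_to_expand.",
)
print(flat.columns.tolist())
```

This avoids the Python-level loop over rows, which is likely where most of the time goes in the question's function. Records that lack the nested key may need to be filtered or filled in first.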

Answered by Gabriel Fair

The following code does what you want. You can unroll the nested list using Python's built-in list function and pass the result to a new dataframe: pd.DataFrame(list(json_dict['nested_col']))

You might have to do several iterations of this, depending on how nested your data is.

import pandas as pd

# json_dict is the parsed JSON record from the question
df = pd.concat([pd.DataFrame(json_dict),
                pd.DataFrame(list(json_dict['nested_array_to_expand']))],
               axis=1).drop(columns='nested_array_to_expand')
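
As a rough illustration of the "several iterations" idea, the sketch below works on a dataframe that already holds the nested lists in a column. It first uses DataFrame.explode to get one row per list element, then json_normalize to flatten each element, including the inner json_obj dict. The sample data is a shortened, hypothetical version of the question's records, and ignore_index in explode assumes pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical dataframe whose cells hold lists of dicts, as in the question.
df = pd.DataFrame({
    "id": [1, 2],
    "nested_array_to_expand": [
        [{"property": "Quaxo",
          "json_obj": {"prop1": "Chevrolet", "prop2": "Mercy Streets"}},
         {"property": "Blogpad",
          "json_obj": {"prop1": "Hyundai", "prop2": "Flashback"}}],
        [{"property": "Yabox",
          "json_obj": {"prop1": "Nissan", "prop2": "Welcome Mr. Marshall"}}],
    ],
})

# Iteration 1: one row per element of the nested list.
exploded = df.explode("nested_array_to_expand", ignore_index=True)

# Iteration 2: flatten each dict, including the inner json_obj, into columns.
expanded = pd.json_normalize(
    exploded["nested_array_to_expand"].tolist()
).add_prefix("nested_array_to_expand.")

# Both frames share the same 0..n-1 index, so they can be joined side by side.
result = pd.concat(
    [exploded.drop(columns="nested_array_to_expand"), expanded], axis=1
)
print(result)
```

Rows whose list is empty or missing become NaN after explode and would need to be dropped or filled before the json_normalize step.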