pandas: flattening nested JSON in a pandas data frame

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/52795561/

Time: 2020-09-14 06:04:59  Source: igfitidea

flattening nested Json in pandas data frame

python json pandas flatten

Asked by Zephyr

I am trying to load the json file to pandas data frame. I found that there were some nested json. Below is the sample json:

{'events': [{'id': 142896214,
   'playerId': 37831,
   'teamId': 3157,
   'matchId': 2214569,
   'matchPeriod': '1H',
   'eventSec': 0.8935539999999946,
   'eventId': 8,
   'eventName': 'Pass',
   'subEventId': 85,
   'subEventName': 'Simple pass',
   'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
   'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}

I used the following code to load json into dataframe:

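import json
import pandas as pd

with open('EVENTS.json') as f:
    jsonstr = json.load(f)

# pd.io.json.json_normalize is deprecated; pandas >= 1.0 exposes it as pd.json_normalize
df = pd.json_normalize(jsonstr['events'])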

Below is the output of df.head()

[Image: output of df]

But I found two nested columns, positions and tags.

I tried using the following code to flatten it:

Position_data = json_normalize(data =jsonstr['events'], record_path='positions', meta = ['x','y','x','y'] )

It raised the following error:

KeyError: "Try running with errors='ignore' as key 'x' is not always present"

Can you advise me how to flatten positions and tags (the columns that hold nested data)?

Thanks, Zep

Answered by calestini

If you are looking for a more general way to unfold multiple hierarchies from a json, you can use recursion and a list comprehension to reshape your data. One alternative is presented below:

def flatten_json(nested_json, exclude=('',)):
    """Flatten json object with nested keys into a single level.
        Args:
            nested_json: A nested json object.
            exclude: Keys to exclude from output (matched at every level).
        Returns:
            A flat dict; nested keys and list indices are joined with '_'.
    """
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for key in x:
                if key not in exclude:
                    flatten(x[key], name + key + '_')
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, name + str(i) + '_')
        else:
            # Leaf value: drop the trailing '_' from the accumulated name
            out[name[:-1]] = x

    flatten(nested_json)
    return out

Then you can apply it to your data, regardless of how deeply it is nested:

New sample data

this_dict = {'events': [
  {'id': 142896214,
   'playerId': 37831,
   'teamId': 3157,
   'matchId': 2214569,
   'matchPeriod': '1H',
   'eventSec': 0.8935539999999946,
   'eventId': 8,
   'eventName': 'Pass',
   'subEventId': 85,
   'subEventName': 'Simple pass',
   'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
   'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]},
 {'id': 142896214,
   'playerId': 37831,
   'teamId': 3157,
   'matchId': 2214569,
   'matchPeriod': '1H',
   'eventSec': 0.8935539999999946,
   'eventId': 8,
   'eventName': 'Pass',
   'subEventId': 85,
   'subEventName': 'Simple pass',
   'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53},{'x': 51, 'y': 49}],
   'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}
]}

Usage

pd.DataFrame([flatten_json(x) for x in this_dict['events']])

Out[1]:
          id  playerId  teamId  matchId matchPeriod  eventSec  eventId  \
0  142896214     37831    3157  2214569          1H  0.893554        8   
1  142896214     37831    3157  2214569          1H  0.893554        8   

  eventName  subEventId subEventName  positions_0_x  positions_0_y  \
0      Pass          85  Simple pass             51             49   
1      Pass          85  Simple pass             51             49   

   positions_1_x  positions_1_y  tags_0_id tags_0_tag_label  positions_2_x  \
0             40             53       1801         accurate            NaN   
1             40             53       1801         accurate           51.0   

   positions_2_y  
0            NaN  
1           49.0  

Note that this flatten_json code is not mine; I have seen it here and here, without much certainty about the original source.
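
The exclude argument is not exercised in the usage above, so here is a minimal sketch of it. The shortened event record and the choice to drop 'tags' are assumptions made for illustration; flatten_json is repeated only so the snippet runs on its own.

```python
import pandas as pd


def flatten_json(nested_json, exclude=('',)):
    """Flatten a nested JSON object into a single-level dict."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for key in x:
                if key not in exclude:
                    flatten(x[key], name + key + '_')
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, name + str(i) + '_')
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out


# Hypothetical, shortened event record (not the full sample above)
event = {'id': 142896214,
         'eventName': 'Pass',
         'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
         'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}

# Skip the whole 'tags' branch while flattening
flat = flatten_json(event, exclude=['tags'])
df_flat = pd.DataFrame([flat])
print(df_flat.columns.tolist())
```

Because the key check runs at every nesting level, excluding a name such as 'id' would drop both the top-level id and the id inside tags.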

Answered by Trenton McKinney

data = {'events': [{'id': 142896214,
                    'playerId': 37831,
                    'teamId': 3157,
                    'matchId': 2214569,
                    'matchPeriod': '1H',
                    'eventSec': 0.8935539999999946,
                    'eventId': 8,
                    'eventName': 'Pass',
                    'subEventId': 85,
                    'subEventName': 'Simple pass',
                    'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
                    'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}

Create the DataFrame

import pandas as pd

df = pd.DataFrame.from_dict(data)
df = df['events'].apply(pd.Series)

Flatten positions with pd.Series:

df_p = df['positions'].apply(pd.Series)

df_p_0 = df_p[0].apply(pd.Series)
df_p_1 = df_p[1].apply(pd.Series)

Rename positions[0] & positions[1]:

df_p_0.columns = ['pos_0_x', 'pos_0_y']
df_p_1.columns = ['pos_1_x', 'pos_1_y']
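
The two steps above hard-code two positions. A possible sketch of the same idea as a loop follows; it assumes every event carries the same number of positions (ragged lists would leave NaN cells that need separate handling), and the names parts and df_pos are hypothetical.

```python
import pandas as pd

# Shortened, hypothetical version of the sample data
data = {'events': [{'id': 142896214,
                    'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}]}]}

df = pd.DataFrame(data['events'])

# One column per list element: 0, 1, ...
df_p = df['positions'].apply(pd.Series)

parts = []
for i in df_p.columns:
    # Expand each {'x': ..., 'y': ...} dict into its own columns
    part = df_p[i].apply(pd.Series)
    part.columns = [f'pos_{i}_{c}' for c in part.columns]
    parts.append(part)

df_pos = pd.concat(parts, axis=1)
print(df_pos)
```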

Flatten tags with pd.Series:

df_t = df.tags.apply(pd.Series)
df_t = df_t[0].apply(pd.Series)
df_t_t = df_t.tag.apply(pd.Series)

Rename id & label:

df_t = df_t.rename(columns={'id': 'tags_id'})
df_t_t.columns = ['tags_tag_label']

Combine them all with pd.concat:

df_new = pd.concat([df, df_p_0, df_p_1, df_t.tags_id, df_t_t], axis=1)

Drop the old columns:

df_new = df_new.drop(['positions', 'tags'], axis=1)
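
For comparison, the json_normalize call from the question can probably be made to work too. The KeyError there arises because meta keys are looked up on each top-level event, and an event has no 'x' key. The sketch below is not part of the answer above: it assumes pandas >= 1.0 (for pd.json_normalize), uses a shortened sample, and yields a long format with one row per position or tag instead of one wide row per event.

```python
import pandas as pd

# Shortened, hypothetical version of the sample data
data = {'events': [{'id': 142896214,
                    'eventName': 'Pass',
                    'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
                    'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}

# meta must name keys of the parent event, not keys inside each position
positions = pd.json_normalize(
    data['events'],
    record_path='positions',
    meta=['id', 'eventName'],
)

# Each tag has its own 'id', so the event-level 'id' needs a prefix
tags = pd.json_normalize(
    data['events'],
    record_path='tags',
    meta=['id'],
    meta_prefix='event_',
)

print(positions)
print(tags)
```

The long format keeps a variable number of positions per event without producing NaN-padded columns, at the cost of repeating the event fields on every row.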
