pandas: using json_normalize to flatten nested JSON

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/43536555/

Date: 2020-09-14 03:26:35  Source: igfitidea

Using json_normalize to flatten nested json

Tags: python, json, pandas, normalize

Asked by frli

I'm trying to flatten a json file using json_normalize in Python (Pandas), but being a noob at this I always seem to end up in a KeyError.


What I would like to achieve is a DataFrame with all the Plays in a game.


I've tried numerous variants of paths and prefixes, but no success. Googled a lot as well, but I'm still falling short.


What I would like to end up with is a DataFrame with columns like: period, time, type, player1, player2, xcoord, ycoord

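import json

import pandas as pd
# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize

# Load the raw play-by-play JSON
with open('PlayByPlay.json') as data_file:
    data = json.load(data_file)

# Normalizing the top-level dict gives a single row;
# the plays stay packed inside one cell of that row
records = json_normalize(data)

plays = records['data.game.plays.play'][0]
plays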

This would generate:

{'aoi': [8470324, 8473449, 8475158, 8475215, 8477499, 8477933],
 'apb': [],
 'as': 0,
 'asog': 0,
 'desc': 'Zack Kassian hit Kyle Okposo',
 'eventid': 7,
 'formalEventId': 'EDM7',
 'hoi': [8471678, 8475178, 8475660, 8476454, 8476457, 8476472],
 'hpb': [],
 'hs': 0,
 'hsog': 0,
 'localtime': '5:12 PM',
 'p1name': 'Zack Kassian',
 'p2name': 'Kyle Okposo',
 'p3name': '',
 'period': 1,
 'pid': 8475178,
 'pid1': 8475178,
 'pid2': 8473449,
 'pid3': '',
 'playername': 'Zack Kassian',
 'strength': 701,
 'sweater': '44',
 'teamid': 22,
 'time': '00:28',
 'type': 'Hit',
 'xcoord': 22,
 'ycoord': 38}

The JSON:

     {'data': {'game': {'awayteamid': 7,
   'awayteamname': 'Buffalo Sabres',
   'awayteamnick': 'Sabres',
   'hometeamid': 22,
   'hometeamname': 'Edmonton Oilers',
   'hometeamnick': 'Oilers',
   'plays': {'play': [{'aoi': [8470324,
       8473449,
       8475158,
       8475215,
       8477499,
       8477933],
      'apb': [],
      'as': 0,
      'asog': 0,
      'desc': 'Zack Kassian hit Kyle Okposo',
      'eventid': 7,
      'formalEventId': 'EDM7',
      'hoi': [8471678, 8475178, 8475660, 8476454, 8476457, 8476472],
      'hpb': [],
      'hs': 0,
      'hsog': 0,
      'localtime': '5:12 PM',
      'p1name': 'Zack Kassian',
      'p2name': 'Kyle Okposo',
      'p3name': '',
      'period': 1,
      'pid': 8475178,
      'pid1': 8475178,
      'pid2': 8473449,
      'pid3': '',
      'playername': 'Zack Kassian',
      'strength': 701,
      'sweater': '44',
      'teamid': 22,
      'time': '00:28',
      'type': 'Hit',
      'xcoord': 22,
      'ycoord': 38},
     {'aoi': [8471742, 8475179, 8475215, 8475220, 8475235, 8475728],
      'apb': [],
      'as': 0,
      'asog': 0,
      'desc': 'Jesse Puljujarvi Tip-In saved by Robin Lehner',
      'eventid': 59,
      'formalEventId': 'EDM59',
      'hoi': [8473468, 8474034, 8475660, 8477498, 8477934, 8479344],
      'hpb': [],
      'hs': 0,
      'hsog': 1,
      'localtime': '5:13 PM',
      'p1name': 'Jesse Puljujarvi',
      'p2name': 'Robin Lehner',
      'p3name': '',
      'period': 1,
      'pid': 8479344,
      'pid1': 8479344,
      'pid2': 8475215,
      'pid3': '',
      'playername': 'Jesse Puljujarvi',
      'strength': 701,
      'sweater': '98',
      'teamid': 22,
      'time': '01:32',
      'type': 'Shot',
      'xcoord': 81,
      'ycoord': 3}]}},
  'refreshInterval': 0}}

Answered by IanS

If you have only one game, this will create the dataframe you want:


json_normalize(data['data']['game']['plays']['play'])

Then you just need to extract the columns you're interested in.

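As a rough sketch of that last step, assuming the field names shown in the question's sample (with p1name and p2name standing in for the two players), the column selection could look like the following. The inline data is a trimmed copy of the question's sample, and renaming to player1/player2 is only an illustrative choice, not something the answer prescribes.

```python
import pandas as pd

# Trimmed copy of the sample from the question (two plays, a few fields each)
data = {'data': {'game': {'plays': {'play': [
    {'period': 1, 'time': '00:28', 'type': 'Hit',
     'p1name': 'Zack Kassian', 'p2name': 'Kyle Okposo',
     'xcoord': 22, 'ycoord': 38, 'teamid': 22},
    {'period': 1, 'time': '01:32', 'type': 'Shot',
     'p1name': 'Jesse Puljujarvi', 'p2name': 'Robin Lehner',
     'xcoord': 81, 'ycoord': 3, 'teamid': 22},
]}}, 'refreshInterval': 0}}

# One row per play (pd.json_normalize needs pandas >= 1.0)
plays = pd.json_normalize(data['data']['game']['plays']['play'])

# Keep only the columns of interest; the new names are an assumption
wanted = ['period', 'time', 'type', 'p1name', 'p2name', 'xcoord', 'ycoord']
result = plays[wanted].rename(columns={'p1name': 'player1', 'p2name': 'player2'})
print(result)
```

Columns that hold lists in the full data (such as aoi and hoi) would stay as list-valued cells, so dropping them through the column selection is usually the simplest way to deal with them.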

Answered by zinking

It can be unintuitive to use this API when the structure becomes complicated, but the key point is that json_normalize extracts JSON fields into a table.
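To illustrate what "extracting fields into a table" can mean for nested input, here is a minimal sketch using the record_path and meta parameters of json_normalize. The data is a trimmed copy of the question's sample; which game-level fields to carry along (the two team names here) is an assumption made only for the example.

```python
import pandas as pd

# Trimmed copy of the question's structure: game-level fields plus a list of plays
data = {'data': {'game': {
    'hometeamname': 'Edmonton Oilers',
    'awayteamname': 'Buffalo Sabres',
    'plays': {'play': [
        {'period': 1, 'time': '00:28', 'type': 'Hit'},
        {'period': 1, 'time': '01:32', 'type': 'Shot'},
    ]}}, 'refreshInterval': 0}}

plays = pd.json_normalize(
    data,
    # path to the list whose items become rows
    record_path=['data', 'game', 'plays', 'play'],
    # fields from higher levels to repeat on every row
    meta=[['data', 'game', 'hometeamname'], ['data', 'game', 'awayteamname']],
)
print(plays)
```

The meta columns are named by joining their path with the separator, so they show up as data.game.hometeamname and data.game.awayteamname next to the per-play fields.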

In my case, I have a table:

----------
|  fact  |  // each row is a json object {'a':a, 'b':b....}
----------

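frames = []
for index, row in data.iterrows():
    # each 'fact' cell is one JSON object, so this yields a one-row DataFrame
    frames.append(json_normalize(row['fact']))
# stack the one-row frames into a single table with a fresh index
result = pd.concat(frames, ignore_index=True)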
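For comparison, a possible shortcut is sketched below. It assumes the fact column already holds parsed dicts (JSON strings would need json.loads first); the small DataFrame is a hypothetical stand-in for the table above. Since json_normalize also accepts a list of dicts, the whole column can be passed in one call.

```python
import pandas as pd

# Hypothetical stand-in for the table above: a 'fact' column of dicts
data = pd.DataFrame({'fact': [
    {'a': 1, 'b': {'c': 2}},
    {'a': 3, 'b': {'c': 4}},
]})

# Loop version, mirroring the answer: one small frame per row, then concat
looped = pd.concat(
    [pd.json_normalize(row['fact']) for _, row in data.iterrows()],
    ignore_index=True,
)

# Single call: pass the whole column as a list of dicts
direct = pd.json_normalize(data['fact'].tolist())

print(direct)
```

Both versions should give the same table here; the single call avoids building many one-row frames, which tends to matter once the table has many rows.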