Python: How to flatten a Pandas dataframe with some columns as JSON?

Question

Asked by sfactor

I have a dataframe df that loads data from a database. Most of the columns are JSON strings, and some are even lists of JSON objects. For example:

id     name     columnA                               columnB
1     John     {"dist": "600", "time": "0:12.10"}    [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]
2     Mike     {"dist": "600"}                       [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "total", "value": "800"}]
...

As you can see, not all rows have the same number of elements in a given column's JSON strings.

What I need to do is keep the normal columns like id and name as they are and flatten the JSON columns like so:

id    name   columnA.dist   columnA.time   columnB.pos.1st   columnB.pos.2nd   columnB.pos.3rd     columnB.pos.total
1     John   600            0:12.10        500               300               200                 1000
2     Mike   600            NaN            500               300               NaN                 800

I have tried using json_normalize like so:

from pandas.io.json import json_normalize
json_normalize(df)

But this seems to fail with a KeyError. What is the correct way of doing this?

Answer 1

Answered by Nickil Maveli

Here's a solution that again uses json_normalize(), with custom functions to get the data into a format that json_normalize understands.

import ast
import pandas as pd

def only_dict(d):
    '''
    Convert the JSON string representation of a dictionary to a Python dict
    '''
    return ast.literal_eval(d)

def list_of_dicts(ld):
    '''
    Convert the JSON string of a list of {"pos": ..., "value": ...} objects
    to a single Python dict that maps each pos to its value
    '''
    # Look the keys up by name rather than by their position in d.values()
    return {d['pos']: d['value'] for d in ast.literal_eval(ld)}

# pd.json_normalize is the top-level function in pandas 1.0 and later
A = pd.json_normalize(df['columnA'].apply(only_dict).tolist()).add_prefix('columnA.')
B = pd.json_normalize(df['columnB'].apply(list_of_dicts).tolist()).add_prefix('columnB.pos.')

Finally, join the DataFrames on their common index to get:

df[['id', 'name']].join([A, B])

EDIT: As per the comment by @MartijnPieters, the recommended way to decode the JSON strings is json.loads(), which is much faster than ast.literal_eval() if you know that the data source is JSON.
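For reference, here is a rough sketch of the same approach with json.loads() swapped in for ast.literal_eval(). The sample frame and the helper name pos_to_value are my own, made up to mirror the data in the question, so treat this as an illustration rather than the answer's exact code.

```python
import json

import pandas as pd

# Hypothetical sample data mirroring the question; JSON columns are strings.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Mike"],
    "columnA": ['{"dist": "600", "time": "0:12.10"}', '{"dist": "600"}'],
    "columnB": [
        '[{"pos": "1st", "value": "500"}, {"pos": "2nd", "value": "300"}, '
        '{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]',
        '[{"pos": "1st", "value": "500"}, {"pos": "2nd", "value": "300"}, '
        '{"pos": "total", "value": "800"}]',
    ],
})


def pos_to_value(json_list):
    """Decode a JSON list of {"pos": ..., "value": ...} objects into one dict."""
    return {item["pos"]: item["value"] for item in json.loads(json_list)}


# Decode each column, normalize it, and prefix the new column names.
A = pd.json_normalize(df["columnA"].map(json.loads).tolist()).add_prefix("columnA.")
B = pd.json_normalize(df["columnB"].map(pos_to_value).tolist()).add_prefix("columnB.pos.")

result = df[["id", "name"]].join([A, B])
print(result)
```

Rows that lack a key (Mike has no "time" and no "3rd") end up with NaN in the corresponding column.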

Answer 2

Answered by piRSquared

Create a custom function to flatten columnB, then use pd.concat:

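import pandas as pd

# Assumes columnA holds dicts and columnB holds lists of dicts (already decoded)
def flatten(js):
    # Index by 'pos' so each position becomes a label, then squeeze to a Series
    return pd.DataFrame(js).set_index('pos').squeeze()

pd.concat([df.drop(['columnA', 'columnB'], axis=1),
           df.columnA.apply(pd.Series),
           df.columnB.apply(flatten)], axis=1)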
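If the columns still hold JSON strings, as in the question, they probably need decoding first. Below is a minimal sketch of that, assuming the sample data from the question; the json.loads step, the add_prefix calls, and the explicit ["value"] selection are my additions and not part of the answer.

```python
import json

import pandas as pd

# Hypothetical sample data mirroring the question; JSON columns are strings.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Mike"],
    "columnA": ['{"dist": "600", "time": "0:12.10"}', '{"dist": "600"}'],
    "columnB": [
        '[{"pos": "1st", "value": "500"}, {"pos": "2nd", "value": "300"}, '
        '{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]',
        '[{"pos": "1st", "value": "500"}, {"pos": "2nd", "value": "300"}, '
        '{"pos": "total", "value": "800"}]',
    ],
})


def flatten(js):
    # One Series per row: the index is the "pos" label, the values are "value".
    return pd.DataFrame(js).set_index("pos")["value"]


# Decode the JSON strings into Python dicts / lists of dicts.
decoded_a = df["columnA"].map(json.loads)
decoded_b = df["columnB"].map(json.loads)

result = pd.concat(
    [
        df.drop(columns=["columnA", "columnB"]),
        decoded_a.apply(pd.Series).add_prefix("columnA."),
        decoded_b.apply(flatten).add_prefix("columnB.pos."),
    ],
    axis=1,
)
print(result)
```

Note that apply(pd.Series) builds one Series per row, so it can be slow on large frames.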

Answer 3

Answered by staonas

The quickest way seems to be:

import json
import pandas as pd
json_struct = json.loads(df.to_json(orient="records"))
df_flat = pd.json_normalize(json_struct)  # pd.io.json.json_normalize in pandas before 1.0
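To see what this round trip does and does not flatten, here is a small sketch. It assumes the nested columns already hold Python dicts and lists (not JSON strings), and the sample frame is made up for illustration.

```python
import json

import pandas as pd

# Hypothetical data: nested columns hold Python objects, not JSON strings.
df = pd.DataFrame({
    "id": [1, 2],
    "columnA": [{"dist": "600", "time": "0:12.10"}, {"dist": "600"}],
    "columnB": [
        [{"pos": "1st", "value": "500"}],
        [{"pos": "1st", "value": "500"}],
    ],
})

# Round-trip through JSON records, then let json_normalize expand the dicts.
records = json.loads(df.to_json(orient="records"))
df_flat = pd.json_normalize(records)

print(df_flat.columns.tolist())
print(type(df_flat.loc[0, "columnB"]))
```

As far as I can tell, dict columns are expanded into dotted column names, while list columns such as columnB are left as lists and would still need separate handling.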

Answer 4

Answered by Michele Piccolini

TL;DR: Copy-paste the following function and use it like this: flatten_nested_json_df(df)

This is the most general function I could come up with:

import pandas as pd

def flatten_nested_json_df(df):

    df = df.reset_index()

    print(f"original shape: {df.shape}")
    print(f"original columns: {df.columns}")


    # search for columns to explode/flatten
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()

    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    print(f"lists: {list_columns}, dicts: {dict_columns}")
    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            print(f"flattening: {col}")
            # explode dictionaries horizontally, adding new columns
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace

        for col in list_columns:
            print(f"exploding: {col}")
            # explode lists vertically, adding new rows
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        # check if there are still dict or list fields to flatten
        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()

        print(f"lists: {list_columns}, dicts: {dict_columns}")

    print(f"final shape: {df.shape}")
    print(f"final columns: {df.columns}")
    return df

It takes a dataframe that may have nested lists and/or dicts in its columns, and repeatedly explodes/flattens those columns until none are left.

It uses pandas' pd.json_normalize to explode the dictionaries (creating new columns), and pandas' explode to explode the lists (creating new rows).
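A tiny sketch of the two building blocks in isolation may help. The sample values are made up; only pd.json_normalize and Series.explode come from the function above.

```python
import pandas as pd

# Dicts are expanded horizontally: one column per key, same number of rows.
dicts = pd.Series([{"dist": "600", "time": "0:12.10"}, {"dist": "600"}])
wide = pd.json_normalize(dicts.tolist())
print(wide)

# Lists are expanded vertically: one row per element, and the original
# index label is repeated for every element of the same list.
lists = pd.Series([["a", "b", "c"], ["d"]])
long = lists.explode()
print(long)
```

The repeated index labels after explode are what the function relies on when it joins the exploded column back onto the rest of the frame.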

It is simple to use:

# Test
df = pd.DataFrame(
    columns=['id','name','columnA','columnB'],
    data=[
        [1,'John',{"dist": "600", "time": "0:12.10"},[{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]],
        [2,'Mike',{"dist": "600"},[{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "total", "value": "800"}]]
    ])

flatten_nested_json_df(df)

It's not the most efficient thing on earth, and it has the side effect of resetting your dataframe's index, but it gets the job done. Feel free to tweak it.
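One thing to be aware of: because lists are exploded into rows, the result on the test data is in long format (one row per columnB entry), not the one-row-per-id layout from the question. Here is a sketch of one possible way to pivot it back. The long_df below is typed in by hand to resemble the function's output on the test data, and the pivot step is my own suggestion rather than part of the function.

```python
import pandas as pd

# Hand-written long-format frame resembling the function's output on the test
# data (the extra "index" column added by reset_index is left out here).
long_df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2],
    "name": ["John"] * 4 + ["Mike"] * 3,
    "columnA.dist": ["600"] * 7,
    "columnA.time": ["0:12.10"] * 4 + [None] * 3,
    "columnB.pos": ["1st", "2nd", "3rd", "total", "1st", "2nd", "total"],
    "columnB.value": ["500", "300", "200", "1000", "500", "300", "800"],
})

# One row per id for the columns that were never exploded.
scalar_cols = ["id", "name", "columnA.dist", "columnA.time"]
base = long_df[scalar_cols].drop_duplicates("id").set_index("id")

# Turn each "pos" label into its own column, holding the matching "value".
positions = long_df.pivot(
    index="id", columns="columnB.pos", values="columnB.value"
).add_prefix("columnB.pos.")

wide_df = base.join(positions).reset_index()
print(wide_df)
```

This assumes id uniquely identifies a row of the original frame and that each pos appears at most once per id; pivot raises an error on duplicate id/pos pairs.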
