Python: Splitting a Dictionary/List Inside a Pandas Column into Separate Columns

Note: the content below is from a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/38231591/

Tags: python, pandas, dictionary, dataframe

Asked by llaffin

I have data saved in a PostgreSQL database. I am querying this data using Python 2.7 and turning it into a Pandas DataFrame. However, the last column of this DataFrame has a dictionary (or list?) of values within it. The DataFrame looks like this:

[1] df
Station ID     Pollutants
8809           {"a": "46", "b": "3", "c": "12"}
8810           {"a": "36", "b": "5", "c": "8"}
8811           {"b": "2", "c": "7"}
8812           {"c": "11"}
8813           {"a": "82", "c": "15"}

I need to split this column into separate columns so that the DataFrame looks like this:

[2] df2
Station ID     a      b       c
8809           46     3       12
8810           36     5       8
8811           NaN    2       7
8812           NaN    NaN     11
8813           82     NaN     15

The major issue I'm having is that the lists are not the same length. But all of the lists only contain up to the same 3 values: a, b, and c. And they always appear in the same order (a first, b second, c third).

The following code USED to work and return exactly what I wanted (df2).

[3] df 
[4] objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
[5] df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
[6] print(df2)

I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:

IndexError: out-of-bounds on slice (end) 

I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.

Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!

EDIT: I think the .tolist() and .apply methods are not working on my code because it is one unicode string, i.e.:

#My data format 
u{'a': '1', 'b': '2', 'c': '3'}

#and not
{u'a': '1', u'b': '2', u'c': '3'}

The data is imported from the PostgreSQL database in this format. Any help or ideas on this issue? Is there a way to convert the Unicode?

Answered by joris

To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.

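As a quick sketch of that first conversion step (this assumes the column holds string representations of dicts, as in the question, and uses ast.literal_eval as a safer stand-in for eval):

import ast
import pandas as pd

# hypothetical data with dict-like strings, shaped like the question's column
df = pd.DataFrame({'Pollutant Levels': ["{'a': '46', 'b': '3'}", "{'c': '11'}"]})

# parse each string into a real dict; literal_eval only accepts Python literals
df['Pollutant Levels'] = df['Pollutant Levels'].map(ast.literal_eval)
print(df['Pollutant Levels'].apply(pd.Series))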



Using a small example, you can use .apply(pd.Series):

In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

In [3]: df
Out[3]:
   a                   b
0  1           {u'c': 1}
1  2           {u'd': 3}
2  3  {u'c': 5, u'd': 6}

In [4]: df['b'].apply(pd.Series)
Out[4]:
     c    d
0  1.0  NaN
1  NaN  3.0
2  5.0  6.0

To combine it with the rest of the dataframe, you can concat the other columns with the above result:

In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0


Using your code, this also works if I leave out the iloc part:

In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0

Answered by Lech Birek

I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way of doing this now, using json_normalize:

import pandas as pd

df2 = pd.json_normalize(df['Pollutant Levels'])

This avoids costly apply functions...

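A rough end-to-end sketch with data shaped like the question's (this assumes the Pollutants column already holds dicts; if it holds strings, parse them first, e.g. with map(eval) as in the answer above):

import pandas as pd

df = pd.DataFrame({
    'Station ID': [8809, 8810, 8811],
    'Pollutants': [{'a': '46', 'b': '3', 'c': '12'},
                   {'a': '36', 'b': '5', 'c': '8'},
                   {'b': '2', 'c': '7'}],
})

# normalize the dict column into its own frame, then join it back on the index
expanded = pd.json_normalize(df['Pollutants'])
df2 = df.drop(columns='Pollutants').join(expanded)
print(df2)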

Answered by Merlin

Try this: the data returned from SQL has to be converted into a dict. Or could it be that "Pollutant Levels" is now "Pollutants"?

   StationID                   Pollutants
0       8809  {"a":"46","b":"3","c":"12"}
1       8810   {"a":"36","b":"5","c":"8"}
2       8811            {"b":"2","c":"7"}
3       8812                   {"c":"11"}
4       8813          {"a":"82","c":"15"}


df2["Pollutants"] = df2["Pollutants"].apply(lambda x : dict(eval(x)) )
df3 = df2["Pollutants"].apply(pd.Series )

    a    b   c
0   46    3  12
1   36    5   8
2  NaN    2   7
3  NaN  NaN  11
4   82  NaN  15


result = pd.concat([df, df3], axis=1).drop('Pollutants', axis=1)
result

   StationID    a    b   c
0       8809   46    3  12
1       8810   36    5   8
2       8811  NaN    2   7
3       8812  NaN  NaN  11
4       8813   82  NaN  15

Answered by Hafizur Rahman

Merlin's answer is better and super easy, but we don't need a lambda function. The evaluation of the dictionary can be safely skipped in either of the following two ways, as illustrated below:

Way 1: Two steps

# step 1: expand the `Pollutants` dict column into its own DataFrame
df_pol_ps = df['Pollutants'].apply(pd.Series)

df_pol_ps:
    a   b   c
0   46  3   12
1   36  5   8
2   NaN 2   7
3   NaN NaN 11
4   82  NaN 15

# step 2: concat the new columns `a, b, c` with `df` and drop the original `Pollutants` column
df_final = pd.concat([df, df_pol_ps], axis = 1).drop('Pollutants', axis = 1)

df_final:
    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15

Way 2: The above two steps can be combined in one go:

df_final = pd.concat([df, df['Pollutants'].apply(pd.Series)], axis = 1).drop('Pollutants', axis = 1)

df_final:
    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15

Answered by user9815968

I strongly recommend this method to extract the column 'Pollutants':

df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)

It's much faster than

df_pollutants = df['Pollutants'].apply(pd.Series)

when the size of df is giant.

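A rough way to check this claim yourself (a sketch with a made-up dict column; exact timings will vary by machine and data):

import timeit
import pandas as pd

# synthetic frame with a dict column, just for timing
df = pd.DataFrame({'Pollutants': [{'a': '46', 'b': '3', 'c': '12'}] * 10000})

t_tolist = timeit.timeit(
    lambda: pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index), number=3)
t_apply = timeit.timeit(
    lambda: df['Pollutants'].apply(pd.Series), number=3)
print('values.tolist():', round(t_tolist, 3), 's vs apply(pd.Series):', round(t_apply, 3), 's')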

Answered by jpp

You can use join with pop + tolist. Performance is comparable to concat with drop + tolist, but some may find this syntax cleaner:

res = df.join(pd.DataFrame(df.pop('b').tolist()))

Benchmarking with other methods:

df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

def joris1(df):
    return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)

def joris2(df):
    return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)

def jpp(df):
    return df.join(pd.DataFrame(df.pop('b').tolist()))

df = pd.concat([df]*1000, ignore_index=True)

%timeit joris1(df.copy())  # 1.33 s per loop
%timeit joris2(df.copy())  # 7.42 ms per loop
%timeit jpp(df.copy())     # 7.68 ms per loop

Answered by Jaroslav Bezděk

A one-line solution is the following:

>>> df = pd.concat([df['Station ID'], df['Pollutants'].apply(pd.Series)], axis=1)
>>> print(df)
   Station ID    a    b   c
0        8809   46    3  12
1        8810   36    5   8
2        8811  NaN    2   7
3        8812  NaN  NaN  11
4        8813   82  NaN  15

Answered by Emanuel Fontelles

I've combined those steps into a method; you only have to pass the dataframe and the column which contains the dict to expand:

import json
from typing import Dict

import pandas as pd


def expand_dataframe(dw: pd.DataFrame, column_to_expand: str) -> pd.DataFrame:
    """
    dw: DataFrame with some column which contains a dict (stored as a string)
        to expand into columns
    column_to_expand: string with the column name of dw
    """

    def convert_to_dict(sequence: str) -> Dict:
        # the dicts are stored as single-quoted strings, so make them valid JSON first
        json_acceptable_string = sequence.replace("'", "\"")
        return json.loads(json_acceptable_string)

    expanded_dataframe = pd.concat([dw.drop([column_to_expand], axis=1),
                                    dw[column_to_expand]
                                    .apply(convert_to_dict)
                                    .apply(pd.Series)],
                                   axis=1)
    return expanded_dataframe
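
A usage sketch, assuming df has a Pollutants column holding dict-like strings as in the question:

result = expand_dataframe(df, 'Pollutants')
print(result)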

Answered by Siraj S.

In one line:

df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)