Pandas: convert a column of lists to dummies

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/29034928/

Time: 2020-09-13 23:02:47  Source: igfitidea

Tags: python, pandas

Asked by user2900369

I have a dataframe where one column is a list of groups each of my users belongs to. Something like:

index groups  
0     ['a','b','c']
1     ['c']
2     ['b','c','e']
3     ['a','c']
4     ['b','e']

What I would like to do is create a series of dummy columns that identify which groups each user belongs to, so that I can run some analyses:

index  a   b   c   d   e
0      1   1   1   0   0
1      0   0   1   0   0
2      0   1   1   0   1
3      1   0   1   0   0
4      0   1   0   0   1


pd.get_dummies(df['groups'])

won't work because that just returns a column for each different list in my column.

The solution needs to be efficient as the dataframe will contain 500,000+ rows. Any advice would be appreciated!

Answered by joris

Using s for your df['groups']:

In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })

In [22]: s
Out[22]:
0    [a, b, c]
1          [c]
2    [b, c, e]
3       [a, c]
4       [b, e]
dtype: object

This is a possible solution:

In [23]: pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
Out[23]:
   a  b  c  e
0  1  1  1  0
1  0  0  1  0
2  0  1  1  1
3  1  0  1  0
4  0  1  0  1

The logic of this is:

  • .apply(pd.Series) converts the series of lists to a dataframe
  • .stack() puts everything in one column again (creating a multi-level index)
  • pd.get_dummies( ) creates the dummies
  • .sum(level=0) re-merges the rows that should be one row (by summing over the second level and keeping only the original level, level=0)
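
The steps above rely on sum(level=0), which recent pandas releases no longer accept. A minimal sketch of the same idea for newer versions, assuming Series.explode is available (pandas 0.25 or later); note that explode is swapped in here for .apply(pd.Series).stack() and is not part of the original answer:

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

# One row per list element; the original index label is repeated for each element
long_form = s.explode()

# One indicator column per distinct element
dummies = pd.get_dummies(long_form)

# Collapse the rows that share an original index label back into a single row
result = dummies.groupby(level=0).sum().astype(int)
print(result)
```

groupby(level=0).sum() plays the role of sum(level=0) here; the final astype(int) only normalizes the dtype, since get_dummies returns different dtypes across pandas versions.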

A slightly different equivalent is pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1)

I don't know whether this will be efficient enough, but in any case, if performance is important, storing lists in a dataframe is not a very good idea.

Answered by Teoretic

A very fast solution in case you have a large dataframe, using sklearn.preprocessing.MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame(
    {'groups':
        [['a','b','c'],
        ['c'],
        ['b','c','e'],
        ['a','c'],
        ['b','e']]
    }, columns=['groups'])

s = df['groups']

mlb = MultiLabelBinarizer()

pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_, index=df.index)

Result:

    a   b   c   e
0   1   1   1   0
1   0   0   1   0
2   0   1   1   1
3   1   0   1   0
4   0   1   0   1

This worked for me, and it was also suggested here and here.
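
The desired output in the question includes a column d even though no user belongs to that group. A hedged sketch of how that could be obtained with MultiLabelBinarizer, assuming the full list of possible groups is known in advance (the all_groups list below is an assumption for illustration, not something the original answer provides):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# Assumption: every possible group is known up front, including unused ones
all_groups = ['a', 'b', 'c', 'd', 'e']

# Passing classes fixes both the set and the order of the output columns
mlb = MultiLabelBinarizer(classes=all_groups)
dummies = pd.DataFrame(mlb.fit_transform(df['groups']),
                       columns=mlb.classes_, index=df.index)

# Attach the indicator columns to the original frame
result = df.join(dummies)
print(result)
```

Without the classes argument, the columns are inferred from the data, so a group that never occurs (such as d) would not appear.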

Answered by Paulo Alves

Even though this question has already been answered, I have a faster solution:

df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')

And, in case you have empty groups or NaN, you could just:

df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
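
Filtering this way drops the rows with empty or missing groups from the result. A minimal sketch of one way to bring them back as all-zero rows, assuming a small hypothetical frame with one empty list and one NaN; the reindex step and the use of .fillna(0).astype(int) instead of downcast='infer' are additions for illustration, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has an empty list, row 2 is missing
df = pd.DataFrame({'groups': [['a', 'b'], [], np.nan, ['b', 'c']]})

# str.len() returns the list length, and NaN for missing values,
# so the comparison is False for both empty and missing rows
mask = df['groups'].str.len() > 0

# Build the indicator columns only for the rows that have groups
dummies = df.loc[mask, 'groups'].apply(lambda x: pd.Series(1, index=x))
dummies = dummies.fillna(0).astype(int)

# Restore the dropped rows as all-zero rows
result = dummies.reindex(df.index, fill_value=0)
print(result)
```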

How it works

Inside the lambda, x is your list, for example ['a', 'b', 'c']. So pd.Series will be as follows:

In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]: 
a    1
b    1
c    1
dtype: int64

When all the pd.Series come together, they become a pd.DataFrame and their index becomes the columns; a label missing from one Series' index becomes a column holding NaN for that row, as you can see next:

In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]: 
     a    b    c    d
0  1.0  1.0  1.0  NaN
1  1.0  1.0  NaN  1.0

Now fillna fills those NaN with 0:

In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]: 
     a    b    c    d
0  1.0  1.0  1.0  0.0
1  1.0  1.0  0.0  1.0

And downcast='infer' downcasts from float to int:

In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]: 
   a  b  c  d
0  1  1  1  0
1  1  1  0  1

P.S.: Using .fillna(0, downcast='infer') is not required.
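
Newer pandas releases may warn about, or no longer accept, the downcast argument of fillna. A minimal sketch of an alternative that should give the same integer result, assuming every remaining value is a whole number after filling (the explicit astype(int) is a substitution, not the original answer's code):

```python
import pandas as pd

a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])

# Aligning the two Series introduces NaN, which makes the frame float
frame = pd.DataFrame([a, b])

# Fill the gaps with 0, then convert back to integers explicitly
result = frame.fillna(0).astype(int)
print(result)
```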