Pandas: convert a column of lists to dummies

Question

Asked by user2900369

I have a dataframe where one column is a list of groups each of my users belongs to. Something like:

index groups  
0     ['a','b','c']
1     ['c']
2     ['b','c','e']
3     ['a','c']
4     ['b','e']

What I would like to do is create a series of dummy columns that identify which groups each user belongs to, so that I can run some analyses:

index  a   b   c   d   e
0      1   1   1   0   0
1      0   0   1   0   0
2      0   1   1   0   1
3      1   0   1   0   0
4      0   1   0   0   1


pd.get_dummies(df['groups'])

won't work because that just returns a column for each different list in my column.

The solution needs to be efficient as the dataframe will contain 500,000+ rows. Any advice would be appreciated!

Answer 1

Answered by joris

Using s for your df['groups']:

In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })

In [22]: s
Out[22]:
0    [a, b, c]
1          [c]
2    [b, c, e]
3       [a, c]
4       [b, e]
dtype: object

This is a possible solution:

In [23]: pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
Out[23]:
   a  b  c  e
0  1  1  1  0
1  0  0  1  0
2  0  1  1  1
3  1  0  1  0
4  0  1  0  1

The logic of this is:

.apply(pd.Series) converts the series of lists to a dataframe
.stack() puts everything in one column again (creating a multi-level index)
pd.get_dummies( ) creates the dummies
.sum(level=0) re-merges the rows that should be one row (by summing over the second level and keeping only the original level, level=0)
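
On newer pandas versions, where sum no longer accepts the level argument (it was deprecated and then removed in pandas 2.0), the same four steps can be sketched with groupby(level=0).sum(). The intermediate variable names and dtype=int are my own additions for illustration, not part of the original answer:

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

# Step 1: one column per list position; shorter lists are padded with NaN
wide = s.apply(pd.Series)

# Step 2: back to a single column, indexed by (original row, list position)
stacked = wide.stack()

# Step 3: one indicator column per distinct group label
# (dtype=int is an assumption so the result holds 0/1 rather than booleans)
dummies = pd.get_dummies(stacked, dtype=int)

# Step 4: collapse the rows that came from the same original row
result = dummies.groupby(level=0).sum()
print(result)
```

Grouping on level 0 of the index plays the role that level=0 played in the original one-liner.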

A slight variation that gives the same result is pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1)
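
The column-wise form of sum with level=0 is likewise unavailable on recent pandas. A rough equivalent, assuming the goal is simply to merge the repeated column labels that get_dummies produces here, is to group the transposed frame by label (the variable names and dtype=int are mine):

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

# get_dummies on the wide frame yields repeated labels,
# e.g. one 'c' column for each list position where 'c' occurs
wide_dummies = pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='', dtype=int)

# Sum the columns that share a label: transpose, group rows by label, transpose back
result = wide_dummies.T.groupby(level=0).sum().T
print(result)
```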

I don't know whether this will be efficient enough, but in any case, if performance is important, storing lists in a dataframe is not a very good idea.

Answer 2

Answered by Teoretic

Very fast solution in case you have a large dataframe

Using sklearn.preprocessing.MultiLabelBinarizer

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame(
    {'groups':
        [['a','b','c'],
        ['c'],
        ['b','c','e'],
        ['a','c'],
        ['b','e']]
    }, columns=['groups'])

s = df['groups']

mlb = MultiLabelBinarizer()

pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_, index=df.index)

Result:

    a   b   c   e
0   1   1   1   0
1   0   0   1   0
2   0   1   1   1
3   1   0   1   0
4   0   1   0   1

This worked for me, and it was also suggested here and here
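
The desired output in the question also has a d column that is all zeros. As a sketch, assuming the full set of group labels is known up front (the list all_groups below is made up for illustration), MultiLabelBinarizer takes a classes argument that fixes the columns and their order:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# Hypothetical full label set, including 'd', which no user belongs to
all_groups = ['a', 'b', 'c', 'd', 'e']

# Passing classes keeps a column for every listed label, even unused ones
mlb = MultiLabelBinarizer(classes=all_groups)
dummies = pd.DataFrame(mlb.fit_transform(df['groups']),
                       columns=mlb.classes_, index=df.index)
print(dummies)
```

Without classes, only labels that actually occur in the data get a column, which is why d is absent from the result above.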

Answer 3

Answered by Paulo Alves

Even though this question has already been answered, I have a faster solution:

df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')

And in case you have empty groups or NaN, you could just filter those rows out first:

df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
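
Filtering this way drops the empty and NaN rows from the result. If you would rather keep them as all-zero rows, one option is to reindex back to the original index afterwards. A minimal sketch, where the sample data, the mask name, and the use of astype(int) in place of downcast='infer' are my own assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical data with an empty list and a NaN entry
df = pd.DataFrame({'groups': [['a', 'b'], [], np.nan, ['b', 'c']]})

# True only for rows holding a non-empty list (NaN > 0 evaluates to False)
mask = df['groups'].str.len() > 0

dummies = (
    df.loc[mask, 'groups']
      .apply(lambda x: pd.Series([1] * len(x), index=x))
      .fillna(0)
      .astype(int)
      # Bring back the filtered-out rows as all zeros
      .reindex(df.index, fill_value=0)
)
print(dummies)
```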

How it works

Inside the lambda, x is your list, for example ['a', 'b', 'c']. So the pd.Series will be as follows:

In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]: 
a    1
b    1
c    1
dtype: int64

When all the pd.Series come together, they become a pd.DataFrame and their index labels become columns; a label missing from one Series shows up as NaN in that row, as you can see next:

In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]: 
     a    b    c    d
0  1.0  1.0  1.0  NaN
1  1.0  1.0  NaN  1.0

Now fillna fills those NaN with 0:

In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]: 
     a    b    c    d
0  1.0  1.0  1.0  0.0
1  1.0  1.0  0.0  1.0

And downcast='infer' downcasts from float to int:

In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]: 
   a  b  c  d
0  1  1  1  0
1  1  1  0  1

PS: Using .fillna(0, downcast='infer') is not required.
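
As far as I know, the downcast keyword of fillna has been deprecated in recent pandas releases, so on newer versions an explicit astype(int) is a safer way to get integers. A minimal sketch of the same approach, with df rebuilt from the question's data:

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

dummies = (
    df['groups']
      # One Series of ones per row, labelled by that row's groups
      .apply(lambda x: pd.Series([1] * len(x), index=x))
      # Labels absent from a row come out as NaN; turn them into 0
      .fillna(0)
      # Explicit cast instead of downcast='infer'
      .astype(int)
)
print(dummies)
```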
