Original question: http://stackoverflow.com/questions/18889588/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Create dummies from column with multiple values in pandas
Asked by mkln
I am looking for a pythonic way to handle the following problem.
The pandas.get_dummies() function is great for creating dummies from a categorical column of a dataframe. For example, if the column has values in ['A', 'B'], get_dummies() creates 2 dummy variables and assigns 0 or 1 accordingly.
Now, I need to handle this situation. A single column, let's call it 'label', has values like ['A', 'B', 'C', 'D', 'A*C', 'C*D']. get_dummies() creates 6 dummies, but I only want 4 of them, so that a row could have multiple 1s.
Is there a way to handle this in a pythonic way? I could only think of some step-by-step algorithm to get it, but that would not include get_dummies(). Thanks
Edited, hope it is more clear!
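To make the problem concrete, here is a minimal sketch. It assumes the frame is built from the six values listed in the question; the variable name naive is only for illustration.

```python
import pandas as pd

# The 'label' column from the question: some rows hold two atoms joined by '*'
df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# get_dummies() treats every distinct string as its own category,
# so 'A*C' and 'C*D' each get a separate column
naive = pd.get_dummies(df["label"])
print(naive.columns.tolist())
print(naive.shape)
```

The result has one column per distinct string (six here), whereas the goal is one column per atom (four), with rows such as 'A*C' carrying two 1s.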
Accepted answer by offbyone
I know it's been a while since this question was asked, but there is (at least now there is) a one-liner that is supported by the documentation:
In [4]: df
Out[4]:
       label
0  (a, c, e)
1     (a, d)
2       (b,)
3     (d, e)

In [5]: df['label'].str.join(sep='*').str.get_dummies(sep='*')
Out[5]:
   a  b  c  d  e
0  1  0  1  0  1
1  1  0  0  1  0
2  0  1  0  0  0
3  0  0  0  1  1
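The example above starts from tuples, so it joins them first. When the column already holds '*'-separated strings, as in the question, the join step should not be needed. A minimal sketch, assuming the question's six values:

```python
import pandas as pd

# Labels are already '*'-separated strings, so no str.join() is required
df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Split each string on '*' and build one indicator column per atom
dummies = df["label"].str.get_dummies(sep="*")
print(dummies)
```

Note that the default separator of Series.str.get_dummies() is '|', so sep='*' has to be passed explicitly for this data.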
Answer by Boud
You can generate the dummies dataframe from your raw data, isolate the columns that contain a given atom, and then store the summed matches in that atom's column.
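In [28]: df
Out[28]:
  label
0     A
1     B
2     C
3     D
4   A*C
5   C*D

In [29]: dummies = pd.get_dummies(df['label'])

In [30]: atom_col = [c for c in dummies.columns if '*' not in c]

In [31]: for col in atom_col:
    ...:     # sum the dummy columns whose '*'-separated parts include this atom
    ...:     # (splitting avoids substring matches such as 'A' in 'AB')
    ...:     df[col] = dummies[[c for c in dummies.columns if col in c.split('*')]].sum(axis=1)
    ...:

In [32]: df
Out[32]:
  label  A  B  C  D
0     A  1  0  0  0
1     B  0  1  0  0
2     C  0  0  1  0
3     D  0  0  0  1
4   A*C  1  0  1  0
5   C*D  0  0  1  1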
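A related pandas-only variant is sketched below. It is not the method from this answer: it swaps the column scan for Series.explode() (available from pandas 0.25), and the names atoms, dummies and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# One row per atom; the original row number is repeated in the index
atoms = df["label"].str.split("*").explode()

# Dummy-encode the atoms, then collapse back to one row per original index
dummies = pd.get_dummies(atoms, dtype=int).groupby(level=0).max()

# Attach the indicator columns to the original frame
result = df.join(dummies)
print(result)
```

Unlike the loop above, this also picks up atoms that never appear on their own in the column.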
Answer by ariddell
I have a somewhat cleaner solution. Assume we want to transform the following dataframe:
   pageid category
0       0        a
1       0        b
2       1        a
3       1        c

into

        a  b  c
pageid
0       1  1  0
1       1  0  1
One way to do it is to make use of scikit-learn's DictVectorizer. I would, however, be interested in learning about other methods.
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=['a', 'b', 'a', 'c']))
# one tuple of (category, 1) pairs per page
grouped = df.groupby('pageid').category.apply(lambda lst: tuple((k, 1) for k in lst))
category_dicts = [dict(tuples) for tuples in grouped]
v = DictVectorizer(sparse=False)
X = v.fit_transform(category_dicts)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X, columns=v.get_feature_names_out(), index=grouped.index)
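As one possible "other method", the same table can probably be produced with pandas alone. This sketch uses pd.crosstab(), which is not part of the answer above; the name indicator is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=['a', 'b', 'a', 'c']))

# crosstab counts how often each (pageid, category) pair occurs;
# clip(upper=1) turns the counts into 0/1 indicators in case of repeats
indicator = pd.crosstab(df['pageid'], df['category']).clip(upper=1)
print(indicator)
```

This keeps pageid as the index and yields integer columns, so no conversion back from a NumPy array is needed.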
Answer by Chris Farr
I believe this question needs an updated answer after coming across the MultiLabelBinarizer from sklearn.
The usage of this is as simple as...
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Instantiate the binarizer
mlb = MultiLabelBinarizer()
# Using OP's original data frame
df = pd.DataFrame(data=['A', 'B', 'C', 'D', 'A*C', 'C*D'], columns=["label"])
print(df)
  label
0     A
1     B
2     C
3     D
4   A*C
5   C*D
# Convert to a list of labels (note: df is now a Series of lists)
df = df.apply(lambda x: x["label"].split("*"), axis=1)
print(df)
0       [A]
1       [B]
2       [C]
3       [D]
4    [A, C]
5    [C, D]
dtype: object
# Transform to a binary array
array_out = mlb.fit_transform(df)
print(array_out)
[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [1 0 1 0]
 [0 0 1 1]]
# Convert back to a dataframe (unnecessary step in many cases)
df_out = pd.DataFrame(data=array_out, columns=mlb.classes_)
print(df_out)
   A  B  C  D
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  1  0
5  0  0  1  1
This is also very fast: it took virtually no time (0.03 seconds) across 1,000 rows and 50K classes.
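If the indicator columns should end up next to the original labels, something along these lines may work. It is only a sketch: the names encoded, dummies and result are my own, and passing index=df.index is an assumption about wanting to keep the original row labels.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

mlb = MultiLabelBinarizer()
# Each row becomes a list of atoms, which fit_transform() binarizes
encoded = mlb.fit_transform(df["label"].str.split("*"))

# Reuse the original index so the join lines up row by row
dummies = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
result = df.join(dummies)
print(result)
```

Keeping the original index matters when the frame has been filtered or re-indexed; a default RangeIndex on the dummies would otherwise misalign the join.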