pandas: dummy variables when not all categories are present
Original question: http://stackoverflow.com/questions/37425961/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Dummy variables when not all categories are present
Asked by Berne
I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies.
What happens is that get_dummies looks at the data available in each dataframe to find out how many categories there are, and thus creates the appropriate number of dummy variables. However, in the problem I'm working on right now, I actually know in advance what the possible categories are. But when looking at each dataframe individually, not all categories necessarily appear.
My question is: is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?
Something that would make this:
categories = ['a', 'b', 'c']
   cat
1    a
2    b
3    a
become this:
   cat_a  cat_b  cat_c
1      1      0      0
2      0      1      0
3      1      0      0
Accepted answer by piRSquared
Using transpose and reindex
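import pandas as pd
cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})
# dtype=float keeps the dummies numeric on newer pandas, where the default dtype is bool
dummies = pd.get_dummies(df, prefix='', prefix_sep='', dtype=float)
# Transpose, reindex the rows to the full category list, transpose back, fill the gaps with 0
dummies = dummies.T.reindex(cats).T.fillna(0)
print(dummies)
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0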
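The double transpose is not strictly needed. As a minimal sketch (a variant on the idea above, not the answerer's exact code), reindex can be applied to the columns directly, with fill_value supplying the zeros; the dtype=int argument is an assumption made here to keep the output numeric:

```python
import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

# dtype=int keeps the result numeric regardless of the pandas default dtype.
dummies = pd.get_dummies(df['cat'], dtype=int)

# Reindex the columns directly: missing categories are added and filled with 0.
dummies = dummies.reindex(columns=cats, fill_value=0)
print(dummies)
```

Note that reindexing the columns this way would also drop any dummy column whose name is not in cats.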
Answer by T.C. Proctor
TL;DR:
pd.get_dummies(cat.astype(pd.CategoricalDtype(categories=categories)))
- Older pandas:
pd.get_dummies(cat.astype('category', categories=categories))
is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?
Yes, there is! Pandas has a special type of Series just for categorical data. One of the attributes of this series is the possible categories, which get_dummies takes into account. Here's an example:
In [1]: import pandas as pd
In [2]: possible_categories = list('abc')
In [3]: cat = pd.Series(list('aba'))
In [4]: cat = cat.astype(pd.CategoricalDtype(categories=possible_categories))
In [5]: cat
Out[5]:
0 a
1 b
2 a
dtype: category
Categories (3, object): [a, b, c]
Then, get_dummies will do exactly what you want!
In [6]: pd.get_dummies(cat)
Out[6]:
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
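Since the question involves a set of dataframes, the same dtype object can be defined once and reused. The following is a rough sketch under assumed data: the two frames, the prefix='cat' naming, and dtype=int are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: two frames that each see only some of the categories.
cat_dtype = pd.CategoricalDtype(categories=['a', 'b', 'c'])
frames = [
    pd.DataFrame({'cat': ['a', 'b', 'a']}),
    pd.DataFrame({'cat': ['c', 'c']}),
]

# Cast the column to the shared dtype, then encode; every frame gets
# one dummy column per declared category, observed or not.
encoded = [
    pd.get_dummies(frame.astype({'cat': cat_dtype}), prefix='cat', dtype=int)
    for frame in frames
]
for dummies in encoded:
    print(list(dummies.columns))
```

Each encoded frame should end up with the same three columns, which matches the cat_a, cat_b, cat_c layout asked for in the question.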
There are a bunch of other ways to create a categorical Series or DataFrame; this is just the one I find most convenient. You can read about all of them in the pandas documentation.
EDIT:
I haven't followed the exact versioning, but there was a bug in how pandas treats sparse matrices, at least until version 0.17.0. It was corrected by version 0.18.1 (released May 2016).
For version 0.17.0, if you try to do this with the sparse=True option with a DataFrame, the column of zeros for the missing dummy variable will be a column of NaN, and it will be converted to dense.
It looks like pandas 0.21.0 added a CategoricalDtype, and creating categoricals by passing the categories directly to astype, as in the original answer, was deprecated; I'm not quite sure when.
Answer by Kapil Sharma
Try this:
In [1]: import pandas as pd
In [2]: cats = ["a", "b", "c"]
In [3]: df = pd.DataFrame({"cat": ["a", "b", "a"]})
In [4]: pd.concat((pd.get_dummies(df["cat"]), pd.DataFrame(columns=cats))).fillna(0)
Out[4]:
     a    b  c
0  1.0  0.0  0
1  0.0  1.0  0
2  1.0  0.0  0
Answer by Stefan
I don't think get_dummies provides this out of the box; it only allows for creating an extra column that highlights NaN values.
To add the missing columns yourself, you could use pd.concat along axis=0 to vertically 'stack' the DataFrames (the dummy columns plus a DataFrame id), which automatically creates any missing columns, then use fillna(0) to replace missing values, and finally use .groupby('id') to separate the various DataFrames again.
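The stacking approach described above can be sketched roughly as follows. The frame names, the sample data, and the dtype=float argument are assumptions for illustration. Note that this only produces columns for categories that appear in at least one of the stacked frames.

```python
import pandas as pd

# Hypothetical input: two frames with different observed categories.
frames = {
    'first': pd.DataFrame({'cat': ['a', 'b', 'a']}),
    'second': pd.DataFrame({'cat': ['c', 'a']}),
}

# Encode each frame separately and tag its rows with a frame id.
pieces = [
    pd.get_dummies(frame['cat'], dtype=float).assign(id=name)
    for name, frame in frames.items()
]

# Stacking along axis=0 creates the union of columns; gaps become NaN.
stacked = pd.concat(pieces, axis=0).fillna(0)

# Split back into one frame per id, dropping the helper column.
separated = {
    name: group.drop(columns='id')
    for name, group in stacked.groupby('id')
}
print(separated['first'])
```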
Answer by andre
I did ask this on the pandas GitHub. It turns out it is really easy to get around when you define the column as a Categorical in which you list all the possible categories.
df['col'] = pd.Categorical(df['col'], categories=['a', 'b', 'c', 'd'])
get_dummies() will then do the rest as expected.
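A small end-to-end sketch of this, assuming a hypothetical frame with one categorical column and one numeric column (the 'value' column and dtype=int are illustrative additions):

```python
import pandas as pd

# Hypothetical frame: only 'a' and 'b' occur, but four categories are possible.
df = pd.DataFrame({'col': ['a', 'b', 'a'], 'value': [1, 2, 3]})
df['col'] = pd.Categorical(df['col'], categories=['a', 'b', 'c', 'd'])

# Categorical columns are encoded by default; other columns pass through.
dummies = pd.get_dummies(df, dtype=int)
print(dummies)
```

The unobserved categories 'c' and 'd' should each appear as an all-zero column.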
Answer by Rudr
As suggested by others, converting your categorical features to the 'category' data type should resolve the unseen-label issue when using get_dummies.
# Your data frame (df)
import pandas as pd
from sklearn.model_selection import train_test_split
X = df.loc[:, df.columns != 'label']
Y = df.loc[:, df.columns == 'label']
# Split the data into 70% training and 30% test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
# Convert the categorical (object) feature columns to type 'category',
# using the categories seen in the full data frame
categorical_cols = list(X.select_dtypes(include=['object']).columns)
for col in categorical_cols:
    cat_dtype = pd.CategoricalDtype(categories=df[col].dropna().unique())
    X_train[col] = X_train[col].astype(cat_dtype)
    X_test[col] = X_test[col].astype(cat_dtype)
# Now use get_dummies on the training and test data; both get the same set of columns
X_train = pd.get_dummies(X_train, columns=categorical_cols)
X_test = pd.get_dummies(X_test, columns=categorical_cols)
Answer by Thibault Clement
Adding the missing category in the test set:
# Get the columns that are in the training set but missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    test[c] = 0
# Ensure the columns in the test set are in the same order as in the training set
test = test[train.columns]
Notice that this code also removes any column that results from a category present in the test dataset but not in the training dataset.
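The same alignment can also be sketched with DataFrame.align, which is an alternative to the loop above rather than the answerer's code. The train and test frames here are hypothetical, already-encoded examples:

```python
import pandas as pd

# Hypothetical already-encoded train and test frames.
train = pd.DataFrame({'cat_a': [1, 0], 'cat_b': [0, 1], 'cat_c': [0, 0]})
test = pd.DataFrame({'cat_a': [1], 'cat_d': [0]})

# join='left' keeps train's columns; axis=1 aligns columns only.
# Columns missing from test are created and filled with 0.
train, test = train.align(test, join='left', axis=1, fill_value=0)
print(test)
```

As with the loop, the cat_d column, which exists only in the test frame, is dropped by this step.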