pandas: dummy variables when not all categories are present

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original Stack Overflow source: http://stackoverflow.com/questions/37425961/

Time: 2020-09-14 01:17:10  Source: igfitidea

Dummy variables when not all categories are present

Tags: python, pandas, machine-learning, dummy-variable

Asked by Berne

I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies.

What happens is that get_dummies looks at the data available in each dataframe to find out how many categories there are, and thus creates the appropriate number of dummy variables. However, in the problem I'm working on right now, I actually know in advance what the possible categories are. But when looking at each dataframe individually, not all categories necessarily appear.

My question is: is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Something that would make this:

categories = ['a', 'b', 'c']

   cat
1   a
2   b
3   a

Become this:

   cat_a  cat_b  cat_c
1      1      0      0
2      0      1      0
3      1      0      0

Accepted answer by piRSquared

Using transpose and reindex

import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

dummies = pd.get_dummies(df, prefix='', prefix_sep='')
dummies = dummies.T.reindex(cats).T.fillna(0)

print(dummies)

    a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0
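
The same result can probably be reached without transposing, by reindexing the columns directly. The following is a minimal sketch, not part of the original answer; it assumes a pandas version in which get_dummies accepts the dtype argument, and it reuses the cats and df names from the snippet above.

```python
import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

# Encode only the categories that actually occur in this frame.
dummies = pd.get_dummies(df['cat'], dtype=int)

# Reindex the columns against the full, known category list;
# categories that never occurred become all-zero columns.
dummies = dummies.reindex(columns=cats, fill_value=0)

print(dummies)
```

Passing fill_value=0 avoids the intermediate NaN values, so the columns stay integer rather than being converted to float.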

Answer by T.C. Proctor

TL;DR: pd.get_dummies(cat.astype(pd.CategoricalDtype(categories=categories)))

  • Older pandas: pd.get_dummies(cat.astype('category', categories=categories))

is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Yes, there is! Pandas has a special type of Series just for categorical data. One of the attributes of this series is the set of possible categories, which get_dummies takes into account. Here's an example:

In [1]: import pandas as pd

In [2]: possible_categories = list('abc')

In [3]: cat = pd.Series(list('aba'))

In [4]: cat = cat.astype(pd.CategoricalDtype(categories=possible_categories))

In [5]: cat
Out[5]: 
0    a
1    b
2    a
dtype: category
Categories (3, object): [a, b, c]

Then, get_dummies will do exactly what you want!

In [6]: pd.get_dummies(cat)
Out[6]: 
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0

There are a bunch of other ways to create a categorical Series or DataFrame; this is just the one I find most convenient. You can read about all of them in the pandas documentation.

EDIT:

I haven't followed the exact versioning, but there was a bug in how pandas treated sparse matrices, at least until version 0.17.0. It was corrected by version 0.18.1 (released May 2016).

For version 0.17.0, if you try to do this with the sparse=True option on a DataFrame, the column of zeros for the missing dummy variable will be a column of NaN, and it will be converted to dense.

It looks like pandas 0.21.0 added CategoricalDtype, and creating categoricals that explicitly include the categories, as in the original answer, was deprecated; I'm not quite sure when.

Answer by Kapil Sharma

Try this:

In[1]: import pandas as pd
       cats = ["a", "b", "c"]

In[2]: df = pd.DataFrame({"cat": ["a", "b", "a"]})

In[3]: pd.concat((pd.get_dummies(df.cat, columns=cats), pd.DataFrame(columns=cats))).fillna(0)
Out[3]: 
     a    b    c
0  1.0  0.0  0
1  0.0  1.0  0
2  1.0  0.0  0

Answer by Stefan

I don't think get_dummies provides this out of the box; it only allows for creating an extra column that highlights NaN values.

To add the missing columns yourself, you could use pd.concat along axis=0 to vertically 'stack' the DataFrames (the dummy columns plus a DataFrame id) and automatically create any missing columns, use fillna(0) to replace missing values, and then use .groupby('id') to separate the various DataFrames again.
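
The steps described above could look roughly like the following sketch. The frames dictionary, its keys, and the cat column are hypothetical names made up for illustration; the answer itself does not include code.

```python
import pandas as pd

# Hypothetical example frames; each one sees only some of the categories.
frames = {
    "first": pd.DataFrame({"cat": ["a", "b", "a"]}),
    "second": pd.DataFrame({"cat": ["c", "a"]}),
}

# Encode each frame separately and tag its rows with the frame's id.
encoded = [
    pd.get_dummies(frame["cat"], prefix="cat", dtype=float).assign(id=name)
    for name, frame in frames.items()
]

# Stacking along axis=0 takes the union of the columns; the gaps are NaN
# until fillna(0) replaces them.
stacked = pd.concat(encoded, axis=0).fillna(0)

# Split back into one frame per id and drop the helper column.
result = {
    name: group.drop(columns="id")
    for name, group in stacked.groupby("id")
}

print(result["first"])
print(result["second"])
```

Note that this only creates columns for categories that occur in at least one of the stacked frames; a category that appears nowhere still gets no column.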

Answer by andre

I did ask this on the pandas GitHub. It turns out it is really easy to get around this when you define the column as a Categorical in which you list all the possible categories.

df['col'] = pd.Categorical(df['col'], categories=['a', 'b', 'c', 'd'])

get_dummies() will then do the rest as expected.
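
As a small end-to-end sketch of this approach (the sample data is made up, and dtype=int is only there to get 0/1 output on pandas versions where get_dummies would otherwise return booleans):

```python
import pandas as pd

# Made-up sample frame in which only 'a' and 'b' occur.
df = pd.DataFrame({'col': ['a', 'b', 'a']})

# Declare every possible category up front.
df['col'] = pd.Categorical(df['col'], categories=['a', 'b', 'c', 'd'])

# get_dummies creates one column per declared category, used or not.
dummies = pd.get_dummies(df, columns=['col'], dtype=int)

print(dummies)
```

Values that are not in the declared categories become NaN in the Categorical, so the corresponding rows end up with zeros in every dummy column.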

Answer by Rudr

As suggested by others, converting your categorical features to the 'category' data type should resolve the unseen-label issue when using get_dummies.

# Your data frame (df) is assumed to exist and to have a 'label' column
import pandas as pd
from sklearn.model_selection import train_test_split

X = df.loc[:, df.columns != 'label']
Y = df.loc[:, df.columns == 'label']

# Split the data into 70% training and 30% test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

# Convert the categorical (object) columns to 'category' dtype, using the
# categories seen in the full data frame so that both splits share them.
# (astype('category', categories=...) no longer works in current pandas.)
cat_cols = list(X.select_dtypes(include=['object']).columns)
for col in cat_cols:
    cat_dtype = pd.CategoricalDtype(categories=df[col].dropna().unique())
    X_train[col] = X_train[col].astype(cat_dtype)
    X_test[col] = X_test[col].astype(cat_dtype)

# Now get_dummies on the training and test data yields the same set of columns
X_train = pd.get_dummies(X_train, columns=cat_cols)
X_test = pd.get_dummies(X_test, columns=cat_cols)

Answer by Thibault Clement

Adding the missing category in the test set:

# Get the columns present in the training set but missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    test[c] = 0
# Ensure the columns in the test set are in the same order as in the training set
test = test[train.columns]

Notice that this code also removes columns that result from categories present in the test dataset but not in the training dataset.
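
The loop and the final column selection can likely be collapsed into a single reindex call. This is a sketch rather than part of the original answer; the train and test frames below are made-up, already-encoded examples.

```python
import pandas as pd

# Made-up encoded frames: test lacks cat_b and has an extra cat_z.
train = pd.DataFrame({"cat_a": [1, 0], "cat_b": [0, 1]})
test = pd.DataFrame({"cat_a": [1], "cat_z": [1]})

# Align test to the training columns: missing columns are added as 0,
# columns unknown to the training set are dropped, and the order matches.
test = test.reindex(columns=train.columns, fill_value=0)

print(test)
```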