How to use OneHotEncoder for multiple columns and automatically drop the first dummy variable for each column?

Note: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/44601533/

Date: 2020-08-20 00:12:36  Source: igfitidea

python pandas machine-learning scikit-learn

Asked by Vijay

This is the dataset, with 3 columns and 3 rows:

Name    Organization    Department
Manie   ABC2            FINANCE
Joyce   ABC1            HR
Ami     NSV2            HR

This is the code I have so far. It works fine up to this point, but how do I drop the first dummy variable column for each encoded column?

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data1.csv',encoding = "cp1252")
X = dataset.values


# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

onehotencoder = OneHotEncoder(categorical_features = "all")
X = onehotencoder.fit_transform(X).toarray()

Answered by Max Power

import pandas as pd
df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']        
        })


df_2 = pd.get_dummies(df,drop_first=True)

test:

print(df_2)
   Dept_HR  Org_ABC2  Org_NSV2  name_Joyce  name_Manie
0        0         1         0           0           1
1        1         0         0           1           0
2        1         0         1           0           0 


UPDATE regarding your error with pd.get_dummies(X, columns=[1:]):

Per the documentation page, the columns parameter takes "Column Names". So the following code would work:

df_2 = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True)

output:

    name  Org_ABC2  Org_NSV2  Dept_HR
0  Manie         1         0        0
1  Joyce         0         0        1
2    Ami         0         1        1

If you really want to define your columns positionally, you could do it this way:

column_names_for_onehot = df.columns[1:]
df_2 = pd.get_dummies(df, columns=column_names_for_onehot, drop_first=True)
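
The two calls above can be checked against each other in one self-contained sketch. The dtype=int argument is an addition that is not in the original answer: it assumes a recent pandas, where the dummy columns default to booleans, and asks for 0/1 integers so the result looks like the output shown above. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']})

# Select the columns to encode by position (everything after 'name').
cols_by_position = list(df.columns[1:])

# dtype=int is an assumption added here: it requests 0/1 integers instead
# of the boolean dummies that recent pandas versions return by default.
by_position = pd.get_dummies(df, columns=cols_by_position,
                             drop_first=True, dtype=int)
by_name = pd.get_dummies(df, columns=['Org', 'Dept'],
                         drop_first=True, dtype=int)

print(by_position)
```

Both calls should give the same frame: the untouched name column first, followed by the dummies for Org and Dept with the first category of each removed.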

Answered by MD Rijwan

I use my own template for doing that:

from sklearn.base import TransformerMixin
import pandas as pd
import numpy as np


class DataFrameEncoder(TransformerMixin):
    """One-hot encode every object-dtype column of a DataFrame.

    fit() records the columns of data type object. transform() replaces
    them with dummy variables (the first dummy of each column is dropped)
    and concatenates the result with the remaining columns.
    """

    def fit(self, X, y=None):
        # Remember which columns hold object (string) data.
        self.object_col = []
        for col in X.columns:
            if X[col].dtype == np.dtype('O'):
                self.object_col.append(col)
        return self

    def transform(self, X, y=None):
        dummy_df = pd.get_dummies(X[self.object_col], drop_first=True)
        # Drop the original object columns by name, then put the dummies first.
        X = X.drop(self.object_col, axis=1)
        X = pd.concat([dummy_df, X], axis=1)
        return X

To use this code, save the template in the current directory, for example as customEncoder.py, and add this to your code:

from customEncoder import DataFrameEncoder
data = DataFrameEncoder().fit_transform(data)

All object-type columns are pulled out, one-hot encoded with the first dummy of each dropped, and joined back with the remaining columns to give the final desired output.
PS: The input to this template must be a pandas DataFrame.

Answered by Jyoti Prasad Pal

This is quite simple in scikit-learn starting from version 0.21. OneHotEncoder has a drop parameter that can be used to drop one of the categories per feature. By default (drop=None), no category is dropped. Details can be found in the documentation:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

# Drops the first category in each feature
ohe = OneHotEncoder(drop='first', handle_unknown='error')
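
As a rough end-to-end sketch of this approach, assuming scikit-learn 0.21 or later, the encoder can be fitted on the two categorical columns of the question's data. The DataFrame and the variable name encoded are illustrative only; the result is converted with toarray() because the encoder returns a sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']})

ohe = OneHotEncoder(drop='first')
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = ohe.fit_transform(df).toarray()

# categories_ holds the sorted categories found in each column;
# the first category of each column is the one that gets dropped.
print(ohe.categories_)
print(encoded)
```

With this data, Org contributes two columns (ABC2 and NSV2) and Dept contributes one (HR), since ABC1 and Finance sort first and are dropped.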

Answered by Robert

Encode the categorical variables one at a time. The dummy variables should go to the beginning index of your data set. Then, just cut off the first column like this:

X = X[:, 1:]

Then encode the next variable and repeat.
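
One possible reading of this procedure is sketched below. It is not the answerer's exact code: it assumes a recent scikit-learn in which OneHotEncoder accepts string columns directly, encodes each column on its own, slices off the first dummy column, and then stacks the pieces side by side instead of prepending them to the data set. The array X and the name encoded_blocks are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['ABC2', 'Finance'],
              ['ABC1', 'HR'],
              ['NSV2', 'HR']], dtype=object)

encoded_blocks = []
for i in range(X.shape[1]):
    # Encode a single column; the encoder expects 2-D input, hence X[:, [i]].
    dummies = OneHotEncoder().fit_transform(X[:, [i]]).toarray()
    # Cut off the first dummy column of this variable.
    encoded_blocks.append(dummies[:, 1:])

# Stack the per-variable blocks side by side.
X_encoded = np.hstack(encoded_blocks)
print(X_encoded)
```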

Answered by Joseph Puthumana

Applying OneHotEncoder to only certain columns is possible with ColumnTransformer. Create separate pipelines for the categorical and numerical variables and combine them with a ColumnTransformer. More information can be found in the ColumnTransformer documentation.

Another great example of an implementation of this is provided here.
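
A minimal sketch of the ColumnTransformer idea follows, under a few assumptions: the Age column is hypothetical and stands in for a numerical variable, the numerical column is simply passed through rather than given its own pipeline, and sparse_threshold=0 is set so that a dense array comes back.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR'],
                   'Age':  [30, 40, 50]})  # hypothetical numeric column

ct = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(drop='first'), ['Org', 'Dept'])],
    remainder='passthrough',  # leave the remaining columns untouched
    sparse_threshold=0,       # always return a dense array
)
X = ct.fit_transform(df)
print(X)
```

The encoded categorical columns come first in the output, followed by the passed-through Age column. A scaler or a full Pipeline could take the place of the passthrough for the numerical side.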

Answered by gökmen atakan türkmen

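from sklearn.preprocessing import LabelEncoder

# Y is assumed to be a 2-D object array whose columns are all categorical.
le = LabelEncoder()
for i in range(Y.shape[1]):
    # Replace each column with integer labels; this alone does not one-hot encode.
    Y[:, i] = le.fit_transform(Y[:, i])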