Original URL: http://stackoverflow.com/questions/44601533/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to use OneHotEncoder for multiple columns and automatically drop first dummy variable for each column?
Asked by Vijay
This is the dataset, with 3 columns and 3 rows:
Name   Organization  Department
Manie  ABC2          FINANCE
Joyce  ABC1          HR
Ami    NSV2          HR
This is the code I have (below). It works fine up to this point, but how do I drop the first dummy variable column for each encoded column?
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data1.csv',encoding = "cp1252")
X = dataset.values
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features = "all")
X = onehotencoder.fit_transform(X).toarray()
Answered by Max Power
import pandas as pd
df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org': ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']
                   })
df_2 = pd.get_dummies(df, drop_first=True)
test:
print(df_2)
   Dept_HR  Org_ABC2  Org_NSV2  name_Joyce  name_Manie
0        0         1         0           0           1
1        1         0         0           1           0
2        1         0         1           0           0
UPDATE regarding your error with pd.get_dummies(X, columns =[1:]:
Per the documentation page, the columns parameter takes "Column Names", and [1:] on its own is a slice with nothing to slice, which is not valid Python. So the following code would work:
df_2 = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True)
output:
    name  Org_ABC2  Org_NSV2  Dept_HR
0  Manie         1         0        0
1  Joyce         0         0        1
2    Ami         0         1        1
If you really want to define your columns positionally, you could do it this way:
column_names_for_onehot = df.columns[1:]
df_2 = pd.get_dummies(df, columns=column_names_for_onehot, drop_first=True)
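One caveat for recent pandas releases: since pandas 2.0, get_dummies returns boolean (True/False) columns by default, so reproducing the 0/1 output shown above needs an explicit dtype. A minimal sketch, assuming the same three-row frame as in this answer (the column order in the dict is my own choice):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Manie', 'Joyce', 'Ami'],
    'Org': ['ABC2', 'ABC1', 'NSV2'],
    'Dept': ['Finance', 'HR', 'HR'],
})

# Select every column except the first one by position.
column_names_for_onehot = df.columns[1:]

# dtype=int forces 0/1 integers instead of the boolean default of pandas >= 2.0.
df_2 = pd.get_dummies(df, columns=column_names_for_onehot,
                      drop_first=True, dtype=int)
print(df_2)
```

The name column is left untouched because it is not listed in columns.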
Answered by MD Rijwan
I use my own template for doing that:
from sklearn.base import TransformerMixin
import pandas as pd
import numpy as np


class DataFrameEncoder(TransformerMixin):
    """Encode the object-dtype columns of a DataFrame.

    fit() collects the columns of dtype object. transform() replaces them
    with dummy variables (first level of each dropped) and concatenates
    the result with the remaining columns.
    """

    def fit(self, X, y=None):
        # Remember which columns hold object (string) data.
        self.object_col = [col for col in X.columns
                           if X[col].dtype == np.dtype('O')]
        return self

    def transform(self, X, y=None):
        # One-hot encode the object columns, dropping the first level of each.
        dummy_df = pd.get_dummies(X[self.object_col], drop_first=True)
        # Drop the original object columns by name, then join the two parts.
        X = X.drop(columns=self.object_col)
        return pd.concat([dummy_df, X], axis=1)
To use this code, save the template in the current directory under a filename such as customEncoder.py, then put this in your code:
from customEncoder import DataFrameEncoder
data = DataFrameEncoder().fit_transform(data)
All object-type columns are then removed, one-hot encoded with the first dummy of each dropped, and joined back with the remaining columns to give the final desired output.
PS: The input to this template must be a pandas DataFrame.
Answered by Jyoti Prasad Pal
This is quite simple in scikit-learn starting from version 0.21. You can use the drop parameter of OneHotEncoder to drop one of the categories per feature. By default, nothing is dropped. Details can be found in the documentation.
from sklearn.preprocessing import OneHotEncoder
# drops the first category in each feature
ohe = OneHotEncoder(drop='first', handle_unknown='error')
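As a rough sketch of how this might be applied to the question's data (assuming scikit-learn 0.21 or later, and rebuilding the three rows as a DataFrame by hand rather than reading Data1.csv), the encoder can be fitted on the string columns directly, with no LabelEncoder step:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The question's three rows, typed in by hand for illustration.
df = pd.DataFrame({
    'name': ['Manie', 'Joyce', 'Ami'],
    'Org': ['ABC2', 'ABC1', 'NSV2'],
    'Dept': ['FINANCE', 'HR', 'HR'],
})

# drop='first' removes the first (alphabetically smallest) category per column.
ohe = OneHotEncoder(drop='first')

# fit_transform returns a sparse matrix by default; convert it for display.
X = ohe.fit_transform(df).toarray()

print(ohe.categories_)  # categories found in each column, before dropping
print(X)
```

Each row of X holds the dummies for name, Org and Dept in that order, with the first category of each column left out.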
Answered by Robert
Encode the categorical variables one at a time. The dummy variables for the encoded column should end up in the first columns of your data set. Then just cut off the first column like this:
X = X[:, 1:]
Then repeat the encoding for the next variable.
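The loop described in this answer can be sketched without any scikit-learn call. The following is only an illustration of the idea (encode one column, cut off its first dummy column, move on to the next). The helper name one_hot_drop_first is made up for this example, and it uses NumPy in place of the OneHotEncoder call the answer presumably has in mind:

```python
import numpy as np

def one_hot_drop_first(column):
    """One-hot encode a 1-D array of labels and drop the first dummy column."""
    # categories is sorted; codes gives each row's index into categories.
    categories, codes = np.unique(column, return_inverse=True)
    dummies = np.eye(len(categories), dtype=int)[codes]
    # Same slicing as in the answer: cut off the first column.
    return dummies[:, 1:]

# Organization and Department columns from the question.
X = np.array([['ABC2', 'FINANCE'],
              ['ABC1', 'HR'],
              ['NSV2', 'HR']])

# Encode the columns one at a time and stack the results side by side.
encoded = np.hstack([one_hot_drop_first(X[:, i]) for i in range(X.shape[1])])
print(encoded)
```

Here the result has three columns: two dummies for Organization (ABC1 dropped) and one for Department (FINANCE dropped).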
Answered by Joseph Puthumana
Applying OneHotEncoder to only certain columns is possible with ColumnTransformer. Create separate pipelines for the categorical and numerical variables and combine them with ColumnTransformer. More information can be found in the ColumnTransformer documentation.
Another example implementation was linked in the original answer.
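A minimal sketch of that approach, under some assumptions: the Salary column is hypothetical (the question's data has no numeric column), the numeric column is simply passed through instead of getting its own pipeline, and drop='first' requires scikit-learn 0.21 or later:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'Org': ['ABC2', 'ABC1', 'NSV2'],
    'Dept': ['FINANCE', 'HR', 'HR'],
    'Salary': [50, 60, 70],  # hypothetical numeric column, not in the question
})

ct = ColumnTransformer(
    # One-hot encode only the categorical columns, dropping the first category.
    transformers=[('onehot', OneHotEncoder(drop='first'), ['Org', 'Dept'])],
    # Leave every other column (here: Salary) unchanged.
    remainder='passthrough',
    # Always return a dense array rather than a sparse matrix.
    sparse_threshold=0,
)

X = ct.fit_transform(df)
print(X)
```

The encoded columns come first in the output, followed by the passed-through columns.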
Answered by gökmen atakan türkmen
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in range(Y.shape[1]):
    Y[:, i] = le.fit_transform(Y[:, i])