Original question: http://stackoverflow.com/questions/58101126/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Using Scikit-Learn OneHotEncoder with a Pandas DataFrame
Asked by dd.
I'm trying to replace a column of strings in a Pandas DataFrame with its one-hot encoded equivalent using Scikit-Learn's OneHotEncoder. My code below doesn't work:
from sklearn.preprocessing import OneHotEncoder
# data is a Pandas DataFrame
jobs_encoder = OneHotEncoder()
jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))
It produces the following error (strings in the list are omitted):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-91-3a1f568322f5> in <module>()
3 jobs_encoder = OneHotEncoder()
4 jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
----> 5 data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))
/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
730 copy=True)
731 else:
--> 732 return self._transform_new(X)
733
734 def inverse_transform(self, X):
/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
678 """New implementation assuming categorical input"""
679 # validation of X happens in _check_X called by _transform
--> 680 X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
681
682 n_samples, n_features = X_int.shape
/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
120 msg = ("Found unknown categories {0} in column {1}"
121 " during transform".format(diff, i))
--> 122 raise ValueError(msg)
123 else:
124 # Set the problematic rows to an acceptable value and
ValueError: Found unknown categories ['...', ..., '...'] in column 0 during transform
Here's some sample data:
data['Profession'] =
0 unkn
1 safe
2 rece
3 unkn
4 lead
...
111988 indu
111989 seni
111990 mess
111991 seni
111992 proj
Name: Profession, Length: 111993, dtype: object
What exactly am I doing wrong?
Accepted answer by dd.
It turned out that Scikit-Learn's LabelBinarizer gave me better luck converting the data to one-hot encoded format. With help from Amine's solution, my final code is as follows:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
jobs_encoder = LabelBinarizer()
jobs_encoder.fit(data['Profession'])
transformed = jobs_encoder.transform(data['Profession'])
ohe_df = pd.DataFrame(transformed)
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)
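One side effect of the code above is that the new columns are named 0, 1, 2, and so on. A minimal sketch of one way to get readable names, assuming a small made-up frame in place of the real data (the Age column and the Profession_ prefix are illustrative choices, not part of the original answer):

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy frame standing in for the real `data`
data = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn", "lead"],
    "Age": [31, 45, 28, 39, 52],
})

jobs_encoder = LabelBinarizer()
transformed = jobs_encoder.fit_transform(data["Profession"])

# classes_ holds the sorted category labels, one per output column.
# Caveat: with exactly two classes LabelBinarizer returns a single column,
# so this naming scheme assumes three or more categories.
ohe_df = pd.DataFrame(
    transformed,
    columns=[f"Profession_{c}" for c in jobs_encoder.classes_],
    index=data.index,  # reuse the original index so concat aligns row by row
)
encoded = pd.concat([data.drop(columns=["Profession"]), ohe_df], axis=1)
print(encoded)
```

Passing index=data.index matters when the DataFrame does not have a default 0..n-1 index (for example after filtering rows); otherwise pd.concat aligns on mismatched labels and introduces NaN rows.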
Answer by Amine
OneHotEncoder encodes categorical features (integers or strings) as a one-hot numeric array. Its transform method returns a sparse matrix if sparse=True, otherwise a 2-D array (the parameter is named sparse_output from scikit-learn 1.2 onward). You can't assign a 2-D array (or sparse matrix) to a single Pandas Series. You must create one Pandas Series (a column in a Pandas DataFrame) for each category.
I would recommend using pandas.get_dummies instead:
data = pd.get_dummies(data,prefix=['Profession'], columns = ['Profession'], drop_first=True)
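To see what this call does, here is a small sketch on made-up data (the Age column and the five sample rows are assumptions for illustration). It compares the result with and without drop_first=True:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real `data`
data = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn", "lead"],
    "Age": [31, 45, 28, 39, 52],
})

# Same call as above: the first category (alphabetically "lead") is dropped
dummies = pd.get_dummies(data, prefix=["Profession"], columns=["Profession"], drop_first=True)

# Without drop_first every category gets its own column
full = pd.get_dummies(data, prefix=["Profession"], columns=["Profession"])

print(list(dummies.columns))
print(list(full.columns))
```

Dropping the first level avoids perfectly collinear columns, which some linear models prefer, but it also means a row of all zeros stands for the dropped category. Leave drop_first at its default if you need one column per category.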
EDIT:
Using Sklearn OneHotEncoder:
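import pandas as pd
from sklearn.preprocessing import OneHotEncoder
jobs_encoder = OneHotEncoder()
# Fit and transform on a column vector of shape (n_samples, 1)
transformed = jobs_encoder.fit_transform(data['Profession'].to_numpy().reshape(-1, 1))
# Create a Pandas DataFrame of the one-hot encoded column (densify the sparse result first);
# get_feature_names_out() needs scikit-learn >= 1.0, older releases use get_feature_names()
ohe_df = pd.DataFrame(transformed.toarray(), columns=jobs_encoder.get_feature_names_out(), index=data.index)
# Concatenate with the original data and drop the source column
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)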
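As for why the code in the question fails: the likely cause is the shape passed to fit. reshape(1, -1) produces a single row with one column per distinct value, so the encoder learns many features with one category each instead of one feature with many categories. A minimal sketch on made-up values (the five sample strings are an assumption, and the exact error message differs between scikit-learn versions):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

professions = pd.Series(["unkn", "safe", "rece", "unkn", "lead"], name="Profession")

# As in the question: one row, one column per distinct value
wrong_shape = professions.unique().reshape(1, -1)
wrong_encoder = OneHotEncoder().fit(wrong_shape)

# One row per sample and a single feature column
right_shape = professions.to_numpy().reshape(-1, 1)
right_encoder = OneHotEncoder().fit(right_shape)

# categories_ has one entry per feature the encoder believes it saw
n_features_wrong = len(wrong_encoder.categories_)
n_features_right = len(right_encoder.categories_)
print(wrong_shape.shape, n_features_wrong)
print(right_shape.shape, n_features_right)

# With the correct fit, transform yields one column per category
encoded = right_encoder.transform(right_shape).toarray()
print(encoded.shape)
```

With the wrong shape, column 0 knows only the first distinct value, so every other value looks like an unknown category during transform, which matches the ValueError in the question.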
Other options: If you are doing hyperparameter tuning with GridSearchCV, it's recommended to use ColumnTransformer and FeatureUnion with Pipeline, or to use make_column_transformer directly.
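A rough sketch of what that could look like with make_column_transformer. The toy data, the invented Hired target column, the LogisticRegression model, and the parameter grid are all assumptions chosen for illustration, not part of the original answer:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the target column "Hired" is invented for this example
data = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn", "lead", "safe", "rece", "lead"],
    "Age": [31, 45, 28, 39, 52, 36, 41, 25],
    "Hired": [0, 1, 0, 1, 1, 0, 1, 0],
})
X, y = data[["Profession", "Age"]], data["Hired"]

# One-hot encode only "Profession"; pass the remaining columns through unchanged.
# handle_unknown="ignore" avoids errors when a CV fold contains an unseen category.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Profession"]),
    remainder="passthrough",
)
pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
search = GridSearchCV(pipeline, param_grid={"logisticregression__C": [0.1, 1.0]}, cv=2)
search.fit(X, y)
print(search.best_params_)
```

Because the encoder lives inside the pipeline, it is refit on the training part of each cross-validation split, so categories from the held-out fold never leak into the encoding.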