Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38514682/

Date: 2020-09-14 01:38:40  Source: igfitidea

How to use the output from OneHotEncoder in sklearn?

python pandas scikit-learn classification one-hot-encoding

Asked by Bert Carremans

I have a Pandas DataFrame with 2 categorical variables, an ID variable and a target variable (for classification). I managed to convert the categorical values with OneHotEncoder. This results in a sparse matrix.

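from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
# First I remapped the string values in the categorical variables to integers,
# as OneHotEncoder needs integers as input
# ... remapping code ...

# Learn the categories of both columns, then encode them (returns a sparse matrix)
ohe.fit(df[['col_a', 'col_b']])
ohe.transform(df[['col_a', 'col_b']])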

But I have no clue how to use this sparse matrix in a DecisionTreeClassifier, especially when I want to add some other non-categorical variables to my dataframe later on. Thanks!

EDIT: In reply to the comment of miraculixx: I also tried the DataFrameMapper in sklearn-pandas

from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    ('id_col', None),
    ('target_col', None),
    (['col_a'], OneHotEncoder()),
    (['col_b'], OneHotEncoder())
])

t = mapper.fit_transform(df)

But then I get this error:

TypeError: no supported conversion for types : (dtype('O'), dtype('int64'), dtype('float64'), dtype('float64')).

Answered by Guiem Bosch

I see you are already using Pandas, so why not use its get_dummies function?

import pandas as pd
df = pd.DataFrame([['rick','young'],['phil','old'],['john','teenager']],columns=['name','age-group'])

result

   name age-group
0  rick     young
1  phil       old
2  john  teenager

Now you encode with get_dummies:

pd.get_dummies(df)

result

   name_john  name_phil  name_rick  age-group_old  age-group_teenager  age-group_young
0          0          0          1              0                   0                1
1          0          1          0              1                   0                0
2          1          0          0              0                   1                0

You can then pass the new Pandas DataFrame directly to sklearn's DecisionTreeClassifier.
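As a rough sketch of that last step, the snippet below encodes the two categorical columns with get_dummies, keeps a numeric column alongside them, and fits a DecisionTreeClassifier on the result. The column names col_a, col_b, id_col and target_col follow the question; the values, the extra num_col column and the choice to drop id_col from the features are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: column names follow the question, values are made up.
df = pd.DataFrame({
    "id_col": [1, 2, 3, 4, 5, 6],
    "col_a": ["x", "y", "x", "z", "y", "z"],
    "col_b": ["u", "u", "v", "v", "u", "v"],
    "num_col": [0.5, 1.5, 0.7, 2.1, 1.4, 2.3],  # assumed non-categorical feature
    "target_col": [0, 1, 0, 1, 1, 1],
})

# Encode only the categorical columns; num_col passes through unchanged.
# dtype=int keeps the dummy columns as 0/1 integers.
features = pd.get_dummies(
    df.drop(columns=["id_col", "target_col"]),
    columns=["col_a", "col_b"],
    dtype=int,
)
target = df["target_col"]

# Fit the tree on the encoded DataFrame and predict on the same rows.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(features, target)
predictions = clf.predict(features)
```

New data must be encoded into the same dummy columns before calling predict, which get_dummies does not guarantee on its own; that is one reason to consider a fitted encoder instead.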

Answered by Merlin

Look at this example from scikit-learn: http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py

The problem is that you are not passing the sparse matrices to xx.fit(); you are passing the original data.
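A minimal sketch of what passing the sparse matrix to fit() could look like is shown here. It assumes a scikit-learn version whose OneHotEncoder accepts string columns directly, so no integer remapping is needed; the data values and the num_col column are made up, and scipy.sparse.hstack is just one possible way to append non-categorical features to the encoded block.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: column names follow the question, values are made up.
df = pd.DataFrame({
    "col_a": ["x", "y", "x", "z", "y", "z"],
    "col_b": ["u", "u", "v", "v", "u", "v"],
    "num_col": [0.5, 1.5, 0.7, 2.1, 1.4, 2.3],  # assumed non-categorical feature
    "target_col": [0, 1, 0, 1, 1, 1],
})

# One-hot encode the categorical columns; the result is a scipy sparse matrix.
ohe = OneHotEncoder()
encoded = ohe.fit_transform(df[["col_a", "col_b"]])

# Append the numeric column by stacking it next to the sparse block.
numeric = sparse.csr_matrix(df[["num_col"]].to_numpy())
X = sparse.hstack([encoded, numeric], format="csr")
y = df["target_col"]

# The classifier is fitted on the encoded matrix X, not on the original df.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
predictions = clf.predict(X)
```

Because the fitted encoder remembers its categories, ohe.transform can be reused on new rows to produce columns in the same order before calling clf.predict.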
