pandas: How to use the output from OneHotEncoder in sklearn?

Question

Asked by Bert Carremans

I have a Pandas DataFrame with 2 categorical variables, an ID variable and a target variable (for classification). I managed to convert the categorical values with OneHotEncoder. This results in a sparse matrix.

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
# First I remapped the string values in the categorical variables to integers,
# as OneHotEncoder needs integers as input
... remapping code ...

ohe.fit(df[['col_a', 'col_b']])
ohe.transform(df[['col_a', 'col_b']])  # returns a sparse matrix

But I have no clue how to use this sparse matrix in a DecisionTreeClassifier, especially when I want to add some other non-categorical variables to my dataframe later on. Thanks!

EDIT: In reply to miraculixx's comment: I also tried the DataFrameMapper from sklearn-pandas.

from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    ('id_col', None),
    ('target_col', None),
    (['col_a'], OneHotEncoder()),
    (['col_b'], OneHotEncoder())
])

t = mapper.fit_transform(df)

But then I get this error:

TypeError: no supported conversion for types : (dtype('O'), dtype('int64'), dtype('float64'), dtype('float64')).

Answer 1

Answered by Guiem Bosch

I see you are already using Pandas, so why not use its get_dummies function?

import pandas as pd
df = pd.DataFrame([['rick','young'],['phil','old'],['john','teenager']],columns=['name','age-group'])

Result:

   name age-group
0  rick     young
1  phil       old
2  john  teenager

Now encode it with get_dummies:

pd.get_dummies(df)

Result:

   name_john  name_phil  name_rick  age-group_old  age-group_teenager  age-group_young
0          0          0          1              0                   0                1
1          0          1          0              1                   0                0
2          1          0          0              0                   1                0

You can then use the new Pandas DataFrame directly in sklearn's DecisionTreeClassifier.
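As a minimal sketch of that last step, the dummy-encoded frame can be passed straight to the classifier. The `target` column and its values are made up for illustration and are not part of the original example; `dtype=int` is passed only so the output is 0/1 integers regardless of the pandas version.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the 'target' column and its values are assumptions.
df = pd.DataFrame(
    [['rick', 'young', 1], ['phil', 'old', 0], ['john', 'teenager', 1]],
    columns=['name', 'age-group', 'target'],
)

# Encode only the categorical columns; dtype=int yields 0/1 instead of booleans.
X = pd.get_dummies(df[['name', 'age-group']], dtype=int)
y = df['target']

# Fit the tree on the encoded DataFrame and predict on the same rows.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
predictions = clf.predict(X)
```

Any additional numeric columns can be joined to `X` with `pd.concat([X, df[[...]]], axis=1)` before fitting, since everything stays an ordinary DataFrame.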

Answer 2

Answered by Merlin

Look at this example from scikit-learn: http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py

The problem is that you are not passing the sparse matrices to xx.fit(); you are using the original data.
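A rough sketch of what passing the encoded matrix to fit() could look like is given below. The column names follow the question, but the data values and the extra `num_col` column are assumptions added for illustration. The numeric column is stacked onto the one-hot block with `scipy.sparse.hstack`, which is one possible way to combine them, not necessarily the one the linked example uses.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data; values and 'num_col' are assumptions for illustration.
df = pd.DataFrame({
    'col_a': [0, 1, 2, 1],
    'col_b': [1, 0, 1, 0],
    'num_col': [0.5, 1.5, 2.5, 3.5],
    'target_col': [0, 1, 0, 1],
})

# One-hot encode the categorical columns; the default output is sparse.
ohe = OneHotEncoder()
X_cat = ohe.fit_transform(df[['col_a', 'col_b']])

# Append the non-categorical column to the encoded block, keeping it sparse.
X_num = sparse.csr_matrix(df[['num_col']].to_numpy())
X = sparse.hstack([X_cat, X_num], format='csr')

# Fit on the encoded matrix rather than on the original DataFrame.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, df['target_col'])
predictions = clf.predict(X)
```

The ID column is left out on purpose: it identifies rows and carries no signal, and the target column is passed separately as `y` rather than as a feature.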
