Original source: http://stackoverflow.com/questions/38514682/
Notice: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to use the output from OneHotEncoder in sklearn?
Asked by Bert Carremans
I have a pandas DataFrame with two categorical variables, an ID variable, and a target variable (for classification). I managed to convert the categorical values with OneHotEncoder. This results in a sparse matrix.
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
# First I remapped the string values in the categorical variables to integers,
# as OneHotEncoder needs integers as input
... remapping code ...
ohe.fit(df[['col_a', 'col_b']])
ohe.transform(df[['col_a', 'col_b']])
But I have no clue how to use this sparse matrix in a DecisionTreeClassifier, especially when I want to add some other non-categorical variables from my DataFrame later on. Thanks!
EDIT: In reply to the comment of miraculixx: I also tried the DataFrameMapper in sklearn-pandas:
mapper = DataFrameMapper([
('id_col', None),
('target_col', None),
(['col_a'], OneHotEncoder()),
(['col_b'], OneHotEncoder())
])
t = mapper.fit_transform(df)
But then I get this error:
TypeError: no supported conversion for types : (dtype('O'), dtype('int64'), dtype('float64'), dtype('float64')).
Answered by Guiem Bosch
I see you are already using pandas, so why not use its get_dummies function?
import pandas as pd
df = pd.DataFrame([['rick','young'],['phil','old'],['john','teenager']],columns=['name','age-group'])
Result:
   name age-group
0  rick     young
1  phil       old
2  john  teenager
Now you encode with get_dummies:
pd.get_dummies(df)
Result:
   name_john  name_phil  name_rick  age-group_old  age-group_teenager  \
0          0          0          1              0                   0
1          0          1          0              1                   0
2          1          0          0              0                   1

   age-group_young
0                1
1                0
2                0
You can then pass the new pandas DataFrame directly to sklearn's DecisionTreeClassifier.
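To make that last step concrete, here is a minimal sketch of feeding a get_dummies result to a DecisionTreeClassifier. The column names follow the question (id_col, col_a, col_b, target_col), but the data values and the extra numeric column num_col are made up for illustration. The ID and target columns are dropped from the features, which is an assumption about what the asker wants.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: an ID, two categorical columns,
# one numeric column (num_col is an assumed name), and a target.
df = pd.DataFrame({
    "id_col": [1, 2, 3, 4, 5, 6],
    "col_a": ["x", "y", "x", "z", "y", "z"],
    "col_b": ["u", "u", "v", "v", "u", "v"],
    "num_col": [0.5, 1.2, 0.3, 2.0, 1.1, 1.9],
    "target_col": [0, 1, 0, 1, 1, 1],
})

# One-hot encode only the categorical columns; the remaining
# columns are carried over unchanged. dtype=int yields 0/1 columns.
encoded = pd.get_dummies(df, columns=["col_a", "col_b"], dtype=int)

# Keep the ID and the target out of the feature matrix.
X = encoded.drop(columns=["id_col", "target_col"])
y = encoded["target_col"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
predictions = clf.predict(X)
```

Because get_dummies returns an ordinary dense DataFrame, non-categorical columns such as num_col simply sit next to the encoded ones, so no separate stacking step is needed.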
Answered by Merlin
Look at this example from scikit-learn: http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py
The problem is that you are not passing the sparse matrix to xx.fit(). You are passing the original data.
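As a rough sketch of what passing the sparse matrix to fit() could look like: the encoder output can be stacked with other (non-categorical) columns using scipy.sparse.hstack and handed to DecisionTreeClassifier, which accepts sparse input. The data values and the numeric column num_col are hypothetical; the other column names come from the question. Note that recent scikit-learn versions let OneHotEncoder take string categories directly, so the integer remapping step is assumed unnecessary here.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; num_col is an assumed extra numeric feature.
df = pd.DataFrame({
    "id_col": [1, 2, 3, 4, 5, 6],
    "col_a": ["x", "y", "x", "z", "y", "z"],
    "col_b": ["u", "u", "v", "v", "u", "v"],
    "num_col": [0.5, 1.2, 0.3, 2.0, 1.1, 1.9],
    "target_col": [0, 1, 0, 1, 1, 1],
})

# Encode the categorical columns; the default output is a scipy sparse matrix.
ohe = OneHotEncoder()
X_cat = ohe.fit_transform(df[["col_a", "col_b"]])

# Convert the numeric column(s) to sparse and stack them next to the encoded ones.
X_num = sparse.csr_matrix(df[["num_col"]].to_numpy())
X = sparse.hstack([X_cat, X_num], format="csr")

# Fit on the stacked sparse matrix, not on the original DataFrame.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, df["target_col"])
predictions = clf.predict(X)
```

The same fitted encoder (ohe.transform, not fit_transform) would be applied to any new data before calling predict, so that the columns line up with those seen during training.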