Disclaimer: this page comes from a Chinese-English translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/55229301/

Time: 2020-09-14 06:21:19  Source: igfitidea

One-hot-encoding multiple columns in sklearn and naming columns

python python-3.x pandas scikit-learn one-hot-encoding

Asked by Gideon Blinick

I have the following code to one-hot-encode 2 columns I have.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# encode city labels using one-hot encoding scheme
city_ohe = OneHotEncoder(categories='auto')
city_feature_arr = city_ohe.fit_transform(df[['city']]).toarray()
city_feature_labels = city_ohe.categories_
city_features = pd.DataFrame(city_feature_arr, columns=city_feature_labels)

phone_ohe = OneHotEncoder(categories='auto')
phone_feature_arr = phone_ohe.fit_transform(df[['phone']]).toarray()
phone_feature_labels = phone_ohe.categories_
phone_features = pd.DataFrame(phone_feature_arr, columns=phone_feature_labels)

What I'm wondering is how to do this in 4 lines while getting properly named columns in the output. That is, I can create a properly one-hot-encoded array by including both column names in fit_transform, but when I try to name the resulting dataframe's columns, it tells me that there is a mismatch between the shape of the indices:

ValueError: Shape of passed values is (6, 50000), indices imply (3, 50000)

For background, both phone and city have 3 values.

    city    phone
0   CityA   iPhone
1   CityB   Android
2   CityB   iPhone
3   CityA   iPhone
4   CityC   Android
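
A likely reason for the mismatch is that categories_ is a list holding one array of labels per input column, not a flat list with one label per encoded column. The following is a minimal sketch on the five sample rows only (where phone happens to show just two values); the variable names are illustrative and not from the original code.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The five sample rows shown in the question (the real frame is much larger).
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

ohe = OneHotEncoder(categories="auto")
arr = ohe.fit_transform(df[["city", "phone"]]).toarray()

# One encoded column per distinct value across both input columns.
n_encoded_columns = arr.shape[1]
# categories_ holds one array of labels per *input* column.
n_label_arrays = len(ohe.categories_)
labels_per_column = [len(c) for c in ohe.categories_]

print(n_encoded_columns, n_label_arrays, labels_per_column)
```

Passing categories_ straight to columns= therefore hands pandas a nested structure whose length does not match the number of encoded columns.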

Answered by MaximeKan

You are almost there... As you said, you can pass all the columns you want to encode to fit_transform directly.

ohe = OneHotEncoder(categories='auto')
feature_arr = ohe.fit_transform(df[['phone','city']]).toarray()
feature_labels = ohe.categories_

And then you just need to do the following:

feature_labels = np.array(feature_labels).ravel()

Which enables you to name your columns like you wanted:

features = pd.DataFrame(feature_arr, columns=feature_labels)
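
Note that np.array(...).ravel() only flattens cleanly when every encoded column has the same number of categories, as in the question (3 and 3). A sketch of a variant that does not rely on that, using np.concatenate and, optionally, prefixing each label with its source column, might look like the following. The sample frame and the names cols, flat_labels and prefixed_labels are assumptions for illustration. Recent scikit-learn releases also offer get_feature_names_out on the encoder, which is worth checking against your installed version.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The five sample rows from the question; phone shows only 2 values here.
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

cols = ["phone", "city"]
ohe = OneHotEncoder(categories="auto")
feature_arr = ohe.fit_transform(df[cols]).toarray()

# Join the per-column label arrays end to end; unequal lengths are fine.
flat_labels = np.concatenate(ohe.categories_)

# Optional: prefix each label with its column name to avoid name clashes.
prefixed_labels = [
    f"{col}_{cat}"
    for col, cats in zip(cols, ohe.categories_)
    for cat in cats
]

features = pd.DataFrame(feature_arr, columns=prefixed_labels)
print(features.columns.tolist())
```

The prefixes are a design choice rather than a requirement: they keep columns distinguishable if two input columns share a category value.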

Answered by panktijk

Why don't you take a look at pd.get_dummies? Here's how you can encode:

df['city'] = df['city'].astype('category')
df['phone'] = df['phone'].astype('category')
df = pd.get_dummies(df)
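
For reference, here is a small self-contained sketch of the get_dummies approach on the five sample rows from the question. It passes columns= explicitly, which is an assumption made here so the call does not depend on dtype inference; the name dummies is illustrative.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

# Each listed column is replaced by one indicator column per distinct value,
# named "<column>_<value>".
dummies = pd.get_dummies(df, columns=["city", "phone"])

print(dummies.columns.tolist())
```

The indicator dtype differs between pandas versions (integer or boolean), so cast with astype(int) if downstream code needs 0/1 values.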