Multi-column one-hot encoding and naming the columns in pandas / sklearn

Question

Asked by Gideon Blinick

I have the following code to one-hot-encode 2 columns I have.

# encode city labels using one-hot encoding scheme
city_ohe = OneHotEncoder(categories='auto')
city_feature_arr = city_ohe.fit_transform(df[['city']]).toarray()
city_feature_labels = city_ohe.categories_
city_features = pd.DataFrame(city_feature_arr, columns=city_feature_labels)

phone_ohe = OneHotEncoder(categories='auto')
phone_feature_arr = phone_ohe.fit_transform(df[['phone']]).toarray()
phone_feature_labels = phone_ohe.categories_
phone_features = pd.DataFrame(phone_feature_arr, columns=phone_feature_labels)

What I'm wondering is how to do this in 4 lines while still getting properly named columns in the output. I can create a correctly one-hot-encoded array by passing both column names to fit_transform, but when I try to name the resulting DataFrame's columns, pandas reports a shape mismatch:

ValueError: Shape of passed values is (6, 50000), indices imply (3, 50000)

For background, both phone and city have 3 values.

    city    phone
0   CityA   iPhone
1   CityB   Android
2   CityB   iPhone
3   CityA   iPhone
4   CityC   Android

Answer 1

Answered by MaximeKan

You are almost there. As you said, you can pass all the columns you want to encode to fit_transform directly.

ohe = OneHotEncoder(categories='auto')
feature_arr = ohe.fit_transform(df[['phone','city']]).toarray()
feature_labels = ohe.categories_

And then you just need to do the following:

feature_labels = np.array(feature_labels).ravel()

This lets you name the columns the way you wanted:

features = pd.DataFrame(feature_arr, columns=feature_labels)

Answer 2

Answered by panktijk

Why don't you take a look at pd.get_dummies? Here's how you can encode:

df['city'] = df['city'].astype('category')
df['phone'] = df['phone'].astype('category')
df = pd.get_dummies(df)
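
As a rough end-to-end illustration of the `pd.get_dummies` route, the sketch below rebuilds the sample rows from the question as `df` (an assumption) and passes the columns to encode explicitly through the `columns` parameter, so any other columns in the frame would be left untouched. The result is stored in a new variable, `encoded`, rather than overwriting `df`.

```python
import pandas as pd

# Sample rows copied from the question (assumed stand-in for the real data).
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

# Encode both columns in one call; each new column is named
# "<original column>_<category>".
encoded = pd.get_dummies(df, columns=["city", "phone"])

print(encoded.columns.tolist())
```

Unlike `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so encoding new data later with the same column layout needs extra care (for example, reindexing to the original columns).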
