Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/37292872/

Time: 2020-08-19 19:11:42  Source: igfitidea

How can I one hot encode in Python?

Tags: python, pandas, machine-learning, anaconda, one-hot-encoding

Asked by avicohen

I have a machine learning classification problem in which 80% of the variables are categorical. Must I use one hot encoding if I want to use a classifier? Can I pass the data to a classifier without the encoding?

I am trying to do the following for feature selection:

  1. I read the train file:

    import pandas as pd

    num_rows_to_read = 10000
    train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)
    
  2. I change the type of the categorical features to 'category':

    non_categorial_features = ['orig_destination_distance',
                              'srch_adults_cnt',
                              'srch_children_cnt',
                              'srch_rm_cnt',
                              'cnt']
    
    for categorical_feature in list(train_small.columns):
        if categorical_feature not in non_categorial_features:
            train_small[categorical_feature] = train_small[categorical_feature].astype('category')
    
  3. I use one hot encoding:

    train_small_with_dummies = pd.get_dummies(train_small, sparse=True)
    
The problem is that the third step often gets stuck, even though I am using a powerful machine.

As a result, without the one hot encoding I can't do any feature selection to determine the importance of the features.

What do you recommend?

Answered by Sayali Sonawane

Approach 1: You can use get_dummies on a pandas DataFrame.

Example 1:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]: 
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
3  1.0  0.0  0.0

Example 2:

The following transforms a given column into one-hot columns. Use the prefix argument to tell the dummy columns apart when you encode more than one column.

import pandas as pd

df = pd.DataFrame({
          'A':['a','b','a'],
          'B':['b','a','c']
        })
df
Out[]: 
   A  B
0  a  b
1  b  a
2  a  c

# Get one hot encoding of columns B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df  
Out[]: 
       A  a  b  c
    0  a  0  1  0
    1  b  1  0  0
    2  a  0  0  1
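
As a sketch of the prefix remark above: get_dummies can also encode several columns in one call through its columns argument, and it then uses each column name as the prefix by default. The dtype=int argument is an addition made here (not part of the original answer) so that the output shows 0/1 integers; recent pandas versions return booleans otherwise.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['b', 'a', 'c'],
})

# Encode both columns at once; each dummy column is named <column>_<level>
# and the original columns A and B are dropped from the result.
encoded = pd.get_dummies(df, columns=['A', 'B'], dtype=int)
print(encoded)
```

Because the prefixes differ, the level 'a' of column A and the level 'a' of column B end up in separate columns instead of colliding.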

Approach 2: Use Scikit-learn

Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])   
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Here is the link for this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
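
The session above comes from an older scikit-learn release; the n_values_ and feature_indices_ attributes were deprecated and later removed. A minimal sketch of the same example, assuming scikit-learn 0.20 or newer, where the learned levels are exposed as categories_. The handle_unknown='ignore' setting is an optional choice made here for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One array of observed levels per input feature (2, 3 and 4 levels here).
print(enc.categories_)

# Same row as in the example above: 2 + 3 + 4 = 9 output columns.
row = enc.transform([[0, 1, 1]]).toarray()
print(row)

# With handle_unknown='ignore', a level never seen during fit
# (5 in the first feature) is encoded as all zeros for that feature.
unseen = enc.transform([[5, 1, 1]]).toarray()
print(unseen)
```

Keeping the fitted enc object around is what lets you apply the same mapping to validation or test data later.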

Answered by Cybernetic

It is much easier to use pandas for basic one-hot encoding. If you're looking for more options you can use scikit-learn.

For basic one-hot encoding with pandas you simply pass your data frame into the get_dummies function.

For example, if I have a dataframe called imdb_movies:

...and I want to one-hot encode the Rated column, I simply do this:

pd.get_dummies(imdb_movies.Rated)

This returns a new dataframe with a column for every "level" of rating that exists, along with either a 1 or 0 specifying the presence of that rating for a given observation.

Usually, we want this to be part of the original dataframe. In this case, we simply attach our new dummy coded frame onto the original frame using "column-binding".

We can column-bind by using the pandas concat function:

rated_dummies = pd.get_dummies(imdb_movies.Rated)
pd.concat([imdb_movies, rated_dummies], axis=1)

We can now run an analysis on our full dataframe.

SIMPLE UTILITY FUNCTION

I would recommend making yourself a utility function to do this quickly:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    return(res)

Usage:

encode_and_bind(imdb_movies, 'Rated')

Also, as per @pmalbu's comment, if you would like the function to remove the original feature_to_encode, then use this version:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return(res) 

You can encode multiple features at the same time as follows:

features_to_encode = ['feature_1', 'feature_2', 'feature_3',
                      'feature_4']
for feature in features_to_encode:
    # Reassign train_set so each pass builds on the previous result.
    train_set = encode_and_bind(train_set, feature)

Answered by Martin Thoma

You can do it with numpy.eye and the array element selection mechanism:

import numpy as np
nb_classes = 6
data = [[2, 3, 4, 0]]

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

The return value of indices_to_one_hot(data, nb_classes) is now

array([[ 0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.]])

The .reshape(-1) is there to make sure you have the right label format (you might also have [[2], [3], [4], [0]]).
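
To illustrate the reshape remark, here is a short sketch that feeds both label layouts through the function above. The names row_labels and col_labels are made up for this example.

```python
import numpy as np

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

row_labels = [[2, 3, 4, 0]]          # shape (1, 4)
col_labels = [[2], [3], [4], [0]]    # shape (4, 1)

# Both layouts are flattened to four indices, and each index selects
# the matching row of the 6x6 identity matrix.
a = indices_to_one_hot(row_labels, 6)
b = indices_to_one_hot(col_labels, 6)
print(a.shape, b.shape)
print(np.array_equal(a, b))
```

Without the reshape, indexing np.eye with a nested list would add an extra dimension to the result.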

Answered by Wboy

Firstly, easiest way to one hot encode: use Sklearn.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Secondly, I don't think using pandas to one hot encode is that simple (unconfirmed though)

Creating dummy variables in pandas for python

Lastly, is it necessary for you to one hot encode? One hot encoding adds one column per category level, which can greatly increase the number of features and drastically increase the run time of any classifier or anything else you are going to run, especially when each categorical feature has many levels. Instead you can do dummy coding (used in this answer to mean replacing each level with an integer code).

Using dummy encoding usually works well, for much less run time and complexity. A wise prof once told me, 'Less is More'.

Here's the code for my custom encoding function if you want.

from sklearn.preprocessing import LabelEncoder

# Auto-encodes any DataFrame column of type category or object.
def dummyEncode(df):
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            # Replace each level with an integer code (the encoder is refit per column).
            df[feature] = le.fit_transform(df[feature])
        except Exception:
            print('Error encoding ' + feature)
    return df

EDIT: Comparison to be clearer:

One-hot encoding: convert n levels into n indicator columns, or n-1 columns when one level (dog in the table below) is dropped as the reference.

Index  Animal         Index  cat  mouse
  1     dog             1     0     0
  2     cat       -->   2     1     0
  3    mouse            3     0     1

You can see how this will explode your memory if you have many different types (or levels) in your categorical feature. Keep in mind, this is just ONE column.

Dummy coding (as used in this answer: one integer code per level):

Index  Animal         Index  Animal
  1     dog             1      0   
  2     cat       -->   2      1 
  3    mouse            3      2

Convert to numerical representations instead. Greatly saves feature space, at the cost of a bit of accuracy.
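
To make the comparison concrete, here is a small sketch using pandas on the three-animal example. The variable names are illustrative and dtype=int is added only so the indicators print as 0/1. Note that pandas sorts the levels, so it drops cat rather than dog with drop_first=True, and the integer codes differ from the hand-written table. Integer codes also imply an ordering between levels, which some models will treat as meaningful.

```python
import pandas as pd

animals = pd.Series(['dog', 'cat', 'mouse'], name='Animal')

full = pd.get_dummies(animals, dtype=int)                      # n indicator columns
reduced = pd.get_dummies(animals, drop_first=True, dtype=int)  # n-1 indicator columns
codes = animals.astype('category').cat.codes                   # a single integer column

print(full.shape, reduced.shape)
print(list(reduced.columns))
print(codes.tolist())
```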

Answered by Qy Zuo

One hot encoding with pandas is very easy:

def one_hot(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to encode 
    @return a DataFrame with one-hot encoding
    """
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df

EDIT:

Another way to one_hot using sklearn's LabelBinarizer:

from sklearn.preprocessing import LabelBinarizer 
label_binarizer = LabelBinarizer()
label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later

def one_hot_encode(x):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
    """
    return label_binarizer.transform(x)
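
A brief usage sketch of LabelBinarizer, with a made-up label list standing in for all_your_labels_list. Keep in mind that with only two distinct labels LabelBinarizer returns a single column rather than two.

```python
from sklearn.preprocessing import LabelBinarizer

all_labels = ['cat', 'dog', 'mouse', 'dog']   # hypothetical label list

label_binarizer = LabelBinarizer()
label_binarizer.fit(all_labels)

# The learned classes are stored in sorted order and define the column layout.
print(label_binarizer.classes_)

encoded = label_binarizer.transform(['dog', 'cat'])
print(encoded)
```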

Answered by Dieter

You can use the numpy.eye function.

import numpy as np

def one_hot_encode(x, n_classes):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
     """
    return np.eye(n_classes)[x]

def main():
    labels = [0,1,2,3,4,3,2,1,0]
    n_classes = 5
    one_hot_list = one_hot_encode(labels, n_classes)
    print(one_hot_list)

if __name__ == "__main__":
    main()

Result

D:\Desktop>python test.py
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]]

Answered by Arshdeep Singh

pandas has an inbuilt function, get_dummies, to get a one hot encoding of particular column(s).

One-line code for one-hot encoding:

df=pd.concat([df,pd.get_dummies(df['column name'],prefix='column name')],axis=1).drop(['column name'],axis=1)

Answered by Josh Morel

Here is a solution using DictVectorizer and the pandas DataFrame.to_dict('records') method.

>>> import pandas as pd
>>> X = pd.DataFrame({'income': [100000,110000,90000,30000,14000,50000],
                      'country':['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],
                      'race':['White', 'Black', 'Latino', 'White', 'White', 'Black']
                     })

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer()
>>> qualitative_features = ['country','race']
>>> X_qual = v.fit_transform(X[qualitative_features].to_dict('records'))
>>> v.vocabulary_
{'country=CAN': 0,
 'country=MEX': 1,
 'country=US': 2,
 'race=Black': 3,
 'race=Latino': 4,
 'race=White': 5}

>>> X_qual.toarray()
array([[ 0.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  1.,  0.,  0.]])
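
One practical benefit of this approach is that the fitted vectorizer can be reused on new rows. The sketch below assumes a hypothetical X_new frame; a value that was not present during fit ('FRA' here) is silently ignored, so its columns stay at zero.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

X = pd.DataFrame({'country': ['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],
                  'race': ['White', 'Black', 'Latino', 'White', 'White', 'Black']})

v = DictVectorizer()
v.fit(X.to_dict('records'))

# Hypothetical new observations; 'FRA' never appeared during fit.
X_new = pd.DataFrame({'country': ['MEX', 'FRA'], 'race': ['Black', 'White']})

# transform reuses the column layout stored in v.vocabulary_.
new_encoded = v.transform(X_new.to_dict('records')).toarray()
print(new_encoded)
```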

Answered by Tukeys

One-hot encoding requires a bit more than converting the values to indicator variables. A typical ML process requires you to apply this encoding several times, to validation or test data sets, and to apply the model you construct to real-time observed data. You should therefore store the mapping (transform) that was used to construct the model. A good solution would use DictVectorizer, or LabelEncoder followed by get_dummies. Here is a function that you can use:

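import pandas as pd
from sklearn.preprocessing import LabelEncoder

def oneHotEncode2(df, le_dict=None):
    # Default to None: a mutable {} default would be shared between calls.
    if not le_dict:
        le_dict = {}
        columnsToEncode = list(df.select_dtypes(include=['category','object']))
        train = True
    else:
        columnsToEncode = le_dict.keys()
        train = False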

    for feature in columnsToEncode:
        if train:
            le_dict[feature] = LabelEncoder()
        try:
            if train:
                df[feature] = le_dict[feature].fit_transform(df[feature])
            else:
                df[feature] = le_dict[feature].transform(df[feature])

            df = pd.concat([df, 
                              pd.get_dummies(df[feature]).rename(columns=lambda x: feature + '_' + str(x))], axis=1)
            df = df.drop(feature, axis=1)
        except:
            print('Error encoding '+feature)
            #df[feature]  = df[feature].convert_objects(convert_numeric='force')
            df[feature]  = df[feature].apply(pd.to_numeric, errors='coerce')
    return (df, le_dict)

This works on a pandas dataframe and for each column of the dataframe it creates and returns a mapping back. So you would call it like this:

train_data, le_dict = oneHotEncode2(train_data)

Then on the test data, the call is made by passing the dictionary returned back from training:

test_data, _ = oneHotEncode2(test_data, le_dict)
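
If you prefer to stay with get_dummies alone, one common way to reuse the training-time mapping is to remember the training columns and reindex the encoded test frame to them. This is an alternative sketch, not the function above; the frames and column names are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'green', 'blue']})
test = pd.DataFrame({'color': ['green', 'purple']})   # 'purple' is unseen in train

train_enc = pd.get_dummies(train, columns=['color'], dtype=int)
train_columns = train_enc.columns   # store this alongside the model

# Align test to the training layout: columns missing from test are
# filled with 0, and columns for unseen levels are dropped.
test_enc = pd.get_dummies(test, columns=['color'], dtype=int)
test_enc = test_enc.reindex(columns=train_columns, fill_value=0)
print(test_enc)
```

The trade-off is that unseen levels disappear silently, so it may be worth logging them before reindexing.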

An equivalent method is to use DictVectorizer. A related post on the same topic is on my blog. I mention it here since it provides some reasoning behind this approach over simply using get_dummies (disclosure: this is my own blog).

Answered by Garima Jain

You can pass the data to catboost classifier without encoding. Catboost handles categorical variables itself by performing one-hot and target expanding mean encoding.
