Label encoding across multiple columns in scikit-learn

Note: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/24458645/

Date: 2020-08-19 04:38:57  Source: igfitidea

python pandas scikit-learn neuraxle

Asked by Bryan

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather have one big LabelEncoder object that works across all my columns of data.

Throwing the entire DataFrame into LabelEncoder creates the error below. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.

import pandas
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

le = preprocessing.LabelEncoder()

le.fit(df)

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)

Any thoughts on how to get around this problem?

Answered by Fred Foo

No, LabelEncoder does not do this. It takes 1-d arrays of class labels and produces 1-d arrays. It's designed to handle class labels in classification problems, not arbitrary data, and any attempt to force it into other uses will require code to transform the actual problem to the problem it solves (and the solution back to the original space).
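
To make the 1-d restriction concrete, here is a minimal sketch that reuses the dummy data from the question. It assumes a reasonably recent scikit-learn, where the error message differs from the traceback above but is still a ValueError; the variable names fit_2d_failed and pets_codes are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

le = LabelEncoder()

# A whole (6, 3) frame is rejected because LabelEncoder expects 1-d input.
try:
    le.fit(df)
    fit_2d_failed = False
except ValueError:
    fit_2d_failed = True

# A single column is 1-d, so this works.
pets_codes = le.fit_transform(df['pets'])
print(fit_2d_failed, list(pets_codes))
```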

Answered by TehTechGuy

Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder() object that can be used to represent your columns, all you have to do is:

le.fit(df.columns)

In the above code you will have a unique number corresponding to each column. More precisely, you will have a 1:1 mapping of df.columns to le.transform(df.columns.values). To get a column's encoding, simply pass it to le.transform(...). As an example, the following will get the encoding for each column:

le.transform(df.columns.values)

Assuming you want to create a sklearn.preprocessing.LabelEncoder() object for all of your row labels, you can do the following:

le.fit([y for x in df.values for y in x])

In this case, you most likely have non-unique row labels (as shown in your question). To see what classes the encoder created, inspect le.classes_. You'll note that this should have the same elements as set(y for x in df.values for y in x). Once again, to convert a row label to an encoded label use le.transform(...). As an example, if you want to retrieve the label for the first column in the df.columns array and the first row, you could do this:

le.transform([df.at[0, df.columns[0]]])
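
As a rough illustration of the "one encoder for every cell" idea, the sketch below fits a single LabelEncoder on all values of the question's dummy frame and then encodes and decodes the whole frame. Flattening with df.to_numpy().ravel() and the names encoded and decoded are choices made for this example, not part of the answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# Fit one encoder on every cell of the frame, flattened to 1-d.
le = LabelEncoder()
le.fit(df.to_numpy().ravel())

# Encode each column with the shared encoder, then decode it again.
encoded = df.apply(le.transform)
decoded = encoded.apply(le.inverse_transform)

print(le.classes_)
print(encoded)
```

Note that with a shared encoder the integer codes are unique across the whole frame, so the same string appearing in two columns gets the same code.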

The question you had in your comment is a bit more complicated, but can still be accomplished:

le.fit([str(z) for z in set((x[0], y) for x in df.items() for y in x[1])])

The above code does the following:

  1. Builds the set of unique (column, row) pairs.
  2. Represents each pair as a string version of the tuple. This is a workaround for the LabelEncoder class not supporting tuples as class names.
  3. Fits the new items to the LabelEncoder.

Now to use this new model it's a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:

le.transform([str((df.columns[0], df.at[0, df.columns[0]]))])

Remember that each lookup is now a string representation of a tuple that contains the (column, row).
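
The (column, row) pair approach can be sketched end to end as follows. This is only an illustration on the question's dummy data; the names pairs and code are hypothetical, and df.items() and df.at are used as current pandas accessors.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# One string per unique (column name, cell value) pair.
pairs = sorted({str((col, value))
                for col, series in df.items()
                for value in series})

le = LabelEncoder().fit(pairs)

# Look up the code for the first column and first row.
first_col = df.columns[0]
code = le.transform([str((first_col, df.at[0, first_col]))])
print(code, le.inverse_transform(code))
```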

Answered by PriceHardman

As mentioned by larsmans, LabelEncoder() only takes a 1-d array as an argument. That said, it is quite easy to roll your own label encoder that operates on multiple columns of your choosing, and returns a transformed dataframe. My code here is based in part on Zac Stewart's excellent blog post found here.

Creating a custom encoder simply involves creating a class that responds to the fit(), transform(), and fit_transform() methods. In your case, a good start might be something like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

# Create some toy data in a Pandas dataframe
fruit_data = pd.DataFrame({
    'fruit':  ['apple','orange','pear','orange'],
    'color':  ['red','orange','green','green'],
    'weight': [5,6,3,4]
})

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

Suppose we want to encode our two categorical attributes (fruit and color), while leaving the numeric attribute weight alone. We could do this as follows:

MultiColumnLabelEncoder(columns = ['fruit','color']).fit_transform(fruit_data)

This transforms our fruit_data dataset from the original table to its encoded version. [The two images from the original post are not available.]

Passing it a dataframe consisting entirely of categorical variables and omitting the columns parameter will result in every column being encoded (which I believe is what you were originally looking for):

MultiColumnLabelEncoder().fit_transform(fruit_data.drop('weight',axis=1))

This transforms the categorical-only table into its encoded version. [The two images from the original post are not available.]

Note that it'll probably choke when it tries to encode attributes that are already numeric (add some code to handle this if you like).

Another nice feature about this is that we can use this custom transformer in a pipeline:

encoding_pipeline = Pipeline([
    ('encoding',MultiColumnLabelEncoder(columns=['fruit','color']))
    # add more pipeline steps as needed
])
encoding_pipeline.fit_transform(fruit_data)
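
Since the before and after images are missing, the sketch below recomputes the encoded frame for the toy data. It applies LabelEncoder per column directly rather than going through the class above, which should give the same codes because the class does the same thing column by column; the variable name encoded is just for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

fruit_data = pd.DataFrame({
    'fruit':  ['apple', 'orange', 'pear', 'orange'],
    'color':  ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4],
})

# Encode only the two categorical columns; leave weight untouched.
encoded = fruit_data.copy()
for col in ['fruit', 'color']:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

print(encoded)
```

LabelEncoder assigns codes in sorted order of the labels, so apple, orange, pear become 0, 1, 2 and green, orange, red become 0, 1, 2.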

Answered by Napitupulu Jon

You can easily do this though,

df.apply(LabelEncoder().fit_transform)

EDIT2:

In scikit-learn 0.20, the recommended way is

OneHotEncoder().fit_transform(df)

as the OneHotEncoder now supports string input. Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.
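
OneHotEncoder produces indicator columns rather than one integer per value. If integer codes per column are what you want, scikit-learn 0.20 and later also ship OrdinalEncoder, which accepts a 2-d input. The sketch below is one possible way to use it on the question's dummy data, not something taken from this answer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

enc = OrdinalEncoder()

# fit_transform returns a float array with one code per cell,
# computed independently for each column.
encoded = pd.DataFrame(enc.fit_transform(df),
                       columns=df.columns, index=df.index)

# The fitted encoder can map the codes back to the original strings.
decoded = pd.DataFrame(enc.inverse_transform(encoded.to_numpy()),
                       columns=df.columns, index=df.index)

print(encoded)
print(enc.categories_)
```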

EDIT:

Since this answer was written over a year ago and has received many upvotes (including a bounty), I should probably extend it further.

For inverse_transform and transform, you need a small hack.

from collections import defaultdict
d = defaultdict(LabelEncoder)

With this, you now retain each column's LabelEncoder in a dictionary keyed by column name.

# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Inverse the encoded
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Using the dictionary to label future data
df.apply(lambda x: d[x.name].transform(x))
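
Here is a self-contained version of the dictionary approach on the question's dummy data, plus one caveat worth knowing: a fitted LabelEncoder raises a ValueError when future data contains a label it has not seen. The frame new_data and the flag unseen_failed are made up for this illustration.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# One LabelEncoder per column, created on first access by column name.
d = defaultdict(LabelEncoder)

encoded = df.apply(lambda x: d[x.name].fit_transform(x))
decoded = encoded.apply(lambda x: d[x.name].inverse_transform(x))

# Hypothetical future data with a pet the encoder never saw.
new_data = pd.DataFrame({
    'pets': ['dog', 'parrot'],
    'owner': ['Ron', 'Ron'],
    'location': ['New_York', 'New_York'],
})
try:
    new_data.apply(lambda x: d[x.name].transform(x))
    unseen_failed = False
except ValueError:
    unseen_failed = True

print(encoded)
print(unseen_failed)
```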

Answered by Jason Wolosonovich

This is a year and a half after the fact, but I too needed to be able to .transform() multiple pandas dataframe columns at once (and be able to .inverse_transform() them as well). This expands upon the excellent suggestion of @PriceHardman above:

import numpy as np
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder(LabelEncoder):
    """
    Wraps sklearn LabelEncoder functionality for use on multiple columns of a
    pandas dataframe.

    """
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, dframe):
        """
        Fit label encoder to pandas columns.

        Access individual column classes via indexing `self.all_classes_`

        Access individual column encoders via indexing
        `self.all_encoders_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                # fit LabelEncoder to get `classes_` for the column
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                # append this column's encoder
                self.all_encoders_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
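            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            # container for the per-column encoders
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)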
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
        return self

    def fit_transform(self, dframe):
        """
        Fit label encoder and return encoded labels.

        Access individual column classes via indexing
        `self.all_classes_`

        Access individual column encoders via indexing
        `self.all_encoders_`

        Access individual column encoded labels via indexing
        `self.all_labels_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                          dtype=object)
            for idx, column in enumerate(self.columns):
                # instantiate LabelEncoder
                le = LabelEncoder()
                # fit and transform labels in the column
                dframe.loc[:, column] =\
                    le.fit_transform(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
                # keep the encoded labels for this column
                self.all_labels_[idx] = dframe.loc[:, column].values
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
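            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            # container for the per-column encoders
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)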
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                dframe.loc[:, column] = le.fit_transform(
                        dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
        return dframe.loc[:, self.columns].values

    def transform(self, dframe):
        """
        Transform labels to normalized encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[
                    idx].transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values

    def inverse_transform(self, dframe):
        """
        Transform labels back to original encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values

Example:

If df and df_copy are mixed-type pandas dataframes, you can apply the MultiColumnLabelEncoder() to the dtype=object columns in the following way:

# get `object` columns
df_object_columns = df.iloc[:, :].select_dtypes(include=['object']).columns
df_copy_object_columns = df_copy.iloc[:, :].select_dtypes(include=['object']).columns

# instantiate `MultiColumnLabelEncoder`
mcle = MultiColumnLabelEncoder(columns=df_object_columns)

# fit to `df` data
mcle.fit(df)

# transform the `df` data
mcle.transform(df)

# returns output like below
array([[1, 0, 0, ..., 1, 1, 0],
       [0, 5, 1, ..., 1, 1, 2],
       [1, 1, 1, ..., 1, 1, 2],
       ..., 
       [3, 5, 1, ..., 1, 1, 2],

# transform `df_copy` data
mcle.transform(df_copy)

# returns output like below (assuming the respective columns
# of `df_copy` contain the same unique values as that particular
# column in `df`)
array([[1, 0, 0, ..., 1, 1, 0],
       [0, 5, 1, ..., 1, 1, 2],
       [1, 1, 1, ..., 1, 1, 2],
       ..., 
       [3, 5, 1, ..., 1, 1, 2],

# inverse `df` data
mcle.inverse_transform(df)

# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
       ..., 
       ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)

# inverse `df_copy` data
mcle.inverse_transform(df_copy)

# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
       ..., 
       ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)

You can access individual column classes, column labels, and column encoders used to fit each column via indexing:

mcle.all_classes_
mcle.all_encoders_
mcle.all_labels_

Answered by Alexander

We don't need a LabelEncoder.

You can convert the columns to categoricals and then get their codes. I used a dictionary comprehension below to apply this process to every column and wrap the result back into a dataframe of the same shape with identical indices and column names.

>>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1

To create a mapping dictionary, you can just enumerate the categories using a dictionary comprehension:

>>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)} 
     for col in df}

{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
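
To show that the categorical codes can be mapped back, here is a small round-trip sketch on the question's dummy data. It keeps each column's categories and uses them to index back from the codes; the names cats, codes and decoded are only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# Keep the sorted categories of every column so codes can be decoded later.
cats = {col: df[col].astype('category').cat.categories for col in df}

# Integer codes per column, same shape and index as the original frame.
codes = pd.DataFrame(
    {col: df[col].astype('category').cat.codes for col in df},
    index=df.index,
)

# Decode by indexing each column's categories with its codes.
decoded = pd.DataFrame(
    {col: cats[col].to_numpy()[codes[col].to_numpy()] for col in codes},
    index=codes.index,
)

print(codes)
print(decoded)
```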

Answered by Anurag Priyadarshi

This does not directly answer your question (for which Napitupulu Jon and PriceHardman have fantastic replies).

However, for some classification tasks you could use

pandas.get_dummies(input_df) 

This takes a dataframe with categorical data and returns a dataframe with binary values. The variable values are encoded into the column names of the resulting dataframe.
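
A quick sketch of what get_dummies returns for the question's dummy data, assuming the default column naming of pandas (column name, underscore, value). The variable name dummies is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# One indicator column per (original column, distinct value) pair.
dummies = pd.get_dummies(df)

print(dummies.columns.tolist())
print(dummies.shape)
```

Each row has exactly one active indicator per original column, so the result is much wider than a label-encoded frame when columns have many distinct values.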

Answered by Puja Sharma

With a single column, label encoding and its inverse transform are easy. Here is how to do it when there are multiple columns:

from sklearn import preprocessing

def stringtocategory(dataset):
    '''
    Label-encode the object-dtype columns of a dataframe.

    @author puja.sharma
    @param dataset dataframe whose object columns are to be label encoded
    @return the label-encoded data and its inverse transform (the original values)
    '''
    # copy so that the caller's dataframe is not modified
    data_original = dataset.copy()
    data_transformed = dataset.copy()
    for y in dataset.columns:
        # object dtype means the column holds strings or chars
        if dataset[y].dtype == object:
            print("The string type features are  : " + y)
            le = preprocessing.LabelEncoder()
            le.fit(dataset[y].unique())
            # label encoded data
            data_transformed[y] = le.transform(dataset[y])
            # inverse label transform data
            data_original[y] = le.inverse_transform(data_transformed[y])
    return data_transformed, data_original

Answered by Dror

Following up on the comments raised on the solution of @PriceHardman, I would propose the following version of the class:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import LabelEncoder

# NOTE: `pdu` is a helper module of the answer's author that is not shown
# here; its two functions validate the `cols` argument.

class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
    def __init__(self, cols=None):
        pdu._is_cols_input_valid(cols)
        self.cols = cols
        # one encoder per column, fitted later in `fit`
        self.les = {col: LabelEncoder() for col in cols}
        self._is_fitted = False

    def transform(self, df, **transform_params):
        """
        Label-encode ``cols`` of ``df`` using the fitted encoders

        Parameters
        ----------
        df : DataFrame
            DataFrame to be preprocessed
        """
        if not self._is_fitted:
            raise NotFittedError("Fitting was not performed")
        pdu._is_cols_subset_of_df_cols(self.cols, df)

        df = df.copy()

        label_enc_dict = {}
        for col in self.cols:
            label_enc_dict[col] = self.les[col].transform(df[col])

        labelenc_cols = pd.DataFrame(label_enc_dict,
            # The index of the resulting DataFrame should be assigned and
            # equal to the one of the original DataFrame. Otherwise, upon
            # concatenation NaNs will be introduced.
            index=df.index
        )

        for col in self.cols:
            df[col] = labelenc_cols[col]
        return df

    def fit(self, df, y=None, **fit_params):
        """
        Fitting the preprocessing

        Parameters
        ----------
        df : DataFrame
            Data to use for fitting.
            In many cases, should be ``X_train``.
        """
        pdu._is_cols_subset_of_df_cols(self.cols, df)
        for col in self.cols:
            self.les[col].fit(df[col])
        self._is_fitted = True
        return self

This class fits the encoder on the training set and uses the fitted version when transforming. Initial version of the code can be found here.

Answered by Tom

A short way to apply LabelEncoder() to multiple columns with a dict():

from sklearn.preprocessing import LabelEncoder

# `columns` is the list of column names to encode
le_dict = {col: LabelEncoder() for col in columns}
for col in columns:
    le_dict[col].fit_transform(df[col])

You can then use this le_dict to label-encode the same column of any other dataframe:

le_dict[col].transform(df_another[col])