sklearn.LabelEncoder with never seen before values

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/21057621/

Date: 2020-08-18 21:56:31  Source: igfitidea

Tags: python, scikit-learn

Asked by cjauvin

If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.

The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:

import numpy as np
from sklearn.preprocessing import LabelEncoder

# train and test are pandas.DataFrame's and c is whatever column
le = LabelEncoder()
le.fit(train[c])
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])

This works, but is there a better solution?

Update

As @sapo_cosmico points out in a comment, it seems that the above doesn't work anymore, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know if it was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:

import bisect
le_classes = le.classes_.tolist()
bisect.insort_left(le_classes, '<unknown>')
le.classes_ = le_classes

However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.
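
For reference, the two steps from the question (mapping unseen values to "<unknown>" and inserting that class in sorted order) can be put together as a small end-to-end sketch. The data below is made up for illustration, and converting the class list back to a NumPy array is my own assumption about what keeps inverse_transform working; it is not part of the original snippet. Inserting a class shifts the existing codes, so the training column should be transformed after the insertion.

```python
import bisect

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative data only: "zebra" never appears in the training values.
train_values = ["cat", "dog", "fish"]
test_values = ["dog", "zebra", "cat"]

le = LabelEncoder()
le.fit(train_values)

# Insert the placeholder class while keeping classes_ sorted.
classes = le.classes_.tolist()
bisect.insort_left(classes, "<unknown>")
le.classes_ = np.array(classes)  # assumption: keep classes_ an ndarray

# Map every value the encoder has not seen to the placeholder, then encode.
known = set(classes)
test_clean = [v if v in known else "<unknown>" for v in test_values]
encoded = le.transform(test_clean)
decoded = le.inverse_transform(encoded)
```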

Answered by lmjohns3

I get the impression that what you've done is quite similar to what other people do when faced with this situation.

There's been some effort to add the ability to encode unseen labels to the LabelEncoder (see especially https://github.com/scikit-learn/scikit-learn/pull/3483 and https://github.com/scikit-learn/scikit-learn/pull/3599), but changing the existing behavior is actually more difficult than it seems at first glance.

For now it looks like handling "out-of-vocabulary" labels is left to individual users of scikit-learn.
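
As a side note, later scikit-learn releases (0.24 and newer, to my knowledge) did add an option for unseen categories, though on OrdinalEncoder rather than LabelEncoder. The sketch below assumes such a release is installed; the data and the choice of -1 as the unknown code are illustrative. OrdinalEncoder is meant for feature columns, so it expects 2-D input.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Assumes scikit-learn >= 0.24, where these two parameters exist.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

X_train = np.array([["a"], ["b"], ["c"]])  # one feature column, 2-D
X_test = np.array([["b"], ["z"]])          # "z" was never seen in training

enc.fit(X_train)
# transform returns floats by default; cast and flatten for readability.
codes = enc.transform(X_test).astype(int).ravel()
```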

Answered by sapo_cosmico

I ended up switching to Pandas' get_dummies due to this problem of unseen data.

  • create the dummies on the training data
    dummy_train = pd.get_dummies(train)
  • create the dummies on the new (unseen) data
    dummy_new = pd.get_dummies(new_data)
  • re-index the new data to the columns of the training data, filling the missing values with 0
    dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

Effectively, any categories that appear only in the new data will not go into the classifier, but I think that should not cause problems, as it would not know what to do with them.
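
A small runnable sketch of these three steps, using made-up data (the column name and values are assumptions for illustration). The final cast to int is my addition so that the 0 fill values and the dummy columns share one dtype.

```python
import pandas as pd

# Illustrative data: "purple" appears only in the new data.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
new_data = pd.DataFrame({"color": ["red", "purple"]})

dummy_train = pd.get_dummies(train)
dummy_new = pd.get_dummies(new_data)

# Align to the training columns: columns missing from the new data are
# filled with 0, and columns for unseen categories are dropped.
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0).astype(int)
```

The row for "purple" ends up all zeros, which is the behavior the answer describes: the unseen category simply does not reach the classifier.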

Answered by Jason

I know two devs who are working on building wrappers around transformers and Sklearn pipelines. They have 2 robust encoder transformers (one dummy encoder and one label encoder) that can handle unseen values. Here is the documentation to their skutil library. Search for skutil.preprocessing.OneHotCategoricalEncoder or skutil.preprocessing.SafeLabelEncoder. In their SafeLabelEncoder(), unseen values are automatically encoded to 999999.

Answered by Yury Wallet

I was trying to deal with this problem and found two handy ways to encode categorical data from train and test sets, with and without using LabelEncoder. New categories are filled with some known category "c" (like "other" or "missing"). The first method seems to work faster. Hope that will help you.

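import time

import pandas as pd
from pandas.api.types import CategoricalDtype
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame()
df["a"] = ['a', 'b', 'c', 'd']
df["b"] = ['a', 'b', 'e', 'd']

# Method 1: LabelEncoder + map
t = time.perf_counter()  # time.clock() was removed in Python 3.8
le = LabelEncoder()
suf = "_le"
col = "a"
df[col + suf] = le.fit_transform(df[col])
# class -> code lookup built from the fitted encoder
dic = dict(zip(le.classes_, le.transform(le.classes_)))
col = 'b'
# unseen values become NaN, then fall back to the code of the known category "c"
df[col + suf] = df[col].map(dic).fillna(dic["c"]).astype(int)
print(time.perf_counter() - t)

# ---
# Method 2: pandas category
t = time.perf_counter()
df["d"] = df["a"].astype('category').cat.codes
dic = df["a"].astype('category').cat.categories.tolist()
# astype('category', categories=...) no longer exists in pandas; use CategoricalDtype
df['f'] = df['b'].astype(CategoricalDtype(categories=dic)).fillna("c").cat.codes
print(df.dtypes)
print(time.perf_counter() - t)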

Answered by Namrata Tolani

If it is just about training and testing a model, why not just label-encode the entire dataset, and then use the classes generated by the encoder object?

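from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# fit on the labels of the entire dataset so that every class is known
encoder.fit(df["label"])
train_y = encoder.transform(train_y)
test_y = encoder.transform(test_y)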

Answered by Ethan Kulla

I recently ran into this problem and was able to come up with a pretty quick solution. My answer solves a little more than just this problem, but it will easily work for your issue too. (I think it's pretty cool.)

I am working with pandas data frames and originally used sklearn's LabelEncoder() to encode my data, which I would then pickle to use in other modules in my program.

However, the label encoder in sklearn's preprocessing module does not have the ability to add new values to the encoding. I solved the problem of encoding multiple columns and saving the mapping values, as well as being able to add new values to the encoder, as follows (here's a rough outline of what I did):

encoding_dict = dict()
for col in cols_to_encode:
    #get unique values in the column to encode
    values = df[col].value_counts().index.tolist()

    # create a dictionary of values and corresponding number {value, number}
    dict_values = {value: count for value, count in zip(values, range(1,len(values)+1))}

    # save the values to encode in the dictionary
    encoding_dict[col] = dict_values

    # replace the values with the corresponding number from the dictionary
    df[col] = df[col].map(lambda x: dict_values.get(x))

Then you can simply save the dictionary to a JSON file, load it again later, and extend it by adding any new value together with its corresponding integer.
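
The save, reload and extend steps could look roughly like the sketch below. The file location, the sample mapping and the add_value helper are hypothetical names introduced only for illustration; the original answer does not show this part. Note that JSON object keys are always strings, so this round trip only preserves string category values as-is.

```python
import json
import os
import tempfile

# Illustrative mapping in the same shape as encoding_dict: {column: {value: code}}
encoding_dict = {"country": {"US": 1, "Canada": 2}}

# Save the mapping (a temporary directory is used here purely for the example).
path = os.path.join(tempfile.mkdtemp(), "encoding.json")
with open(path, "w") as f:
    json.dump(encoding_dict, f)

# Later, possibly in another module: load it back.
with open(path) as f:
    loaded = json.load(f)


def add_value(mapping, value):
    """Hypothetical helper: give `value` the next free integer if it is unmapped."""
    if value not in mapping:
        mapping[value] = max(mapping.values(), default=0) + 1
    return mapping[value]


new_code = add_value(loaded["country"], "India")  # unseen value gets a new code
same_code = add_value(loaded["country"], "US")    # known value keeps its code
```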

I'll explain the reasoning behind using map() instead of replace(). I found that the pandas replace() function took over a minute to work through around 117,000 rows. Using map() brought that time down to just over 100 ms.

TLDR: instead of using sklearn's preprocessing, just work with your dataframe by making a mapping dictionary and mapping the values yourself.

Answered by Rani

LabelEncoder is basically a dictionary. You can extract and use it for future encoding:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(X)

le_dict = dict(zip(le.classes_, le.transform(le.classes_)))

Retrieve the label for a single new item; if the item is missing, set the value to unknown:

le_dict.get(new_item, '<Unknown>')

Retrieve labels for a Dataframe column:

df[your_col].apply(lambda x: le_dict.get(x, <unknown_value>))
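
A self-contained sketch of the same idea with concrete values. The data, the column names and the use of -1 as the unknown code are assumptions for illustration; any value that cannot collide with a real class code would do.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["apple", "banana", "cherry"])  # illustrative training labels

# Extract the fitted mapping: class -> integer code.
le_dict = dict(zip(le.classes_, le.transform(le.classes_)))

UNKNOWN_CODE = -1  # assumption: a code that no real class uses

# Single item lookup with a fallback for unseen values.
single = le_dict.get("kiwi", UNKNOWN_CODE)

# Whole column: unseen values become NaN in map(), then get the fallback code.
df = pd.DataFrame({"fruit": ["banana", "kiwi", "apple"]})
df["fruit_code"] = df["fruit"].map(le_dict).fillna(UNKNOWN_CODE).astype(int)
```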

Answered by Vinoj John Hosan

I have created a class to support this. If a new label comes in, it is assigned to the Unknown class.

from sklearn.preprocessing import LabelEncoder
import numpy as np


class LabelEncoderExt(object):
    def __init__(self):
        """
        It differs from LabelEncoder by handling new classes and providing a value for it [Unknown]
        Unknown will be added in fit and transform will take care of new item. It gives unknown class id
        """
        self.label_encoder = LabelEncoder()
        # self.classes_ = self.label_encoder.classes_

    def fit(self, data_list):
        """
        This will fit the encoder for all the unique values and introduce unknown value
        :param data_list: A list of string
        :return: self
        """
        self.label_encoder = self.label_encoder.fit(list(data_list) + ['Unknown'])
        self.classes_ = self.label_encoder.classes_

        return self

    def transform(self, data_list):
        """
        This will transform the data_list to id list where the new values get assigned to Unknown class
        :param data_list:
        :return:
        """
        new_data_list = list(data_list)
        for unique_item in np.unique(data_list):
            if unique_item not in self.label_encoder.classes_:
                new_data_list = ['Unknown' if x==unique_item else x for x in new_data_list]

        return self.label_encoder.transform(new_data_list)

The sample usage:

country_list = ['Argentina', 'Australia', 'Canada', 'France', 'Italy', 'Spain', 'US', 'Canada', 'Argentina', 'US']

label_encoder = LabelEncoderExt()

label_encoder.fit(country_list)
print(label_encoder.classes_) # you can see new class called Unknown
print(label_encoder.transform(country_list))


new_country_list = ['Canada', 'France', 'Italy', 'Spain', 'US', 'India', 'Pakistan', 'South Africa']
print(label_encoder.transform(new_country_list))
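
The replacement loop inside transform can also be written without iterating over the unique items. The following is a sketch of that variation using np.isin and np.where, not the class author's code; the country names are illustrative and "Unknown" is fitted as an ordinary class, as in the class above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Fit on the known labels plus the placeholder class.
le.fit(["Canada", "France", "US", "Unknown"])

new_values = np.array(["Canada", "India", "US", "Pakistan"])

# Boolean mask of values the encoder already knows.
is_known = np.isin(new_values, le.classes_)
# Keep known values, replace everything else with the placeholder.
cleaned = np.where(is_known, new_values, "Unknown")
codes = le.transform(cleaned)
```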

Answered by nonameforpirate

I faced the same problem and realized that my encoder was somehow mixing values across the columns of my dataframe. Let's say you run one encoder over several columns: when assigning numbers to labels, the encoder writes the numbers automatically, and it sometimes turns out that two different columns have similar values. What I did to solve the problem was to create an instance of LabelEncoder() for each column in my pandas DataFrame, and I got a nice result.

encoder1 = LabelEncoder()
encoder2 = LabelEncoder()
encoder3 = LabelEncoder()

df['col1'] = encoder1.fit_transform(list(df['col1'].values))
df['col2'] = encoder2.fit_transform(list(df['col2'].values))
df['col3'] = encoder3.fit_transform(list(df['col3'].values))
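
With more than a handful of columns, the same pattern can be kept in a dictionary keyed by column name. This is a sketch with made-up data; the encoders dictionary is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative frame with two categorical columns.
df = pd.DataFrame({
    "col1": ["x", "y", "x"],
    "col2": ["low", "high", "low"],
})

# One independent encoder per column, so the mappings never mix.
encoders = {}
for col in ["col1", "col2"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Each column can be decoded with its own encoder.
restored = encoders["col2"].inverse_transform(df["col2"])
```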

Answered by Aung

Here is an approach that uses a relatively new feature of pandas. The main motivation is that machine learning packages like 'lightgbm' can accept pandas category columns as features, which is better than one-hot encoding in some situations. In this example, the transformer returns integers, but it can also change the data type, and it replaces unseen categorical values with -1.

from collections import defaultdict
from sklearn.base import BaseEstimator,TransformerMixin
from pandas.api.types import CategoricalDtype
import pandas as pd
import numpy as np

class PandasLabelEncoder(BaseEstimator,TransformerMixin):
    def __init__(self):
        self.label_dict = defaultdict(list)

    def fit(self, X):
        X = X.astype('category')
        cols = X.columns
        values = list(map(lambda col: X[col].cat.categories, cols))
        self.label_dict = dict(zip(cols,values))
        # return as category for xgboost or lightgbm 
        return self

    def transform(self,X):
        # check missing columns
        missing_col=set(X.columns)-set(self.label_dict.keys())
        if missing_col:
            raise ValueError('the column named {} is not in the label dictionary. Check your fitting data.'.format(missing_col)) 
        return X.apply(lambda x: x.astype('category').cat.set_categories(self.label_dict[x.name]).cat.codes.astype('category').cat.set_categories(np.arange(len(self.label_dict[x.name]))))


    def inverse_transform(self,X):
        return X.apply(lambda x: pd.Categorical.from_codes(codes=x.values,
                                                           categories=self.label_dict[x.name]))

dff1 = pd.DataFrame({'One': list('ABCC'), 'Two': list('bccd')})
dff2 = pd.DataFrame({'One': list('ABCDE'), 'Two': list('debca')})


enc=PandasLabelEncoder()
enc.fit_transform(dff1)
One Two
0   0   0
1   1   1
2   2   1
3   2   2
dff3=enc.transform(dff2)
dff3
    One Two
0   0   2
1   1   -1
2   2   0
3   -1  1
4   -1  -1
enc.inverse_transform(dff3)
One Two
0   A   d
1   B   NaN
2   C   b
3   NaN c
4   NaN NaN
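
The pandas behavior this transformer builds on can be seen in isolation: when a column is converted to a Categorical with a fixed list of categories, any value outside that list becomes missing and its code is -1. A minimal sketch, reusing the values of the 'Two' column from the example above:

```python
import pandas as pd

train = pd.Series(list("bccd"))   # values seen at fit time
test = pd.Series(list("debca"))   # "e" and "a" are unseen

# Categories learned from the training data (sorted: b, c, d).
categories = train.astype("category").cat.categories

# Encode the test data against the fixed categories; unseen values get -1.
codes = pd.Categorical(test, categories=categories).codes
```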