pandas: handling unknown values for label encoding

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/40321232/

Date: 2020-09-14 02:18:42  Source: igfitidea

Handling unknown values for label encoding

Tags: python, pandas, scikit-learn, dummy-variable, one-hot-encoding

Asked by Georg Heiler

How can I handle unknown values for label encoding in sk-learn? The label encoder simply blows up with an exception when it encounters labels that were not seen during fitting.

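A minimal sketch of the failure mode being described (the city names are illustrative, not from the question):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['Paris', 'New York'])

try:
    le.transform(['Utila'])  # label not seen during fit
except ValueError as e:
    print('blows up:', e)
```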

What I want is the encoding of categorical variables via a one-hot encoder. However, sk-learn does not support strings for that, so I used a label encoder on each column.


My problem is that unknown labels show up in the cross-validation step of my pipeline. The basic one-hot encoder would have the option to ignore such cases. An a priori pandas.get_dummies / cat.codes is not sufficient, as the pipeline should work with real-life, fresh incoming data which might contain unknown labels as well.


Would it be possible to use a CountVectorizer for this purpose?

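For what it's worth, a CountVectorizer can indeed tolerate unseen categories, because tokens absent from the fitted vocabulary are silently dropped at transform time. A hedged sketch (not from the question; the data is illustrative), using a custom analyzer so each cell is treated as a single token:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Treat each value as exactly one token (bypasses text tokenization).
vec = CountVectorizer(analyzer=lambda s: [s])

train_cities = ['Paris', 'New York', 'Paris']
test_cities = ['Paris', 'Utila']  # 'Utila' is unseen

X_train = vec.fit_transform(train_cities)
# Unseen tokens are ignored: the row for 'Utila' comes out all zeros.
X_test = vec.transform(test_cities)
print(X_test.toarray())
```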

Accepted answer by marbel

EDIT:


A more recent and simpler way of handling this problem with scikit-learn is to use the class sklearn.preprocessing.OneHotEncoder:


from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes unseen categories as all-zero rows
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

enc.transform(train).toarray()
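A self-contained toy run of this approach (the data here is illustrative, not from the original answer):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'city': ['Paris', 'New York', 'Paris']})
test = pd.DataFrame({'city': ['Paris', 'Utila']})  # 'Utila' was never seen

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# The unknown category 'Utila' is encoded as an all-zero row.
print(enc.transform(test).toarray())
# [[0. 1.]
#  [0. 0.]]
```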

Old answer:


There are several answers that mention pandas.get_dummies as a method for this, but I feel the LabelEncoder approach is cleaner for implementing a model. Other similar answers mention using DictVectorizer for this, but again, converting the entire DataFrame to a dict is probably not a great idea.


Let's assume the following problematic columns:


from sklearn import preprocessing
import numpy as np
import pandas as pd

train = {'city': ['Buenos Aires', 'New York', 'Istambul', 'Buenos Aires', 'Paris', 'Paris'],
        'letters': ['a', 'b', 'c', 'd', 'a', 'b']}
train = pd.DataFrame(train)

test = {'city': ['Buenos Aires', 'New York', 'Istambul', 'Buenos Aires', 'Paris', 'Utila'],
        'letters': ['a', 'b', 'c', 'a', 'b', 'b']}
test = pd.DataFrame(test)

Utila is a rarer city that is present only in the test set, not in the training data, so we can treat it as new data arriving at inference time.


The trick is converting this value to "other" and including "other" in the LabelEncoder object's classes. Then we can reuse the encoder in production.


import bisect  # missing from the original snippet; needed for insort_left

c = 'city'
le = preprocessing.LabelEncoder()
train[c] = le.fit_transform(train[c])
# Map any category unseen during fit to the catch-all value 'other'
test[c] = test[c].map(lambda s: 'other' if s not in le.classes_ else s)
# Insert 'other' into the encoder's sorted class list so transform accepts it
le_classes = le.classes_.tolist()
bisect.insort_left(le_classes, 'other')
le.classes_ = le_classes
test[c] = le.transform(test[c])
test

  city  letters
0   1   a
1   3   b
2   2   c
3   1   a
4   4   b
5   0   b

To apply it to new data, all we need is to save a le object for each column, which can easily be done with Pickle.

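A minimal sketch of that persistence step (the class list and file name here are illustrative, not from the original answer):

```python
import os
import pickle
import tempfile
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['Buenos Aires', 'New York', 'Paris', 'other'])

# Persist the fitted encoder for this column.
path = os.path.join(tempfile.gettempdir(), 'le_city.pkl')
with open(path, 'wb') as f:
    pickle.dump(le, f)

# Later, in production, reload it and reuse the exact same mapping.
with open(path, 'rb') as f:
    le_loaded = pickle.load(f)

print(list(le_loaded.classes_))
```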

This answer is based on this question, which I felt wasn't totally clear to me; therefore I added this example.
