Handling unknown values for label encoding

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/40321232/
Asked by Georg Heiler
How can I handle unknown values for label encoding in sk-learn? The label encoder simply blows up with an exception when new labels are detected.
What I want is to encode categorical variables via a one-hot encoder. However, sk-learn does not support strings for that, so I used a label encoder on each column.
My problem is that unknown labels show up in the cross-validation step of my pipeline. The basic one-hot encoder would have the option to ignore such cases. An a-priori pandas.get_dummies / cat.codes is not sufficient, as the pipeline should also work with real-life, fresh incoming data that might contain unknown labels.
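To make the shortfall concrete, here is a small sketch (the city names are invented for illustration) showing that an up-front pd.get_dummies yields mismatched column sets as soon as the incoming data contains an unseen label:

```python
import pandas as pd

train = pd.Series(['Paris', 'New York', 'Paris'])
test = pd.Series(['Paris', 'Utila'])  # 'Utila' never appeared during training

# The dummy columns are derived from the data itself, so the two
# encodings no longer line up once an unseen label shows up.
print(sorted(pd.get_dummies(train).columns))  # ['New York', 'Paris']
print(sorted(pd.get_dummies(test).columns))   # ['Paris', 'Utila']
```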
Would it be possible to use a CountVectorizer for this purpose?
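(For what it is worth, CountVectorizer does tolerate unseen values: tokens outside the fitted vocabulary are silently dropped at transform time. A minimal sketch of that behaviour, assuming single-word category values:)

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(['paris', 'london', 'paris'])

# 'tokyo' is not in the fitted vocabulary: its row comes out all zeros,
# and no exception is raised.
print(vec.transform(['london', 'tokyo']).toarray())
```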
Accepted answer by marbel
EDIT:
A more recent, simpler and better way of handling this problem with scikit-learn is to use the class sklearn.preprocessing.OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes categories unseen during fit as
# all-zero rows instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)
enc.transform(train).toarray()
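A quick self-contained sketch of that behaviour (toy data of my own, not from the answer; note how the unseen city encodes as an all-zero row):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'city': ['Paris', 'New York', 'Paris']})
test = pd.DataFrame({'city': ['New York', 'Utila']})  # 'Utila' is unseen

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# The unknown category becomes an all-zero row instead of raising an error.
print(enc.transform(test).toarray())
# [[1. 0.]
#  [0. 0.]]
```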
Old answer:
There are several answers that mention pandas.get_dummies as a method for this, but I feel the LabelEncoder approach is cleaner for implementing a model. Other similar answers mention using DictVectorizer for this, but converting the entire DataFrame to a dict is probably not a great idea.
Let's assume the following problematic columns:
from sklearn import preprocessing
import numpy as np
import pandas as pd
import bisect  # needed for insort_left further below

train = {'city': ['Buenos Aires', 'New York', 'Istambul', 'Buenos Aires', 'Paris', 'Paris'],
         'letters': ['a', 'b', 'c', 'd', 'a', 'b']}
train = pd.DataFrame(train)

test = {'city': ['Buenos Aires', 'New York', 'Istambul', 'Buenos Aires', 'Paris', 'Utila'],
        'letters': ['a', 'b', 'c', 'a', 'b', 'b']}
test = pd.DataFrame(test)
Utila is a rarer city that is not present in the training data but does appear in the test set; we can think of it as new data arriving at inference time.
The trick is converting this value to "other" and including "other" in the LabelEncoder object. Then we can reuse it in production.
c = 'city'
le = preprocessing.LabelEncoder()
train[c] = le.fit_transform(train[c])

# Map any label the encoder has not seen during fit to 'other'.
test[c] = test[c].map(lambda s: 'other' if s not in le.classes_ else s)

# Insert 'other' into the encoder's sorted class list so transform accepts it.
le_classes = le.classes_.tolist()
bisect.insort_left(le_classes, 'other')
le.classes_ = np.array(le_classes)

test[c] = le.transform(test[c])
test
test
city letters
0 1 a
1 3 b
2 2 c
3 1 a
4 4 b
5 0 b
To apply it to new data, all we need is to save a le object for each column, which can easily be done with Pickle.
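A minimal sketch of that persistence step (the encoders dict and the file name are my own naming, not part of the original answer):

```python
import pickle
import pandas as pd
from sklearn import preprocessing

train = pd.DataFrame({'city': ['Paris', 'New York'], 'letters': ['a', 'b']})

# Fit one LabelEncoder per column and keep them together in a dict.
encoders = {c: preprocessing.LabelEncoder().fit(train[c]) for c in train.columns}

# Persist the fitted encoders ...
with open('encoders.pkl', 'wb') as f:
    pickle.dump(encoders, f)

# ... and load them back later, e.g. in production.
with open('encoders.pkl', 'rb') as f:
    loaded = pickle.load(f)

# The reloaded encoder produces the same codes as at training time.
print(loaded['city'].transform(['Paris']))
```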
This answer is based on this question, which I felt wasn't totally clear, so I added this example.