Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/41773751/

Date: 2020-08-20 01:35:17  Source: igfitidea

Working of labelEncoder in sklearn

python  machine-learning  scikit-learn  categorical-data

Asked by Neo

Say I have the following input feature:

hotel_id = [1, 2, 3, 2, 3]

This is a categorical feature with numeric values. If I give it to the model as it is, the model will treat it as a continuous variable, i.e., it will assume 2 > 1.

If I apply sklearn.preprocessing.LabelEncoder(), then I get:

hotel_id = [0, 1, 2, 1, 2] 

So is this encoded feature considered continuous or categorical? If it is treated as continuous, then what is the use of LabelEncoder()?

P.S. I know about one-hot encoding, but there are around 100 hotel_ids, so I don't want to use it. Thanks.

Answered by Tgsmith61591

The LabelEncoder is a way to encode class levels. In addition to the integer example you've included, consider the following example:

>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>>
>>> train = ["paris", "paris", "tokyo", "amsterdam"]
>>> test = ["tokyo", "tokyo", "paris"]
>>> le.fit(train).transform(test)
array([2, 2, 1])

What the LabelEncoder allows us to do, then, is to assign ordinal levels to categorical data. However, what you've noted is correct: the [2, 2, 1] is treated as numeric data. This is a good candidate for using the OneHotEncoder to create dummy variables (which I know you said you were hoping not to use).
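As a rough illustration of the dummy-variable idea, here is a minimal sketch that one-hot encodes the hotel_id values from the question. The variable names are only illustrative, and with around 100 distinct ids the result would have around 100 columns, which is why the encoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hotel_id values from the question, reshaped to a single column (n_samples, 1)
hotel_id = np.array([1, 2, 3, 2, 3]).reshape(-1, 1)

encoder = OneHotEncoder()  # produces a sparse matrix by default
one_hot = encoder.fit_transform(hotel_id).toarray()  # densify only for display

print(encoder.categories_)  # the distinct ids found during fit
print(one_hot)              # one column per distinct id, a single 1 per row
```

Each row now has exactly one non-zero entry, so no ordering between hotel ids is implied.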

Note that in scikit-learn versions before 0.20, the LabelEncoder must be used prior to one-hot encoding, as the OneHotEncoder in those versions cannot handle string-valued categorical data. Therefore, it is frequently used as a precursor to one-hot encoding.

Alternatively, it can encode your target into a usable array. If, for instance, train were your target for classification, you would need a LabelEncoder to use it as your y variable.
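A minimal sketch of that target-encoding use, reusing the city names from the example above. The variable name y is just an illustrative choice.

```python
from sklearn.preprocessing import LabelEncoder

# Same city names as in the example above, treated here as a classification target
train = ["paris", "paris", "tokyo", "amsterdam"]

le = LabelEncoder()
y = le.fit_transform(train)     # classes are sorted, then numbered from 0

print(le.classes_)              # ['amsterdam' 'paris' 'tokyo']
print(y)                        # [1 1 2 0]
print(le.inverse_transform(y))  # maps the integers back to the original strings
```

Keeping the fitted encoder around lets you turn a model's integer predictions back into readable labels with inverse_transform.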

Answered by simon

If you are running a classification model, then the labels are treated as classes and their order is ignored. You don't need to one-hot encode them.
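To make this concrete, here is a small sketch, assuming the labels in question are the target of a scikit-learn classifier. The toy feature values and the choice of DecisionTreeClassifier are made up for illustration. The same data is fitted with string labels and with two different integer codings.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration: one feature, three classes
X = np.array([[0.0], [0.2], [1.0], [2.0]])

y_str = np.array(["paris", "paris", "tokyo", "amsterdam"])  # raw string labels
y_int = np.array([1, 1, 2, 0])   # the coding a LabelEncoder would produce
y_perm = np.array([0, 0, 1, 2])  # a different, arbitrary integer coding

predictions = []
for y in (y_str, y_int, y_perm):
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    predictions.append(clf.predict(X))
    print(clf.classes_, predictions[-1])
```

In each case the classifier only learns which rows share a label. The numeric size of the codes in the target plays no role.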

Answered by J. Doe

One way to handle this problem is to convert your numbers to word labels with the inflect package.

So I went through all the hotel ids and changed them into words, for example 1 -> 'one', 2 -> 'two', ..., 99 -> 'ninety-nine'.

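import inflect

p = inflect.engine()

def toNominal(df, column):
    # Replace each number in the column with its English words, e.g. 2 -> 'two'.
    # int() converts NumPy integers to plain Python ints before the lookup.
    df[column] = df[column].map(lambda n: p.number_to_words(int(n)))

# df is assumed to be an existing pandas DataFrame with a numeric 'hotel_id' column
toNominal(df, 'hotel_id')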
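If the goal is only to stop the ids from looking numeric, a simpler variant (not the method used in this answer, just a sketch) is to cast the column to strings with pandas. The DataFrame below is a hypothetical stand-in built from the hotel_id values in the question. Whether a given model then treats string columns as categorical depends on the library, so that part is an assumption to check.

```python
import pandas as pd

# Hypothetical frame holding the hotel_id values from the question
df = pd.DataFrame({"hotel_id": [1, 2, 3, 2, 3]})

# Casting to str turns the ids into nominal labels without spelling them out
df["hotel_id"] = df["hotel_id"].astype(str)

print(df["hotel_id"].tolist())
```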