Getting ValueError: y contains new labels when using scikit learn's LabelEncoder
Original question: http://stackoverflow.com/questions/46288517/
Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Asked by Xavier
I have a series like:
df['ID'] = ['ABC123', 'IDF345', ...]
I'm using scikit-learn's LabelEncoder to convert it to numerical values to be fed into the RandomForestClassifier.
During the training, I'm doing as follows:
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)
But now, for testing/prediction, when I pass in new data, I want to transform the 'ID' column of this data based on le_id, i.e., if the same values are present, transform them according to the above label encoder; otherwise, assign a new numerical value.
In the test file, I was doing as follows:
new_df['ID'] = le_id.transform(new_df.ID)
But I'm getting the following error: ValueError: y contains new labels
How do I fix this? Thanks!
UPDATE:
So the task I have is to use the data below (for example) as training data and predict the 'High', 'Mod', 'Low' values for new BankNum, ID combinations. The model should learn from the training dataset the characteristics under which a 'High' is given and under which a 'Low' is given. For example, below, a 'High' is given when there are multiple entries with the same BankNum and different IDs.
df =
BankNum | ID | Labels
0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low
And then predict it on something like:
BankNum | ID |
00982222 | AB999 |
00982222 | AB999 |
00981111 | AB890 |
I'm doing something like this:
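import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
df['BankNum'] = df.BankNum.astype(np.float128)
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)
X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID']], df.Labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42, n_estimators=140)
clf.fit(X_train, y_train)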
Answered by zimmerrol
I think the error message is very clear: your test dataset contains ID labels which have not been included in your training data set. For these items, the LabelEncoder cannot find a suitable numeric value to represent them. There are a few ways to solve this problem. You can try to balance your data set, so that you are sure that each label is present not only in your test data but also in your training data. Otherwise, you can try to follow one of the ideas presented here.
One possible solution is to search through your data set at the beginning, get a list of all unique ID values, train the LabelEncoder on this list, and keep the rest of your code just as it is at the moment.
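A minimal sketch of this first idea, assuming the new data is already available when the encoder is fitted. The frames and ID values below are made up for illustration, and all_ids is a hypothetical helper name:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and new data; the ID values are made up.
df = pd.DataFrame({"ID": ["AB123", "ED245", "ED343"]})
new_df = pd.DataFrame({"ID": ["AB999", "ED245"]})

# Collect every unique ID from both frames and fit the encoder on all of them.
all_ids = pd.concat([df["ID"], new_df["ID"]]).unique()
le_id = LabelEncoder()
le_id.fit(all_ids)

# Both frames can now be transformed without "y contains new labels".
train_codes = le_id.transform(df["ID"])
new_codes = le_id.transform(new_df["ID"])
```

This only helps when the full set of IDs is known up front; IDs that first appear after fitting would still raise the same error.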
Another possible solution is to check that the test data contain only labels which have been seen in the training process. If there is a new label, you have to set it to some fallback value like unknown_id (or something like this). Doing this, you put all new, unknown IDs in one class; for these items the prediction will then fail, but you can use the rest of your code as it is now.
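The fallback idea could be sketched roughly as follows. This is only an illustration: the UNKNOWN constant, the sample frames, and the choice to add the fallback token to the training labels before fitting are assumptions, not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "unknown_id"  # hypothetical fallback token

# Hypothetical training and new data; the ID values are made up.
df = pd.DataFrame({"ID": ["AB123", "ED245", "ED343"]})
new_df = pd.DataFrame({"ID": ["AB999", "ED245"]})

# Fit on the training IDs plus the fallback token so it gets its own code.
le_id = LabelEncoder()
le_id.fit(list(df["ID"]) + [UNKNOWN])

# Keep IDs the encoder has seen; replace everything else with the fallback.
seen = new_df["ID"].isin(le_id.classes_)
cleaned = new_df["ID"].where(seen, UNKNOWN)
new_codes = le_id.transform(cleaned)
```

All unseen IDs end up sharing one code, so the model cannot tell them apart, as the answer notes.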
Answered by Yury Wallet
You can try the solution from "sklearn.LabelEncoder with never seen before values" (https://stackoverflow.com/a/48169252/9043549). The idea is to create a dictionary from the classes, then map the column and fill new classes with some "known value".
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
suf="_le"
col="a"
df[col+suf] = le.fit_transform(df[col])
dic = dict(zip(le.classes_, le.transform(le.classes_)))
col='b'
df[col+suf]=df[col].map(dic).fillna(dic["c"]).astype(int)
Answered by Marco Cerliani
In this way, you can map all the unseen labels in your test/unseen data to 0:
for feat in ['BankNum', 'ID']:
    lbe = LabelEncoder()
    lbe.fit(X_train[feat].values)
    diz_map_train = dict(zip(lbe.classes_, lbe.transform(lbe.classes_)+1))
    for i in set(X_test[feat]).difference(X_train[feat]):
        diz_map_train[i] = 0
    X_train[feat] = [diz_map_train[i] for i in X_train[feat].values]
    X_test[feat] = [diz_map_train[i] for i in X_test[feat].values]
Answered by Arun Ganesan
I used
le.fit_transform(Col)
and I was able to resolve the issue. It does both the fit and the transform, so we don't need to worry about unknown values in the test split.
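One thing worth checking with this approach: calling fit_transform a second time refits the encoder, so the integer assigned to a given label may differ from the one used during training. A minimal sketch with made-up ID values:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical ID values for illustration only.
train_ids = ["AB123", "ED245", "ED343"]
test_ids = ["ED245", "ZZ999"]

le = LabelEncoder()
train_codes = le.fit_transform(train_ids)  # classes: AB123, ED245, ED343
test_codes = le.fit_transform(test_ids)    # refit; classes: ED245, ZZ999

# The same label "ED245" receives different codes in the two calls.
code_in_train = train_codes[train_ids.index("ED245")]
code_in_test = test_codes[test_ids.index("ED245")]
```

Here the error disappears, but a model trained on the first encoding would see "ED245" under a different number at prediction time.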