Python: Getting "ValueError: y contains new labels" when using scikit-learn's LabelEncoder

Question

Asked by Xavier

I have a series like:

df['ID'] = ['ABC123', 'IDF345', ...]

I'm using scikit-learn's LabelEncoder to convert it to numerical values to be fed into the RandomForestClassifier.

During the training, I'm doing as follows:

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)

But now, for testing/prediction, when I pass in new data, I want to transform the 'ID' from this data based on le_id, i.e., if the same values are present, transform them according to the above label encoder; otherwise, assign a new numerical value.

In the test file, I was doing as follows:

new_df['ID'] = le_id.transform(new_df.ID)

But, I'm getting the following error: ValueError: y contains new labels

How do I fix this? Thanks!

UPDATE:

So the task I have is to use the data below (for example) as training data and predict the 'High', 'Mod', 'Low' values for new BankNum, ID combinations. The model should learn from the training dataset the characteristics for which a 'High' is given and those for which a 'Low' is given. For example, below, a 'High' is given when there are multiple entries with the same BankNum and different IDs.

df = 

BankNum   | ID    | Labels

0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low

And then predict it on something like:

BankNum   |  ID | 

00982222  | AB999 | 
00982222  | AB999 |
00981111  | AB890 |

I'm doing something like this:

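import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df['BankNum'] = df.BankNum.astype(np.float128)

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)

X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID']], df.Labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42, n_estimators=140)
clf.fit(X_train, y_train)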

Answer 1

Answered by zimmerrol

I think the error message is very clear: your test dataset contains ID labels which were not included in your training dataset. For these items, the LabelEncoder cannot find a suitable numeric value to represent them. There are a few ways to solve this problem. You can try to balance your dataset, so that you are sure each label is present not only in your test data but also in your training data. Otherwise, you can try to follow one of the ideas presented here.

One possible solution is to search through your dataset at the beginning, get a list of all unique ID values, train the LabelEncoder on this list, and keep the rest of your code just as it is at the moment.
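The first suggestion above might look roughly like the following. This is only a minimal sketch: the two small frames are hypothetical stand-ins for the asker's df and new_df, the ID_enc column name is made up for illustration, and the approach only works when the IDs to be predicted on are already available at the time the encoder is fitted.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the training and prediction frames.
df = pd.DataFrame({"ID": ["AB123", "ED245", "GH564"]})
new_df = pd.DataFrame({"ID": ["AB999", "ED245"]})

# Fit the encoder on every unique ID from both frames,
# so that each value is guaranteed to have a code.
le_id = LabelEncoder()
le_id.fit(pd.concat([df["ID"], new_df["ID"]]).unique())

# Both frames can now be transformed without raising ValueError.
df["ID_enc"] = le_id.transform(df["ID"])
new_df["ID_enc"] = le_id.transform(new_df["ID"])
```

Note that an ID seen only at prediction time still carries no information learned during training; it merely gets a valid integer code.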

Another possible solution is to check that the test data contains only labels which were seen during training. If there is a new label, you have to set it to some fallback value like unknown_id (or something similar). Doing this, you put all new, unknown IDs in one class; for these items the prediction will then fail, but you can use the rest of your code as it is now.
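A minimal sketch of the fallback idea, under a few assumptions: the sample Series are invented for illustration, and the placeholder string "unknown_id" is added to the fitting data so that the encoder has a code reserved for it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "unknown_id"  # hypothetical placeholder for unseen IDs

# Hypothetical training and test IDs.
train_ids = pd.Series(["AB123", "ED245", "GH564"])
test_ids = pd.Series(["AB999", "ED245"])

# Include the placeholder when fitting so that it gets its own code.
le_id = LabelEncoder()
le_id.fit(pd.concat([train_ids, pd.Series([UNKNOWN])]))

# Keep IDs the encoder knows; replace everything else with the placeholder.
test_clean = test_ids.where(test_ids.isin(le_id.classes_), UNKNOWN)
encoded = le_id.transform(test_clean)
```

All unseen IDs end up sharing a single code, so the model cannot distinguish between them, which matches the caveat in the answer.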

Answer 2

Answered by Yury Wallet

You can try the solution from "sklearn.LabelEncoder with never seen before values": https://stackoverflow.com/a/48169252/9043549. The idea is to create a dictionary from the classes, then map the column and fill new classes with some "known value".

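from sklearn.preprocessing import LabelEncoder

# df is assumed to be an existing DataFrame with columns "a" and "b"
le = LabelEncoder()
suf = "_le"
col = "a"
df[col + suf] = le.fit_transform(df[col])
# Map each class seen in column "a" to its integer code
dic = dict(zip(le.classes_, le.transform(le.classes_)))
col = "b"
# Values missing from the dictionary become NaN; fill them with the code of a known class ("c" here)
df[col + suf] = df[col].map(dic).fillna(dic["c"]).astype(int)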

Answer 3

Answered by Marco Cerliani

In this way, you can map all the unseen labels in your test/unseen data to 0:

for feat in ['BankNum', 'ID']:
    lbe = LabelEncoder()
    lbe.fit(X_train[feat].values)
    # Shift the training codes by 1 so that 0 stays free for unseen labels
    diz_map_train = dict(zip(lbe.classes_, lbe.transform(lbe.classes_)+1))

    # Map every label that appears only in the test data to 0
    for i in set(X_test[feat]).difference(X_train[feat]):
        diz_map_train[i] = 0

    X_train[feat] = [diz_map_train[i] for i in X_train[feat].values]
    X_test[feat] = [diz_map_train[i] for i in X_test[feat].values]
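As an alternative that none of the answers here use, scikit-learn's OrdinalEncoder can assign a fixed code to unseen categories by itself through its handle_unknown and unknown_value parameters. The sketch below assumes scikit-learn 0.24 or later, and the sample frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and test features.
X_train = pd.DataFrame({"BankNum": ["0098-7772", "0870-7771"],
                        "ID": ["AB123", "ED200"]})
X_test = pd.DataFrame({"BankNum": ["0098-7772", "0098-2222"],
                       "ID": ["AB999", "ED200"]})

# Categories not seen during fit are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train_enc = enc.fit_transform(X_train)
X_test_enc = enc.transform(X_test)
```

The effect is the same as the manual mapping above: every unseen value collapses into one reserved code (here -1 rather than 0).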

Answer 4

Answered by Arun Ganesan

I used

       le.fit_transform(Col)

and I was able to resolve the issue. It does both the fit and the transform, so we don't need to worry about unknown values in the test split.
