Python Keras: class weights (class_weight) for one-hot encoding

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/43481490/

Date: 2020-08-19 23:07:02  Source: igfitidea

Keras: class weights (class_weight) for one-hot encoding

Tags: python, keras

Asked by Naoto Usuyama

I'd like to use the class_weight argument in Keras model.fit to handle imbalanced training data. By looking at some documents, I understood we can pass a dictionary like this:


class_weight = {0 : 1,
    1: 1,
    2: 5}

(In this example, class-2 will get a higher penalty in the loss function.)

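For context, here is a minimal sketch of how such a dictionary is normally passed to fit for plain integer labels. The model, shapes, and data below are hypothetical and only for illustration (tf.keras is assumed):

import numpy as np
from tensorflow import keras

# Hypothetical toy model with 3 output classes
model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(10,)),
    keras.layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

x = np.random.rand(100, 10)             # dummy features
y = np.random.randint(0, 3, size=100)   # dummy integer labels 0, 1, 2

# Samples of class 2 contribute 5x more to the loss
model.fit(x, y, epochs=1, class_weight={0: 1, 1: 1, 2: 5})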

The problem is that my network's output is one-hot encoded, i.e. class-0 = (1, 0, 0), class-1 = (0, 1, 0), and class-2 = (0, 0, 1).


How can we use the class_weight for one-hot encoded output?


By looking at some code in Keras, it looks like _feed_output_names contains a list of output classes, but in my case model.output_names/model._feed_output_names returns ['dense_1'].


Related: How to set class weights for imbalanced classes in Keras?


Accepted answer by Naoto Usuyama

I guess we can use sample_weights instead. Inside Keras, class_weights are actually converted to sample_weights.


sample_weight: optional array of the same length as x, containing weights to apply to the model's loss for each sample. In the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. In this case you should make sure to specify sample_weight_mode="temporal" in compile().


https://github.com/fchollet/keras/blob/d89afdfd82e6e27b850d910890f4a4059ddea331/keras/engine/training.py#L1392

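A minimal sketch of that idea, assuming a one-hot encoded label array y, input data x, an already-compiled Keras model, and the class_weight dictionary from the question (all of these names are assumptions here): the per-sample weight is simply looked up from each sample's class index.

import numpy as np

class_weight = {0: 1, 1: 1, 2: 5}

# y is assumed to be one-hot encoded with shape (n_samples, n_classes);
# argmax recovers the integer class index of each row
y_integers = y.argmax(axis=1)
sample_weights = np.array([class_weight[c] for c in y_integers])

# model and x are assumed to exist; one weight per training sample
model.fit(x, y, sample_weight=sample_weights, epochs=10)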

Answered by Melissa

Here's a solution that's a bit shorter and faster. If your one-hot encoded y is a NumPy array:


import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Convert one-hot rows back to integer class labels
y_integers = np.argmax(y, axis=1)
# Newer scikit-learn versions require keyword arguments here
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_integers), y=y_integers)
d_class_weights = dict(enumerate(class_weights))

d_class_weights can then be passed to class_weight in .fit.

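For example (x and an already-compiled Keras model are assumed here), roughly:

# d_class_weights maps column index -> weight, which is the form class_weight expects
model.fit(x, y, epochs=10, class_weight=d_class_weights)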

Answered by tw0000

A little bit of a convoluted answer, but the best I've found so far. This assumes your data is one-hot encoded and multi-class, and works only on the labels DataFrame df_y:


import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Create a pd.Series that represents the categorical class of each one-hot encoded row
y_classes = df_y.idxmax(axis=1, skipna=False)

# Instantiate the label encoder
le = LabelEncoder()

# Fit the label encoder to our label series
le.fit(list(y_classes))

# Create integer-based labels Series
y_integers = le.transform(list(y_classes))

# Create dict of labels : integer representation
labels_and_integers = dict(zip(y_classes, y_integers))

# Newer scikit-learn versions require keyword arguments here
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_integers), y=y_integers)
sample_weights = compute_sample_weight('balanced', y_integers)

class_weights_dict = dict(zip(le.transform(list(le.classes_)), class_weights))

This results in a sample_weights vector computed to balance an imbalanced dataset, which can be passed to the Keras sample_weight argument, and a class_weights_dict that can be fed to the Keras class_weight argument of the .fit method. You don't really want to use both, just choose one. I'm using class_weight right now because it's complicated to get sample_weight working with fit_generator.

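A rough sketch of the two options, assuming x, a one-hot y, and an already-compiled Keras model; pick one of the two keyword arguments, not both:

# Option 1: per-class weights (dict keyed by integer column index)
model.fit(x, y, epochs=10, class_weight=class_weights_dict)

# Option 2: per-sample weights (one weight per training row)
model.fit(x, y, epochs=10, sample_weight=sample_weights)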

Answered by pglaser

In _standardize_weights, Keras does:


if y.shape[1] > 1:
    y_classes = y.argmax(axis=1)

So basically, if you choose to use one-hot encoding, the classes are the column indices.

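A quick way to see this for yourself, with plain NumPy and no Keras needed:

import numpy as np

y = np.array([[1, 0, 0],   # class 0
              [0, 1, 0],   # class 1
              [0, 0, 1]])  # class 2

# Mirrors what Keras does internally: the class is the column index
y_classes = y.argmax(axis=1)
print(y_classes)  # [0 1 2]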

You may also ask yourself how you can map the column index back to the original classes of your data. Well, if you use the LabelBinarizer class of scikit-learn to perform one-hot encoding, the column index follows the order of the unique labels computed by the .fit function. The doc says:


Extract an ordered array of unique labels


Example:


from sklearn.preprocessing import LabelBinarizer

y = [4, 1, 2, 8]
l = LabelBinarizer()
y_transformed = l.fit_transform(y)
y_transformed
> array([[0, 0, 1, 0],
         [1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 1]])
l.classes_
> array([1, 2, 4, 8])

In conclusion, the keys of the class_weights dictionary should reflect the order of the classes_ attribute of the encoder.

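Continuing the example above, a small sketch of building the dictionary so that its keys follow the column order produced by the binarizer (the per-label weights here are made up for illustration):

# l.classes_ is array([1, 2, 4, 8]), i.e. the original label for each column
# Suppose these are the weights we want for the original labels
weights_by_label = {1: 1.0, 2: 2.0, 4: 0.5, 8: 4.0}

# Map column index -> weight, following the order of l.classes_
class_weights = {i: weights_by_label[label] for i, label in enumerate(l.classes_)}
# {0: 1.0, 1: 2.0, 2: 0.5, 3: 4.0}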