Python Keras: class weights (class_weight) for one-hot encoding
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/43481490/
Keras: class weights (class_weight) for one-hot encoding
Asked by Naoto Usuyama
I'd like to use the class_weight argument in keras model.fit to handle the imbalanced training data. By looking at some documents, I understood we can pass a dictionary like this:
class_weight = {0: 1,
                1: 1,
                2: 5}
(In this example, class-2 will get a higher penalty in the loss function.)
The problem is that my network's output is one-hot encoded, i.e. class-0 = (1, 0, 0), class-1 = (0, 1, 0), and class-2 = (0, 0, 1).
How can we use the class_weight for one-hot encoded output?
By looking at some code in Keras, it looks like _feed_output_names contains a list of output classes, but in my case, model.output_names/model._feed_output_names returns ['dense_1'].
Related: How to set class weights for imbalanced classes in Keras?
Accepted answer by Naoto Usuyama
I guess we can use sample_weights instead. Inside Keras, class_weights are actually converted to sample_weights.
sample_weight: optional array of the same length as x, containing weights to apply to the model's loss for each sample. In the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. In this case you should make sure to specify sample_weight_mode="temporal" in compile().
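A minimal sketch of that conversion, assuming a compiled Keras model named model, training arrays x_train and one-hot y_train (all placeholder names), and the example weights from the question:

import numpy as np

# example per-class weights from the question
class_weight = {0: 1, 1: 1, 2: 5}

# map each one-hot row back to its integer class, then look up that class's weight
y_integers = np.argmax(y_train, axis=1)
sample_weight = np.array([class_weight[c] for c in y_integers])

# pass the per-sample weights instead of class_weight
model.fit(x_train, y_train, sample_weight=sample_weight, epochs=10, batch_size=32)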
Answered by Melissa
Here's a solution that's a bit shorter and faster. If your one-hot encoded y is a np.array:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
y_integers = np.argmax(y, axis=1)
# newer scikit-learn versions require classes and y as keyword arguments
class_weights = compute_class_weight('balanced', classes=np.unique(y_integers), y=y_integers)
d_class_weights = dict(enumerate(class_weights))
d_class_weights can then be passed to class_weight in .fit.
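For reference, a small usage sketch (model, X, and y are placeholder names for an already-compiled Keras model and your training data):

# the integer keys of d_class_weights line up with the one-hot column indices
model.fit(X, y, class_weight=d_class_weights, epochs=10, batch_size=32)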
Answered by tw0000
A little bit of a convoluted answer, but the best I've found so far. This assumes your data is one-hot encoded and multi-class, and that you are working only on the labels DataFrame df_y:
import pandas as pd
import numpy as np
# Create a pd.series that represents the categorical class of each one-hot encoded row
y_classes = df_y.idxmax(axis=1, skipna=False)
from sklearn.preprocessing import LabelEncoder
# Instantiate the label encoder
le = LabelEncoder()
# Fit the label encoder to our label series
le.fit(list(y_classes))
# Create integer based labels Series
y_integers = le.transform(list(y_classes))
# Create dict of labels : integer representation
labels_and_integers = dict(zip(y_classes, y_integers))
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight
# newer scikit-learn versions require classes and y as keyword arguments
class_weights = compute_class_weight('balanced', classes=np.unique(y_integers), y=y_integers)
sample_weights = compute_sample_weight('balanced', y_integers)
class_weights_dict = dict(zip(le.transform(list(le.classes_)), class_weights))
This results in a sample_weights vector computed to balance an imbalanced dataset, which can be passed to the Keras sample_weight property, and a class_weights_dict that can be fed to the Keras class_weight property in the .fit method. You don't really want to use both, just choose one. I'm using class_weight right now because it's complicated to get sample_weight working with fit_generator.
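If you do want sample_weight together with a generator, Keras generators may yield three-element tuples (inputs, targets, sample_weights) instead of two. A rough sketch under that assumption, reusing the sample_weights computed above (x_data is a placeholder for your feature array):

import numpy as np

def weighted_batch_generator(x_data, y_data, weights, batch_size=32):
    # yield (inputs, targets, sample_weights) triples, which fit_generator accepts
    n = len(x_data)
    while True:
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            yield x_data[batch], y_data[batch], weights[batch]

# e.g. model.fit_generator(weighted_batch_generator(x_data, df_y.values, sample_weights),
#                          steps_per_epoch=len(x_data) // 32, epochs=10)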
Answered by pglaser
In _standardize_weights, Keras does:
if y.shape[1] > 1:
    y_classes = y.argmax(axis=1)
So basically, if you choose to use one-hot encoding, the classes are the column indices.
You may also ask yourself how you can map the column indices back to the original classes of your data. Well, if you use the LabelBinarizer class of scikit-learn to perform one-hot encoding, the column index maps to the order of the unique labels computed by the .fit function. The doc says
Extract an ordered array of unique labels
Example:
from sklearn.preprocessing import LabelBinarizer
y = [4, 1, 2, 8]
l = LabelBinarizer()
y_transformed = l.fit_transform(y)
y_transformed
> array([[0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 1]])
l.classes_
> array([1, 2, 4, 8])
As a conclusion, the keys of the class_weights dictionary should reflect the order in the classes_ attribute of the encoder.
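To make that concrete, a short sketch (the weights below are made-up values purely for illustration) that re-keys a per-label weight dictionary onto the one-hot column indices via classes_:

# made-up weights keyed by the original labels 1, 2, 4, 8
weights_by_label = {1: 1.0, 2: 1.0, 4: 2.0, 8: 5.0}

# re-key by column index, following the order of l.classes_ (array([1, 2, 4, 8]))
class_weight = {i: weights_by_label[label] for i, label in enumerate(l.classes_)}
# -> {0: 1.0, 1: 1.0, 2: 2.0, 3: 5.0}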