
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35400065/

Date: 2020-08-19 16:23:41

Multilabel Text Classification using TensorFlow

Tags: python, tensorflow, text-classification, multilabel-classification

Asked by Benben

The text data is organized as a vector with 20,000 elements, like [2, 1, 0, 0, 5, ..., 0]. The i-th element indicates the frequency of the i-th word in a text.

The ground truth label data is also represented as a vector with 4,000 elements, like [0, 0, 1, 0, 1, ..., 0]. The i-th element indicates whether the i-th label is a positive label for a text. The number of labels per text varies.

I have code for single-label text classification.

How can I edit the following code for multilabel text classification?

In particular, I would like to know the following points.

  • How to compute accuracy using TensorFlow.
  • How to set a threshold that decides whether a label is positive or negative. For instance, if the output is [0.80, 0.43, 0.21, 0.01, 0.32] and the ground truth is [1, 1, 0, 0, 1], the labels with scores over 0.25 should be judged as positive.

Thank you.


import tensorflow as tf

# hidden Layer
class HiddenLayer(object):
    def __init__(self, input, n_in, n_out):
        self.input = input

        w_h = tf.Variable(tf.random_normal([n_in, n_out], mean = 0.0, stddev = 0.05))
        b_h = tf.Variable(tf.zeros([n_out]))

        self.w = w_h
        self.b = b_h
        self.params = [self.w, self.b]

    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b
        return tf.nn.relu(linarg)

# output Layer
class OutputLayer(object):
    def __init__(self, input, n_in, n_out):
        self.input = input

        w_o = tf.Variable(tf.random_normal([n_in, n_out], mean = 0.0, stddev = 0.05))
        b_o = tf.Variable(tf.zeros([n_out]))

        self.w = w_o
        self.b = b_o
        self.params = [self.w, self.b]

    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b
        return tf.nn.relu(linarg)

# model (x and y_ are assumed to be input/label placeholders defined elsewhere)
def model():
    h_layer = HiddenLayer(input = x, n_in = 20000, n_out = 1000)
    o_layer = OutputLayer(input = h_layer.output(), n_in = 1000, n_out = 4000)

    # loss function
    out = o_layer.output()
    cross_entropy = -tf.reduce_sum(y_*tf.log(out + 1e-9), name='xentropy')    

    # regularization
    l2 = (tf.nn.l2_loss(h_layer.w) + tf.nn.l2_loss(o_layer.w))
    lambda_2 = 0.01

    # compute loss
    loss = cross_entropy + lambda_2 * l2

    # compute accuracy for single-label classification task
    correct_pred = tf.equal(tf.argmax(out, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, "float"))

    return loss, accuracy

Accepted answer by Alok Nayak

Change the relu of the output layer to sigmoid, and modify the cross-entropy loss to the explicit mathematical formula for sigmoid cross-entropy loss (the explicit loss worked in my case/version of TensorFlow).

import tensorflow as tf

# hidden Layer
class HiddenLayer(object):
    def __init__(self, input, n_in, n_out):
        self.input = input

        w_h = tf.Variable(tf.random_normal([n_in, n_out], mean = 0.0, stddev = 0.05))
        b_h = tf.Variable(tf.zeros([n_out]))

        self.w = w_h
        self.b = b_h
        self.params = [self.w, self.b]

    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b
        return tf.nn.relu(linarg)

# output Layer
class OutputLayer(object):
    def __init__(self, input, n_in, n_out):
        self.input = input

        w_o = tf.Variable(tf.random_normal([n_in, n_out], mean = 0.0, stddev = 0.05))
        b_o = tf.Variable(tf.zeros([n_out]))

        self.w = w_o
        self.b = b_o
        self.params = [self.w, self.b]

    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b
        # changed relu to sigmoid
        return tf.nn.sigmoid(linarg)

# model (x and y_ are assumed to be input/label placeholders defined elsewhere)
def model():
    h_layer = HiddenLayer(input = x, n_in = 20000, n_out = 1000)
    o_layer = OutputLayer(input = h_layer.output(), n_in = 1000, n_out = 4000)

    # loss function
    out = o_layer.output()
    # modified cross entropy to the explicit formula for sigmoid cross-entropy loss
    cross_entropy = -tf.reduce_sum(y_ * tf.log(out + 1e-9) + (1 - y_) * tf.log(1 - out + 1e-9), name='xentropy')

    # regularization
    l2 = (tf.nn.l2_loss(h_layer.w) + tf.nn.l2_loss(o_layer.w))
    lambda_2 = 0.01

    # compute loss
    loss = cross_entropy + lambda_2 * l2

    # note: this accuracy is still the single-label (argmax) version;
    # for multilabel, compare thresholded outputs against y_ instead
    correct_pred = tf.equal(tf.argmax(out, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, "float"))

    return loss, accuracy

Answered by jorgemf

You have to use a variation of the cross-entropy function to support multilabel classification. With fewer than about a thousand outputs you should use sigmoid_cross_entropy_with_logits; in your case, with 4,000 outputs, you may consider candidate sampling, since it is faster.
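What sigmoid_cross_entropy_with_logits computes can be sketched in plain NumPy: a per-label sigmoid cross-entropy, equivalent to the explicit formula in the accepted answer but evaluated in the numerically stable form `max(x, 0) - x*z + log(1 + exp(-|x|))` that the TensorFlow op documents. The logits and labels below are made-up illustration values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_xent(logits, labels):
    # Numerically stable per-label sigmoid cross-entropy:
    # max(x, 0) - x*z + log(1 + exp(-|x|)),
    # the same form tf.nn.sigmoid_cross_entropy_with_logits uses.
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

# Hypothetical logits for 5 labels and a multilabel ground truth.
logits = np.array([2.0, -1.0, 0.5, -3.0, 0.0])
labels = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

# Naive formula (as written in the accepted answer), for comparison.
p = sigmoid(logits)
naive = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))

assert np.allclose(naive, sigmoid_xent(logits, labels))
```

The stable form matters in practice: applying sigmoid first and then taking logs (as the explicit formula does) can underflow for large-magnitude logits, which is why the fused op is preferred.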

How to compute accuracy using TensorFlow.

如何使用 TensorFlow 计算准确度。

This depends on your problem and what you want to achieve. If you don't want to miss any object in an image, then if the classifier gets everything right except one label, you should count the whole image as an error. Alternatively, you can count each missed or misclassified object as an error. The latter is, I think, what sigmoid_cross_entropy_with_logits supports.
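The two accuracy conventions described above (whole-example error vs per-label error) can be sketched in NumPy with made-up, already-thresholded predictions:

```python
import numpy as np

# Hypothetical batch of thresholded 0/1 predictions and ground-truth labels.
pred = np.array([[1, 1, 0],
                 [0, 1, 1]])
truth = np.array([[1, 1, 0],
                  [0, 1, 0]])

# Per-label accuracy: every label position counts individually.
per_label = (pred == truth).mean()           # 5 of 6 labels correct

# Exact-match (subset) accuracy: an example counts only if ALL labels match.
exact = (pred == truth).all(axis=1).mean()   # 1 of 2 examples fully correct
```

In a TensorFlow graph these would be the same reductions (`tf.reduce_mean`, `tf.reduce_all`) over `tf.equal(predictions, y_)`.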

How to set a threshold which judges whether a label is positive or negative. For instance, if the output is [0.80, 0.43, 0.21, 0.01, 0.32] and the ground truth is [1, 1, 0, 0, 1], the labels with scores over 0.25 should be judged as positive.


A threshold is one way to go; you have to decide which value to use. But that is something of a hack, not real multilabel classification. For that you need the functions mentioned above.