Basic text classification with Weka in Java

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/9707825/

Tags: java, classification, weka, document-classification

Asked by joxxe

I'm trying to build a text classifier in Java with Weka. I have read some tutorials, and I'm now trying to build my own classifier.

I have the following categories:

    computer,sport,unknown 

and the following already trained data

 cs -> computer
 java -> computer
 soccer -> sport
 snowboard -> sport

So for example, if a user wants to classify the word java, it should return the category computer (no doubt, java only exists in that category!).

It does compile, but generates strange output.

The output is:

      ====== RESULT ======  CLASSIFIED AS:  [0.5769230769230769, 0.2884615384615385, 0.1346153846153846]
      ====== RESULT ======  CLASSIFIED AS:  [0.42857142857142855, 0.42857142857142855, 0.14285714285714285]

But the first text to classify is java, and it occurs only in the category computer, so the result should be

      [1.0 0.0 0.0] 

and the second one shouldn't be found at all, so it should be classified as unknown

      [0.0 0.0 1.0].

Here is the code:

    import java.io.FileNotFoundException;
    import java.io.Serializable;
    import java.util.Arrays;

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayesMultinomialUpdateable;
    import weka.core.Attribute;
    import weka.core.FastVector;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class TextClassifier implements Serializable {

        private static final long serialVersionUID = -1397598966481635120L;
        public static void main(String[] args) {
            try {
                TextClassifier cl = new TextClassifier(new NaiveBayesMultinomialUpdateable());
                cl.addCategory("computer");
                cl.addCategory("sport");
                cl.addCategory("unknown");
                cl.setupAfterCategorysAdded();

                //
                cl.addData("cs", "computer");
                cl.addData("java", "computer");
                cl.addData("soccer", "sport");
                cl.addData("snowboard", "sport");

                double[] result = cl.classifyMessage("java");
                System.out.println("====== RESULT ====== \tCLASSIFIED AS:\t" + Arrays.toString(result));

                result = cl.classifyMessage("asdasdasd");
                System.out.println("====== RESULT ======\tCLASSIFIED AS:\t" + Arrays.toString(result));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        private Instances trainingData;
        private StringToWordVector filter;
        private Classifier classifier;
        private boolean upToDate;
        private FastVector classValues;
        private FastVector attributes;
        private boolean setup;

        private Instances filteredData;

        public TextClassifier(Classifier classifier) throws FileNotFoundException {
            this(classifier, 10);
        }

        public TextClassifier(Classifier classifier, int startSize) throws FileNotFoundException {
            this.filter = new StringToWordVector();
            this.classifier = classifier;
            // Create vector of attributes.
            this.attributes = new FastVector(2);
            // Add attribute for holding texts.
            this.attributes.addElement(new Attribute("text", (FastVector) null));
            // Add class attribute.
            this.classValues = new FastVector(startSize);
            this.setup = false;

        }

        public void addCategory(String category) {
            category = category.toLowerCase();
            // if required, double the capacity.
            int capacity = classValues.capacity();
            if (classValues.size() > (capacity - 5)) {
                classValues.setCapacity(capacity * 2);
            }
            classValues.addElement(category);
        }

        public void addData(String message, String classValue) throws IllegalStateException {
            if (!setup) {
                throw new IllegalStateException("Must use setup first");
            }
            message = message.toLowerCase();
            classValue = classValue.toLowerCase();
            // Make message into instance.
            Instance instance = makeInstance(message, trainingData);
            // Set class value for instance.
            instance.setClassValue(classValue);
            // Add instance to training data.
            trainingData.add(instance);
            upToDate = false;
        }

        /**
         * Check whether classifier and filter are up to date. Build if necessary.
         * @throws Exception
         */
        private void buildIfNeeded() throws Exception {
            if (!upToDate) {
                // Initialize filter and tell it about the input format.
                filter.setInputFormat(trainingData);
                // Generate word counts from the training data.
                filteredData = Filter.useFilter(trainingData, filter);
                // Rebuild classifier.
                classifier.buildClassifier(filteredData);
                upToDate = true;
            }
        }

        public double[] classifyMessage(String message) throws Exception {
            message = message.toLowerCase();
            if (!setup) {
                throw new Exception("Must use setup first");
            }
            // Check whether classifier has been built.
            if (trainingData.numInstances() == 0) {
                throw new Exception("No classifier available.");
            }
            buildIfNeeded();
            Instances testset = trainingData.stringFreeStructure();
            Instance testInstance = makeInstance(message, testset);

            // Filter instance.
            filter.input(testInstance);
            Instance filteredInstance = filter.output();
            return classifier.distributionForInstance(filteredInstance);

        }

        private Instance makeInstance(String text, Instances data) {
            // Create instance of length two.
            Instance instance = new Instance(2);
            // Set value for message attribute
            Attribute messageAtt = data.attribute("text");
            instance.setValue(messageAtt, messageAtt.addStringValue(text));
            // Give instance access to attribute information from the dataset.
            instance.setDataset(data);
            return instance;
        }

        public void setupAfterCategorysAdded() {
            attributes.addElement(new Attribute("class", classValues));
            // Create dataset with initial capacity of 100, and set index of class.
            trainingData = new Instances("MessageClassificationProblem", attributes, 100);
            trainingData.setClassIndex(trainingData.numAttributes() - 1);
            setup = true;
        }

    }
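
As an aside, the code above targets the older Weka 3.6 API (FastVector, new Instance(2)). On Weka 3.7 and later, FastVector is deprecated in favour of java.util.ArrayList, and DenseInstance replaces the old Instance constructor. The following is only a rough sketch of how the dataset setup might look with that newer API, assuming you are on such a version; it is not part of the original question:

    import java.util.ArrayList;

    import weka.core.Attribute;
    import weka.core.DenseInstance;
    import weka.core.Instance;
    import weka.core.Instances;

    public class NewApiSketch {
        public static void main(String[] args) {
            // A string attribute for the message text (a null value list marks it as a string attribute).
            ArrayList<Attribute> attributes = new ArrayList<Attribute>();
            attributes.add(new Attribute("text", (ArrayList<String>) null));

            // A nominal class attribute holding the three categories.
            ArrayList<String> classValues = new ArrayList<String>();
            classValues.add("computer");
            classValues.add("sport");
            classValues.add("unknown");
            attributes.add(new Attribute("class", classValues));

            Instances trainingData = new Instances("MessageClassificationProblem", attributes, 100);
            trainingData.setClassIndex(trainingData.numAttributes() - 1);

            // DenseInstance replaces the old "new Instance(2)" constructor.
            Instance instance = new DenseInstance(2);
            instance.setDataset(trainingData);
            Attribute textAtt = trainingData.attribute("text");
            instance.setValue(textAtt, textAtt.addStringValue("java"));
            instance.setClassValue("computer");
            trainingData.add(instance);
        }
    }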

By the way, I found a good page:

http://www.hakank.org/weka/TextClassifierApplet3.html

Accepted answer by Lars Kotthoff

The Bayes classifier gives you a (weighted) probability that a word belongs to a category. This will almost never be exactly 0 or 1. You can either set a hard cutoff (e.g. 0.5) and decide class membership based on that, or inspect the calculated probabilities and decide from those (i.e. map the highest to 1 and the lowest to 0).

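For example, here is a minimal sketch of mapping the distribution returned by classifyMessage() to a single label; the pickLabel helper, the label order and the 0.5 cutoff are illustrative assumptions, not part of the original code:

    // Sketch: choose the most probable category from the distribution and
    // fall back to "unknown" when even the best score is below a cutoff.
    // Assumes the labels are listed in the order the categories were added.
    static String pickLabel(double[] dist, String[] labels, double cutoff) {
        int best = 0;
        for (int i = 1; i < dist.length; i++) {
            if (dist[i] > dist[best]) {
                best = i;
            }
        }
        return dist[best] >= cutoff ? labels[best] : "unknown";
    }

    // Possible usage inside main(), after training:
    // double[] dist = cl.classifyMessage("java");
    // String predicted = pickLabel(dist, new String[] { "computer", "sport", "unknown" }, 0.5);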

Answered by Ao.Shen

If you want a definitive class instead of a distribution, try switching

return classifier.distributionForInstance(filteredInstance);

to

return classifier.classifyInstance(filteredInstance);

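Note that classifyInstance() returns the index of the predicted class as a double, so it still has to be mapped back to a label. A rough sketch of how that could look as an extra method on the TextClassifier from the question (the name classifyMessageAsLabel is made up here for illustration):

    // Sketch: like classifyMessage(), but returns the predicted category name.
    // classifyInstance() yields the class index, which is resolved to its
    // string value via the class attribute.
    public String classifyMessageAsLabel(String message) throws Exception {
        message = message.toLowerCase();
        if (!setup) {
            throw new Exception("Must use setup first");
        }
        buildIfNeeded();
        Instances testset = trainingData.stringFreeStructure();
        Instance testInstance = makeInstance(message, testset);
        filter.input(testInstance);
        Instance filteredInstance = filter.output();
        double classIndex = classifier.classifyInstance(filteredInstance);
        return trainingData.classAttribute().value((int) classIndex);
    }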

Answered by harry

I thought I would just offer that you can do most such text classification work without any coding by downloading and using LightSIDE from http://lightsidelabs.com. This open source Java package includes WEKA and ships for both Windows and Mac. It can process most WEKA-friendly data sets with great flexibility, lets you iterate through various models, settings and parameters, and provides good support for snapshots, so you can save your data, models and classification results at any point until you have built a model you are happy with. The product proved itself in the ASAP competition on Kaggle.com last year and is getting a lot of traction. Of course there are always reasons people want or need to "roll their own", but even as a sanity check, knowing about and using LightSIDE can be very handy if you are programming WEKA solutions.
