Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/2821575/

Date: 2020-10-29 22:59:50  Source: igfitidea

Java text classification problem

Tags: java, machine-learning, nlp, text-processing, classification

Asked by Youssef

I have a set of Book objects. The class Book is defined as follows:

import java.util.ArrayList;

class Book {
    String title;             // e.g. "Javascript for dummies"
    ArrayList<Tag> taglist;   // tags describing the book
}

Where title is the title of the book, for example: Javascript for dummies.

and taglist is a list of tags. For our example: Javascript, jquery, "web dev", ...

As I said, I have a set of books covering different subjects: IT, BIOLOGY, HISTORY, ... Each book has a title and a set of tags describing it.

I have to automatically classify those books into separate sets by topic, for example:

IT BOOKS:

  • Java for dummies
  • Javascript for dummies
  • Learn flash in 30 days
  • C++ programming

HISTORY BOOKS:

  • World wars
  • America in 1960
  • Martin Luther King's life

BIOLOGY BOOKS:

  • ....
  • ....

Do you know of a classification algorithm or method suited to this kind of problem?

One solution is to use an external API to determine the category of the text, but the problem here is that the books are in different languages: French, Spanish, English, ...

Answered by dmcer

This looks like a reasonably straightforward keyword-based classification task. Since you're using Java, good packages to consider for this would be Classifier4J, Weka, or Lucene Mahout.

Classifier4J

Classifier4J supports classification using naive Bayes and a vector space model.

As seen in this source code snippet on training and scoring using its naive Bayes classifier, the package is reasonably easy to use. It's also distributed under the liberal Apache Software License.
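
To make the naive Bayes idea concrete for the book and tag setup in the question, here is a minimal sketch in plain Java. It does not use Classifier4J's API; the class name TagNaiveBayes and its methods are hypothetical, and it assumes tags are plain strings and that a few books per topic have already been labeled by hand.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Classifier4J: multinomial naive Bayes over tags.
class TagNaiveBayes {
    private final Map<String, Integer> docsPerTopic = new HashMap<>();
    private final Map<String, Map<String, Integer>> tagCounts = new HashMap<>();
    private final Map<String, Integer> tagsPerTopic = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Record one labeled book: its topic and its tags (lower-cased).
    void train(String topic, List<String> tags) {
        totalDocs++;
        docsPerTopic.merge(topic, 1, Integer::sum);
        Map<String, Integer> counts = tagCounts.computeIfAbsent(topic, k -> new HashMap<>());
        for (String tag : tags) {
            String t = tag.toLowerCase();
            counts.merge(t, 1, Integer::sum);
            tagsPerTopic.merge(topic, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    // Return the topic with the highest log-posterior, or null if nothing was trained.
    String classify(List<String> tags) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String topic : docsPerTopic.keySet()) {
            // Log prior: share of training books that belong to this topic.
            double score = Math.log((double) docsPerTopic.get(topic) / totalDocs);
            Map<String, Integer> counts = tagCounts.get(topic);
            int total = tagsPerTopic.getOrDefault(topic, 0);
            for (String tag : tags) {
                int c = counts.getOrDefault(tag.toLowerCase(), 0);
                // Laplace (add-one) smoothing so an unseen tag does not zero out a topic.
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = topic;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TagNaiveBayes nb = new TagNaiveBayes();
        nb.train("IT", Arrays.asList("javascript", "jquery", "web dev"));
        nb.train("IT", Arrays.asList("java", "programming"));
        nb.train("HISTORY", Arrays.asList("world war", "europe"));
        nb.train("HISTORY", Arrays.asList("america", "1960"));
        System.out.println(nb.classify(Arrays.asList("jquery", "programming"))); // prints "IT"
    }
}
```

Because the sketch only counts tag strings, it is indifferent to the language the tags are written in, provided the training examples contain tags in each language you expect to see.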

Weka

Weka is a very popular tool for data mining. An advantage of using it is that you can readily experiment with numerous machine learning models to categorize the books into topics, including naive Bayes, decision trees, support vector machines, k-nearest neighbor, logistic regression, and even a rule-set-based learner.

You'll find a tutorial on using Weka for text categorization here.

Weka is, however, distributed under the GPL. You won't be able to use it for closed source software that you want to distribute. But, you could still use it to back a web service.

Lucene Mahout

Mahout is designed for doing machine learning on very large datasets. It's built on top of Apache Hadoop and supports supervised classification using naive Bayes.

You'll find a tutorial covering how to use Mahout for text classification here.

Like Classifier4J, Mahout is distributed under the liberal Apache Software License.

Answered by Claudiu

Do you not want something as simple as this?

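// Requires java.util.ArrayList, HashMap, List, Map.
// Group books by tag; a book with several tags lands in several groups.
Map<Tag, List<Book>> m = new HashMap<>();
for (Book b : books) {
    for (Tag t : b.taglist) {
        // Create the list for a tag the first time it is seen.
        m.computeIfAbsent(t, k -> new ArrayList<>()).add(b);
    }
}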

Now m.get("IT") will return all IT books, etc.

Sure, some books will appear in multiple categories, but that happens in real life, too.

Answered by tylermac

So you are looking to make a Map of Tags that holds a Collection of Books?

EDIT:

Sounds like you might want to take a look at a Vector Space Model to classify the books into categories.

Either Lucene or Classifier4J offers a framework for this.
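
As a rough illustration of the vector space idea, the sketch below represents each book and each topic as a bag-of-tags vector and picks the topic with the highest cosine similarity. It is plain Java rather than the Lucene or Classifier4J API; the class TagVectorSpace, its method names, and the idea of one hand-built prototype vector per topic are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the Lucene or Classifier4J API.
class TagVectorSpace {
    // Turn a tag list into a sparse term-frequency vector (tags lower-cased).
    static Map<String, Integer> toVector(List<String> tags) {
        Map<String, Integer> v = new HashMap<>();
        for (String tag : tags) {
            v.merge(tag.toLowerCase(), 1, Integer::sum);
        }
        return v;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); 0 when either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int x : b.values()) {
            normB += x * x;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Return the topic whose prototype vector is most similar, or null if none overlaps.
    static String nearestTopic(List<String> tags, Map<String, Map<String, Integer>> topics) {
        Map<String, Integer> query = toVector(tags);
        String best = null;
        double bestSim = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : topics.entrySet()) {
            double sim = cosine(query, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> topics = new HashMap<>();
        topics.put("IT", toVector(Arrays.asList("java", "javascript", "jquery", "web dev")));
        topics.put("HISTORY", toVector(Arrays.asList("world war", "america", "1960")));
        System.out.println(nearestTopic(Arrays.asList("Javascript", "web dev"), topics)); // prints "IT"
    }
}
```

A real setup would usually weight terms (for example with TF-IDF) rather than use raw counts; raw counts are used here only to keep the sketch short.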

Answered by JRL

You might want to look up fuzzy matching algorithms such as Soundex and Levenshtein.
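
One way fuzzy matching might help here, as an assumption rather than something the answer spells out, is to treat near-identical tags (typos, spacing variants) as the same tag before grouping or classifying. The sketch below is a standard dynamic-programming Levenshtein edit distance in plain Java; the class name Levenshtein is just for illustration.

```java
// Illustrative helper: classic Levenshtein edit distance using two rolling rows.
class Levenshtein {
    // Minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a yields the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            // Swap rows so prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("javascript", "java script")); // prints 1
        System.out.println(distance("kitten", "sitting"));         // prints 3
    }
}
```

Tags within a small distance of each other (say 1 or 2, a threshold you would need to tune) could then be merged into one canonical tag.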