Java Naive Bayes Text Classification Algorithm
Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/27843177/
Naive Bayes Text Classification Algorithm
Asked by Java Nerd
Hi there! I just need help implementing the Naive Bayes text classification algorithm in Java to test my data set for research purposes. It is compulsory to implement the algorithm in Java rather than using the Weka or RapidMiner tools to get the results!
My data set has the following type of data:
Doc Words Category
This means that I know the training words and the category of each training string in advance. Part of the data set is given below:
Doc Words Category
Training
1 Integration Communities Process Oriented Structures...(more string) A
2 Integration Communities Process Oriented Structures...(more string) A
3 Theory Upper Bound Routing Estimate global routing...(more string) B
4 Hardware Design Functional Programming Perfect Match...(more string) C
.
.
.
Test
5 Methodology Toolkit Integrate Technological Organisational
6 This test contain string naive bayes test text text test
So the data set comes from a MySQL database, and it may contain multiple training strings and test strings as well! The thing is, I just need to implement the Naive Bayes text classification algorithm in Java.
The algorithm should follow the example mentioned here: Table 13.1
Source: Read here
I can implement the algorithm in Java myself, but I would like to know whether there is a Java library, with source code and documentation available, that would let me just test the results.
I need the results only once; this is just a one-off test.
So, to come to the point: can somebody tell me about a good Java library that would help me code this algorithm in Java and process my data set to produce the results, or give me any good ideas on how to do it easily? Anything that can help.
I will be thankful for your help. Thanks in advance.
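For orientation, the kind of classifier the question describes can be sketched in plain Java without any library. This is a minimal sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, not the asker's code and not any library's API; the class name `NaiveBayesSketch`, the methods `train` and `classify`, whitespace tokenization, lower-casing, and ignoring words never seen in training are all assumptions made for illustration. The sample strings are shortened from the data set above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: multinomial Naive Bayes with add-one smoothing. */
public class NaiveBayesSketch {
    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Assumption: words are separated by whitespace and case is ignored.
    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    /** Adds one training document with a known category. */
    public void train(String category, String text) {
        totalDocs++;
        docsPerCategory.merge(category, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String w : tokenize(text)) {
            if (w.isEmpty()) continue;
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    /** Returns the category with the highest log-probability score. */
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docsPerCategory.keySet()) {
            // Log prior: fraction of training documents in this category.
            double score = Math.log(docsPerCategory.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            // Add-one smoothing: denominator is word total plus vocabulary size.
            double denom = totalWords.getOrDefault(category, 0) + vocabulary.size();
            for (String w : tokenize(text)) {
                if (!vocabulary.contains(w)) continue; // assumption: skip unseen words
                score += Math.log((counts.getOrDefault(w, 0) + 1) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("A", "Integration Communities Process Oriented Structures");
        nb.train("B", "Theory Upper Bound Routing Estimate global routing");
        nb.train("C", "Hardware Design Functional Programming Perfect Match");
        System.out.println(nb.classify("global routing theory"));
    }
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters once the documents get longer than these samples.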
Answered by Anurag
As per your requirement, you can use the machine learning library MLlib from Apache. MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities. There is also a Java code template for implementing the algorithm with the library. So, to begin with, you can:
Implement the Java skeleton for Naive Bayes provided on their site, as given below.
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;
JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set

final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

JavaPairRDD<Double, Double> predictionAndLabel =
    test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
        @Override public Tuple2<Double, Double> call(LabeledPoint p) {
            return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
        }
    });
double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override public Boolean call(Tuple2<Double, Double> pl) {
        return pl._1().equals(pl._2());
    }
}).count() / (double) test.count();
For loading and testing your datasets, I know of no better option than Spark SQL. MLlib fits into Spark's APIs well. To get started, I would recommend going through the MLlib API first and implementing the algorithm according to your needs; this is fairly easy with the library. For the next step, processing your datasets, just use Spark SQL. I recommend sticking with this combination. I too looked at multiple options before settling on this easy-to-use library and its seamless interoperability with some other technologies. I would have posted the complete code here to fit your question, but I think you are good to go.
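The skeleton above leaves the construction of the training and test sets elided. MLlib's `LabeledPoint` pairs a numeric label with a feature vector, so the category strings and word strings from the question have to be converted to numbers first. The following is a plain-Java sketch of one way that conversion might look, using the hashing trick for term frequencies; it is not MLlib's implementation (MLlib ships its own `HashingTF` for this purpose), and the names `FeatureSketch`, `labelIndex`, and `hashedTermFrequencies` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: turning categories and text into numeric labels and features. */
public class FeatureSketch {

    /** Maps each category name to a numeric label (0.0, 1.0, ...) in first-seen order. */
    public static Map<String, Double> labelIndex(List<String> categories) {
        Map<String, Double> index = new LinkedHashMap<>();
        for (String c : categories) {
            if (!index.containsKey(c)) {
                index.put(c, (double) index.size());
            }
        }
        return index;
    }

    /** Term-frequency vector via the hashing trick: each word increments one bucket. */
    public static double[] hashedTermFrequencies(String text, int numFeatures) {
        double[] v = new double[numFeatures];
        for (String w : text.toLowerCase().trim().split("\\s+")) {
            if (w.isEmpty()) continue;
            // floorMod keeps the bucket index non-negative even for negative hash codes.
            v[Math.floorMod(w.hashCode(), numFeatures)] += 1.0;
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(labelIndex(List.of("A", "A", "B", "C")));
        double[] v = hashedTermFrequencies("This test contain string naive bayes test text text test", 32);
        System.out.println(java.util.Arrays.toString(v));
    }
}
```

The resulting counts are non-negative, which is what a multinomial Naive Bayes model expects of its features; different words can collide in the same bucket, so `numFeatures` should be large relative to the vocabulary.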
Answered by zoozoofreak
You can use the Weka Java API and include it in your project if you do not want to use the GUI.
Here's a link to the documentation on incorporating a classifier in your code: https://weka.wikispaces.com/Use+WEKA+in+your+Java+code
Answered by Taufiqur Rahman
If you want to implement the Naive Bayes text classification algorithm in Java, then the WEKA Java API is a better solution. The data set has to be in .arff format. Creating an .arff file from a MySQL database is very easy. Below are the Java code for the classifier and a link to a sample .arff file.
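Creating the .arff file from database rows might look roughly like the sketch below. It assumes the rows have already been read (for example over JDBC, not shown) into `{words, category}` pairs, and that category names are simple tokens such as A, B, C that need no quoting; the names `ArffSketch`, `toArff`, and `quote` are hypothetical. Note also that the text ends up as a single ARFF string attribute, which Weka classifiers such as NaiveBayes generally cannot use directly; it usually has to be converted to word features first, for instance with Weka's StringToWordVector filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: writing (words, category) rows as ARFF text. */
public class ArffSketch {

    /** Quotes a string value for ARFF, escaping backslashes and single quotes. */
    public static String quote(String s) {
        return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /** Builds ARFF content; each row is {words, category}. */
    public static String toArff(String relation, List<String[]> rows) {
        // Collect the distinct categories for the nominal class attribute.
        Set<String> categories = new TreeSet<>();
        for (String[] r : rows) {
            categories.add(r[1]);
        }
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute words string\n");
        sb.append("@attribute category {").append(String.join(",", categories)).append("}\n\n");
        sb.append("@data\n");
        for (String[] r : rows) {
            sb.append(quote(r[0])).append(",").append(r[1]).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Integration Communities Process Oriented Structures", "A"});
        rows.add(new String[] {"Theory Upper Bound Routing Estimate global routing", "B"});
        System.out.print(toArff("docs", rows));
    }
}
```

The class attribute is written last so that `train.numAttributes() - 1` is the class index when the file is loaded.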
Create a new text document and open it with Notepad. Copy and paste all the text from the link below, then save it as DataSet.arff: http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.arff
Download the Weka Java API: http://www.java2s.com/Code/Jar/w/weka.htm
Code for the classifier:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

// The class name is arbitrary; the original snippet showed only the main method.
public class NaiveBayesDemo {
    public static void main(String[] args) {
        // Read the ARFF file; try-with-resources closes the reader even on failure
        try (BufferedReader breader = new BufferedReader(new FileReader("DataSet.arff"))) {
            Instances train = new Instances(breader);
            // The last attribute is the class: with 40 attributes, the class index is 39
            train.setClassIndex(train.numAttributes() - 1);

            // Build the model on the full training set
            NaiveBayes nB = new NaiveBayes();
            nB.buildClassifier(train);

            // Estimate performance with 10-fold cross-validation
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(nB, train, 10, new Random(1));

            // Assemble the report once (e.g. for a text area), then print it
            StringBuilder txtAreaShow = new StringBuilder();
            txtAreaShow.append("Run Information\n===================\n");
            txtAreaShow.append("Scheme: " + nB.getClass().getName() + "\n");
            txtAreaShow.append("Relation: " + train.relationName() + "\n");
            txtAreaShow.append("\nClassifier Model (full training set)"
                    + "\n======================================\n");
            txtAreaShow.append(nB);
            txtAreaShow.append(eval.toSummaryString("\n\nSummary Results\n==================\n", true));
            txtAreaShow.append(eval.toClassDetailsString());
            txtAreaShow.append(eval.toMatrixString());
            System.out.println(txtAreaShow);
        } catch (FileNotFoundException ex) {
            System.err.println("File not found");
            System.exit(1);
        } catch (IOException ex) {
            System.err.println("Invalid input or output.");
            System.exit(1);
        } catch (Exception ex) {
            System.err.println("Exception occurred!");
            System.exit(1);
        }
    }
}
Answered by Rasmus Berg Palm
Answered by rajah9
Please take a look at the Bow toolkit.
It has a GNU license and its source code is available. Its features include:
Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
Performing test/train splits, and automatic classification tests.
It's not a Java library, but you could compile the C code to check that your Java implementation gives similar results for a given corpus.
I also spotted a decent Dr. Dobbs article that implements the algorithm in Perl. Once again, it is not the desired Java, but it will give you the one-time results that you are asking for.
Answered by guignol
Hi, I think Spark would help you a lot: http://spark.apache.org/docs/1.2.0/mllib-naive-bayes.html You can even choose whichever language you think is most appropriate to your needs: Java, Python, or Scala!
Answered by newbieee
You may want to take a look at this.
https://mahout.apache.org/users/classification/bayesian.html
Answered by sramij
Answered by leonprou
You can use an algorithm platform like KNIME; it has a variety of classification algorithms (Naive Bayes included). You can run it with a GUI or through its Java API.