Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2303098/

Date: 2020-08-13 05:53:00  Source: igfitidea

Java Open Source Text Mining Frameworks

Tags: java, frameworks, machine-learning, nlp, information-retrieval

Asked by David Campos

I want to know what the best open source Java-based framework for text mining is, one that supports both machine learning and dictionary methods.

I'm using Mallet, but there is not much documentation, and I do not know whether it will fit all my requirements.

Answered by Amro

Although not a specialized text mining framework, Weka has a number of classifiers commonly employed in text mining tasks, such as SVM, kNN, and multinomial Naive Bayes, among others.

It also has a few filters to work with textual data, like the StringToWordVector filter, which can perform a TF/IDF transformation.
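
To make the TF/IDF idea concrete, here is a minimal plain-Java sketch of the transformation. It is not Weka's StringToWordVector implementation: the class name `TfIdfSketch`, the naive tokenizer, and the weighting `tf * ln(N / df)` are all assumptions chosen for illustration, and Weka's filter has its own options and normalization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative TF-IDF sketch (hypothetical class, NOT Weka's StringToWordVector).
class TfIdfSketch {

    // Naive tokenizer (an assumption): lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns one sparse vector (term -> weight) per document,
    // where weight = termCount * ln(numberOfDocs / docsContainingTerm).
    static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String token : tokenize(doc)) {
                tf.merge(token, 1, Integer::sum);
            }
            counts.add(tf);
            // Each distinct term counts once per document toward its document frequency.
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                vector.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "text mining with Java",
                "machine learning with Java",
                "text text classification");
        List<Map<String, Double>> vectors = tfIdf(docs);
        for (int i = 0; i < docs.size(); i++) {
            System.out.println(docs.get(i) + " -> " + vectors.get(i));
        }
    }
}
```

With this weighting, a term that occurs in every document gets weight 0, while a term concentrated in a few documents gets a higher weight; the resulting sparse vectors are the kind of input a classifier would consume.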

Check out the Weka wiki website for more information.

Answered by Pascal Thivent

Answered by Steve

We use Lucene to process live streams from the internet. It has a native Java API.

http://lucene.apache.org/java/docs/

You can then use Mahout, a collection of machine learning algorithms that operate on top of Lucene.

http://lucene.apache.org/mahout/

Answered by João Silva

I've used LingPipe, a suite of Java libraries for the linguistic analysis of human language, for text mining (and other related) tasks.

It is a very well documented software package, and the site contains several tutorials that thoroughly explain how to do a given task with LingPipe, such as named entity recognition. There is also a newsgroup where you can post any question you have about the software (or NLP-related tasks) and get a prompt reply from the authors of the package themselves; and of course, a blog.

The source code is also very easy to follow and well documented which, for me, is always a big plus.

As for machine learning algorithms, there are plenty, from Naïve Bayes to Conditional Random Fields. For dictionary matching, on the other hand, they have an ExactDictionaryChunker, which is an implementation of the Aho-Corasick algorithm (a very, very fast algorithm for this task).
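
For readers unfamiliar with Aho-Corasick, the following is a minimal plain-Java sketch of the algorithm: it builds a trie from the dictionary, adds failure links with a breadth-first pass, and then finds every dictionary entry in a text in a single scan. It is not LingPipe's code or API; the class `AhoCorasickSketch`, its `findAll` method, and the `"start:entry"` result format are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative Aho-Corasick matcher (hypothetical class, NOT LingPipe's implementation).
class AhoCorasickSketch {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // entries ending at this state
        Node fail;                                      // longest proper suffix that is also a trie path
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> dictionary) {
        // Step 1: insert every non-empty dictionary entry into the trie.
        for (String entry : dictionary) {
            if (entry.isEmpty()) {
                continue;
            }
            Node node = root;
            for (char c : entry.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(entry);
        }
        // Step 2: compute failure links breadth-first, so shallower states are finished first.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit matches that end at the failure state (shorter entries that are suffixes).
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns every match as "startIndex:entry".
    List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node nextNode = node.next.get(c);
            node = (nextNode != null) ? nextNode : root;
            for (String hit : node.outputs) {
                matches.add((i - hit.length() + 1) + ":" + hit);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        AhoCorasickSketch matcher =
                new AhoCorasickSketch(Arrays.asList("he", "she", "his", "hers"));
        System.out.println(matcher.findAll("ushers")); // prints [1:she, 2:he, 2:hers]
    }
}
```

The appeal for dictionary methods is that the scan time depends on the length of the text plus the number of matches, not on the size of the dictionary, so even very large term lists can be matched in one pass.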

In sum, I think it is one of the best NLP software packages for Java (I haven't used every single package out there, so I can't say it's the best), and I definitely recommend it for the task you have at hand.

Answered by PSpeed

You may already know about GATE: http://gate.ac.uk/

...but that's what we've used (at my day job) for lots of different text mining problems. It's pretty flexible and open.

Answered by paul

I once built a maximum entropy named entity recognizer for CoNLL data using OpenNLP MaxEnt (http://sourceforge.net/projects/maxent/) for a course.

It required a lot of data preprocessing with custom Perl scripts to get all the features extracted into nice, neat numerical vectors, though.

Answered by David Campos

I honestly think that the answers presented here are very good. However, to fulfill my requirements I have chosen to use Apache UIMA with ClearTK. It supports several ML methods, and I do not have any licensing problems. Plus, I can write wrappers for other ML methodologies, and I take advantage of the UIMA framework, which is very well organized and fast.

Thank you all for your interesting answers.

Best Regards, ukrania
