Is there an API for text analysis/mining in Java?

Question

Asked by Renato Dinhani

I want to know if there is an API for text analysis in Java: something that can extract all the words in a text, separate words from expressions, and so on, and that can tell whether a word it finds is a number, date, year, name, currency, etc.

I'm just starting out with text analysis, so I only need an API to get going. I've built a web crawler, and now I need something to analyze the downloaded data: methods to count the words in a page, find similar words, detect data types, and other text-related features.

Are there APIs for text analysis in Java?

EDIT: To be specific, I mean text mining. I want to mine the text, and I'm looking for a Java API that provides this.

Answer 1

Accepted answer by stemm

For example, you might use some classes from the standard library's java.text package, or use StreamTokenizer (which you can customize to your requirements). But as you know, text from internet sources usually contains many spelling mistakes, and for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities are too limited in that context.

So I'd advise you to use regular expressions (java.util.regex) and create your own kind of tokenizer according to your needs.
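
As a rough illustration of that advice, here is a minimal sketch of a hand-rolled tokenizer built on java.util.regex. The class name RegexTokenizer and the token rules (what counts as a word or a number) are assumptions made for this example, not something from the answer; you would adjust the pattern to your own data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the token rules below are illustrative assumptions.
class RegexTokenizer {
    // A token is either a number (digits, optionally grouped by '.' or ',')
    // or a word (letters, optionally joined by an apostrophe or hyphen).
    private static final Pattern TOKEN = Pattern.compile(
            "\\d+(?:[.,]\\d+)*|\\p{L}+(?:['-]\\p{L}+)*");

    // Returns the tokens in the order they appear in the text.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Counts how often each token occurs, ignoring case.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokenize(text)) {
            counts.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String page = "The crawler fetched 3 pages; the parser didn't fail.";
        System.out.println(tokenize(page));
        System.out.println(countWords(page));
    }
}
```

This does nothing about misspellings; a fuzzy tokenizer would need an extra matching step on top of it.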

P.S. Depending on your needs, you might create a state-machine parser to recognize templated parts in raw text. The picture below shows a simple state-machine recognizer (you can construct a more advanced parser that recognizes much more complex templates in text).

[Image: simple state-machine recognizer]
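
The original picture is not available here, so the following is only a sketch of what a simple state-machine recognizer can look like, not a reconstruction of the one in the image. It assumes a made-up template, a decimal number such as 42 or 3.14; the class name NumberRecognizer and its states are hypothetical.

```java
// Hypothetical example: a tiny state machine that accepts "42" or "3.14".
class NumberRecognizer {
    // INTEGER and FRACTION are the accepting states.
    enum State { START, INTEGER, POINT, FRACTION, REJECT }

    // Transition function: next state for the current state and input character.
    static State step(State s, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (s) {
            case START:
                return digit ? State.INTEGER : State.REJECT;
            case INTEGER:
                if (digit) return State.INTEGER;
                return c == '.' ? State.POINT : State.REJECT;
            case POINT:
                return digit ? State.FRACTION : State.REJECT;
            case FRACTION:
                return digit ? State.FRACTION : State.REJECT;
            default:
                return State.REJECT;
        }
    }

    // Feeds the token through the machine and checks the final state.
    static boolean accepts(String token) {
        State s = State.START;
        for (int i = 0; i < token.length() && s != State.REJECT; i++) {
            s = step(s, token.charAt(i));
        }
        return s == State.INTEGER || s == State.FRACTION;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"42", "3.14", "3.", "1.2.3"}) {
            System.out.println(token + " -> " + accepts(token));
        }
    }
}
```

Recognizing a more complex template mostly means adding states and transitions to step.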

Answer 2

Answer by William Niu

It looks like you're looking for a Named Entity Recogniser.

You have got a couple of choices.

CRFClassifier, from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.

GATE (General Architecture for Text Engineering) is an open-source suite for language processing. Take a look at the screenshots on the page for developers: http://gate.ac.uk/family/developer.html. They should give you a brief idea of what it can do. The video tutorial gives a better overview of what the software has to offer.

You may need to customise one of them to fit your needs.

You also have other options:

simple text extraction via Web services, e.g. Tagthe.net and Yahoo's Term Extractor.
part-of-speech (POS) tagging: extracting parts of speech (e.g. verbs, nouns) from the text. Here is a post on SO: What is a good Java library for Parts-Of-Speech tagging?

As for training CRFClassifier, you can find a brief explanation in their FAQ:

...the training data should be in tab-separated columns, and you define the meaning of those columns via a map. One column should be called "answer" and has the NER class, and existing features know about names like "word" and "tag". You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions...
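
To make the quoted description more concrete, here is a small sketch that uses only the Java standard library to build tab-separated training rows and a matching set of properties. The property names (trainFile, serializeTo, map, useWord) are written from memory of the Stanford NER FAQ and should be checked against the NERFeatureFactory Javadoc; the class name NerTrainingFiles, the file names, and the labels are made up for illustration.

```java
import java.util.Properties;

// Hypothetical helper; it only prepares inputs and does not call Stanford NER.
class NerTrainingFiles {
    // Each row is {token, NER class}; column 0 is "word", column 1 is "answer".
    static String toTsv(String[][] labelledTokens) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : labelledTokens) {
            sb.append(row[0]).append('\t').append(row[1]).append('\n');
        }
        return sb.toString();
    }

    // Property names are assumptions based on the FAQ; verify before use.
    static Properties trainingProperties(String trainFile, String modelFile) {
        Properties p = new Properties();
        p.setProperty("trainFile", trainFile);    // the tab-separated data file
        p.setProperty("serializeTo", modelFile);  // where the trained model is saved
        p.setProperty("map", "word=0,answer=1");  // meaning of each column
        p.setProperty("useWord", "true");         // one feature to generate
        return p;
    }

    public static void main(String[] args) {
        String[][] rows = { {"Renato", "PERSON"}, {"wrote", "O"}, {"a", "O"}, {"crawler", "O"} };
        System.out.print(toTsv(rows));
        Properties p = trainingProperties("train.tsv", "ner-model.ser.gz");
        for (String name : p.stringPropertyNames()) {
            System.out.println(name + " = " + p.getProperty(name));
        }
    }
}
```

A real properties file would normally enable more features than useWord.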

You can also find a code snippet at the javadoc of CRFClassifier:

Typical command-line usage
For running a trained model with a provided serialized classifier on a text file:
java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile
To train and test a simple NER model from the command line:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output

Answer 3

Answer by scott

If you're dealing with large amounts of data, maybe Apache's Lucene will help with what you need.

Otherwise it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class. That way, you control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or a number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.
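
The 20110723 question can be sketched as a two-pass check: a first pass with Pattern sorts tokens by shape, and a second pass tests whether an eight-digit run is also a valid calendar date. Everything here is an assumption for illustration: the class name TokenClassifier, the Kind categories, and the rule that only years 1900 to 2100 count as dates.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

// Hypothetical example of a Pattern-based, two-pass token classifier.
class TokenClassifier {
    enum Kind { DATE, NUMBER, WORD, OTHER }

    private static final Pattern DIGITS = Pattern.compile("\\d+");
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    static Kind classify(String token) {
        // Pass 1: decide by shape alone.
        if (WORD.matcher(token).matches()) return Kind.WORD;
        if (!DIGITS.matcher(token).matches()) return Kind.OTHER;
        // Pass 2: an 8-digit number might really be a yyyyMMdd date.
        if (token.length() == 8 && isPlausibleDate(token)) return Kind.DATE;
        return Kind.NUMBER;
    }

    // True if the digits form a real calendar date in an assumed year range.
    static boolean isPlausibleDate(String digits) {
        try {
            LocalDate d = LocalDate.parse(digits, DateTimeFormatter.BASIC_ISO_DATE);
            return d.getYear() >= 1900 && d.getYear() <= 2100;
        } catch (DateTimeParseException e) {
            return false; // e.g. month 13 or February 30
        }
    }

    public static void main(String[] args) {
        for (String token : new String[] {"20110723", "20111345", "42", "crawler"}) {
            System.out.println(token + " -> " + classify(token));
        }
    }
}
```

Whether a valid-looking date should really win over a plain number depends on context, which is where further passes over the surrounding tokens would come in.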

Answer 4

Answer by sumit

I recommend looking at LingPipe too. If you are OK with web services, then this article has a good summary of different APIs.

Answer 5

Answer by Michael-O

I'd adapt Lucene's Analysis and Stemmer classes rather than reinvent the wheel. They cover the vast majority of cases. See also the additional and contrib classes.
