Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/6800509/

Time: 2020-10-30 17:19:45  Source: igfitidea

Are there APIs for text analysis/mining in Java?

Tags: java, api, nlp, analysis, text-mining

Asked by Renato Dinhani

I want to know if there is an API for text analysis in Java: something that can extract all the words in a text and separate words, expressions, etc., and that can tell whether a word it finds is a number, date, year, name, currency, etc.

I'm just starting with text analysis, so I only need an API to get started. I built a web crawler, and now I need something to analyze the downloaded data. I need methods to count the number of words in a page, find similar words, detect data types, and handle other resources related to the text.

Are there APIs for text analysis in Java?

EDIT: Text mining. I want to mine the text, and I'm looking for a Java API that provides this.

Accepted answer by stemm

For example, you might use some classes from the standard library package java.text, or use StreamTokenizer (which you can customize according to your requirements). But as you know, text data from internet sources usually contains many spelling mistakes, and for better results you have to use something like a fuzzy tokenizer; java.text and the other standard utilities have too limited capabilities in that context.
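
As a minimal sketch of the standard-library route, the example below counts words with java.text.BreakIterator. The answer only names the java.text package, so the choice of BreakIterator, the class name WordExtractor, and the letter-or-digit filter are assumptions made for illustration. Exact boundaries around hyphens and apostrophes depend on the JDK's rules, so test it on your own data.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: extracts words from plain text using only java.text.
class WordExtractor {

    /** Splits text at word boundaries and keeps segments containing a letter or digit. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Boundaries also delimit whitespace and punctuation; skip those segments.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = words("Count the words in this page, please.");
        System.out.println(found.size() + " words: " + found);
    }
}
```

This covers the "count the number of words in a page" requirement from the question, but it does nothing about misspellings, which is the limitation the answer points out.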

So I'd advise you to use regular expressions (java.util.regex) and create your own kind of tokenizer according to your needs.
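
A custom tokenizer along these lines could be sketched as follows. The token categories, the patterns, and the class name RegexTokenizer are hypothetical and deliberately simplistic (for instance, only ISO-style dates and dollar amounts are recognized); they would need tuning for real crawled text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based tokenizer; categories and patterns are illustrative only.
class RegexTokenizer {

    /** A matched piece of text together with the category assigned to it. */
    static final class Token {
        final String type;
        final String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + ":" + text;
        }
    }

    // Group names, in the same priority order as the alternatives below.
    private static final String[] TYPES = {"DATE", "CURRENCY", "YEAR", "NUMBER", "WORD"};

    // Earlier alternatives win, so more specific shapes are listed first.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<DATE>\\b\\d{4}-\\d{2}-\\d{2}\\b)"
            + "|(?<CURRENCY>\\$\\d+(?:\\.\\d{2})?)"
            + "|(?<YEAR>\\b(?:19|20)\\d{2}\\b)"
            + "|(?<NUMBER>\\b\\d+(?:\\.\\d+)?\\b)"
            + "|(?<WORD>\\p{L}+)");

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Exactly one named group is non-null for each match.
            for (String type : TYPES) {
                if (m.group(type) != null) {
                    tokens.add(new Token(type, m.group(type)));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Paid $19.99 on 2011-07-23 for 3 books in 2011"));
    }
}
```

Putting the alternatives in priority order is the main design choice here: a four-digit value such as 2011 is tried as a year before it falls through to the generic number pattern.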

P.S. Depending on your needs, you might create a state-machine parser for recognizing templated parts in raw text. A simple state-machine recognizer is shown in the picture below (you can construct a more advanced parser that recognizes much more complex templates in text).

[Image: simple state-machine recognizer diagram, not reproduced here]
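
Since the original picture is not available, here is a hedged sketch of what a small state-machine recognizer might look like. It is not the author's diagram: the template (an optionally signed decimal number), the state names, and the class name NumberRecognizer are all assumptions chosen for illustration.

```java
// Hypothetical recognizer for optionally signed decimal numbers such as "42" or "-3.14".
class NumberRecognizer {

    private enum State { START, SIGN, INTEGER, DOT, FRACTION, REJECT }

    /** Returns true if the whole string matches the template sign? digits (. digits)? */
    static boolean matches(String s) {
        State state = State.START;
        for (int i = 0; i < s.length() && state != State.REJECT; i++) {
            state = next(state, s.charAt(i));
        }
        // Only these two states mean a complete number has been read.
        return state == State.INTEGER || state == State.FRACTION;
    }

    /** Transition function: the current state plus one input character gives the next state. */
    private static State next(State state, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case START:
                if (c == '+' || c == '-') {
                    return State.SIGN;
                }
                return digit ? State.INTEGER : State.REJECT;
            case SIGN:
                return digit ? State.INTEGER : State.REJECT;
            case INTEGER:
                if (c == '.') {
                    return State.DOT;
                }
                return digit ? State.INTEGER : State.REJECT;
            case DOT:
            case FRACTION:
                return digit ? State.FRACTION : State.REJECT;
            default:
                return State.REJECT;
        }
    }

    public static void main(String[] args) {
        for (String sample : new String[] {"42", "-3.14", "3.", "1.2.3", "abc"}) {
            System.out.println(sample + " -> " + matches(sample));
        }
    }
}
```

A more advanced parser would follow the same shape with more states and transitions, for example one set of states per template it needs to recognize.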

Answer by William Niu

It looks like you're looking for a Named Entity Recogniser.

You have got a couple of choices.

CRFClassifier, from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.

GATE (General Architecture for Text Engineering) is an open source suite for language processing. Take a look at the screenshots on the page for developers: http://gate.ac.uk/family/developer.html. It should give you a brief idea of what it can do. The video tutorial gives you a better overview of what this software has to offer.

You may need to customise one of them to fit your needs.

You also have other options:

As for training CRFClassifier, you can find a brief explanation in their FAQ:

...the training data should be in tab-separated columns, and you define the meaning of those columns via a map. One column should be called "answer" and has the NER class, and existing features know about names like "word" and "tag". You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions...
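
To make the quoted format concrete, the sketch below uses only the Java standard library to build a two-column, tab-separated training file and a matching properties object. The property names (trainFile, serializeTo, map, useWord, useClassFeature) are given as I recall them from the Stanford NER FAQ example and should be checked against the NERFeatureFactory Javadoc. The file names, the PERSON label, and the class name NerTrainingFiles are hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

// Hypothetical helper that prepares inputs for training; it does not call Stanford NER itself.
class NerTrainingFiles {

    /** Builds training data with one token per line: the word, a tab, then its NER class. */
    static String trainingData(String[][] labelledTokens) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : labelledTokens) {
            sb.append(row[0]).append('\t').append(row[1]).append('\n');
        }
        return sb.toString();
    }

    /** Builds a minimal properties set; names are assumed from the FAQ example, so verify them. */
    static Properties trainingProperties(String trainFile, String modelFile) {
        Properties p = new Properties();
        p.setProperty("trainFile", trainFile);
        p.setProperty("serializeTo", modelFile);
        // Column 0 holds the word, column 1 holds the "answer" (the NER class).
        p.setProperty("map", "word=0,answer=1");
        p.setProperty("useWord", "true");
        p.setProperty("useClassFeature", "true");
        return p;
    }

    public static void main(String[] args) throws IOException {
        // "O" marks tokens outside any entity; PERSON is an illustrative label.
        String[][] sample = {{"Emma", "PERSON"}, {"lived", "O"}, {"nearby", "O"}};
        System.out.print(trainingData(sample));

        StringWriter out = new StringWriter();
        trainingProperties("train.tsv", "ner-model.ser.gz").store(out, "hypothetical NER training config");
        System.out.print(out);
    }
}
```

Writing the two outputs to files would give inputs that can be passed to CRFClassifier with the -prop option.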

You can also find a code snippet in the Javadoc of CRFClassifier:

Typical command-line usage

For running a trained model with a provided serialized classifier on a text file:

java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt

When specifying all parameters in a properties file (train, test, or runtime):

java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile

To train and test a simple NER model from the command line:

java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output

Answer by scott

If you're dealing with large amounts of data, maybe Apache's Lucene will help with what you need.

Otherwise it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class. That way, you can control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.
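
The 20110723 question can be approached with two passes, sketched below under stated assumptions: the first pass uses Pattern to decide the coarse shape of a token, and the second pass checks whether an eight-digit token is also a valid calendar date in yyyyMMdd form. The class name DateOrNumber, the category names, and the choice of yyyyMMdd as the only date layout are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

// Hypothetical two-pass classifier for the "date or number?" ambiguity.
class DateOrNumber {

    private static final Pattern DIGITS = Pattern.compile("\\d+");
    private static final Pattern EIGHT_DIGITS = Pattern.compile("\\d{8}");

    static String classify(String token) {
        // Pass 1: coarse shape. Anything that is not all digits is treated as a word here.
        if (!DIGITS.matcher(token).matches()) {
            return "WORD";
        }
        // Pass 2: refine. Eight digits that form a valid yyyyMMdd date are called a date.
        if (EIGHT_DIGITS.matcher(token).matches()) {
            try {
                LocalDate.parse(token, DateTimeFormatter.BASIC_ISO_DATE);
                return "DATE";
            } catch (DateTimeParseException e) {
                // Not a valid calendar date (for example month 13), so treat it as a number.
            }
        }
        return "NUMBER";
    }

    public static void main(String[] args) {
        for (String token : new String[] {"20110723", "20111345", "12345", "hello"}) {
            System.out.println(token + " -> " + classify(token));
        }
    }
}
```

Being a valid date does not prove the author meant a date, so a further pass that looks at neighbouring tokens (such as "on" or "ID:") may still be needed to "understand" the data.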

Answer by sumit

I recommend looking at LingPipe too. If you are OK with web services, then this article has a good summary of different APIs.

Answer by Michael-O

I'd adapt Lucene's Analysis and Stemmer classes rather than reinvent the wheel. They cover the vast majority of cases. See also the additional and contrib classes.
