Simple Natural Language Processing Startup for Java

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/5833030/

Tags: java, nlp

Asked by shababhsiddique

I want to start developing a project on NLP, but I don't know much about the tools available. After googling for about a month, I realized that OpenNLP could be my solution.

Unfortunately, I don't see any complete tutorial on using the API; all of them are missing some general steps. I need a tutorial that starts from the ground up. I have seen a lot of downloads on the site, but I don't know how to use them. Do I need to train something first? Here is what I want to know:

How do I install/set up an NLP system that can:

  1. parse the words of an English sentence
  2. identify the different parts of speech

Answered by AaronD

You say that you need to 'parse' each sentence. You probably already know this, but just to be explicit, in NLP, the term 'parse' usually means to recover some hierarchical syntactic structure. The most common types are constituent structure (e.g., via a context-free grammar) and dependency structure.

If you need hierarchical structure, I'd recommend you consider just starting with a parser. Most parsers I'm aware of include POS tagging during parsing, and may provide higher accuracy tagging than finite-state POS taggers (Caveat - I'm much more familiar with constituent parsers than with dependency parsers. It's possible some or most dependency parsers would require POS tags as input).

The big downside to parsing is the time complexity. Finite-state POS taggers often run at thousands of words per second. Even greedy dependency parsers are considerably slower, and constituent parsers generally run at 1-5 sentences per second. So if you don't need hierarchical structure, you probably want to stick with a finite-state POS tagger for efficiency.

If you do decide you need parse structure, a few recommendations:

I think the Stanford parser suggested by @aab includes both a constituent parser and a dependency parser.

The Berkeley Parser (http://code.google.com/p/berkeleyparser/) is a pretty well-known PCFG constituent parser; it achieves state-of-the-art accuracy (equal or superior to the Stanford parser, I believe) and is reasonably efficient (~3-5 sentences per second).

The BUBS Parser (http://code.google.com/p/bubs-parser/) can also run with the high-accuracy Berkeley grammar, and improves efficiency to around 15-20 sentences/second. Full disclosure - I'm one of the primary researchers working on this parser.

Warning: both of these parsers are research code, with all the problems that engenders. But I'd love to see people actually using BUBS, so if it's of use to you, give it a try and contact me with problems, comments, suggestions, etc.

Wikipedia also has a couple of background articles on these topics if you need them.

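For a concrete sense of what driving one of these constituent parsers from Java looks like, here is a minimal sketch using the Stanford parser's LexicalizedParser class. It follows roughly what the ParserDemo program bundled with recent Stanford parser releases does; the model path, the pre-tokenized input, and the exact method names are assumptions you should verify against the demo code in the version you download.

    import java.util.List;

    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.ling.Sentence;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
    import edu.stanford.nlp.trees.Tree;

    public class ConstituentParseSketch {
        public static void main(String[] args) {
            // The englishPCFG model ships inside the models jar of the Stanford
            // parser download; this is its conventional classpath location.
            LexicalizedParser parser = LexicalizedParser.loadModel(
                    "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

            // Pre-tokenized input; in practice you would run a tokenizer first.
            List<HasWord> sentence = Sentence.toWordList(
                    "The", "dog", "barked", "at", "the", "mailman", ".");

            Tree tree = parser.parse(sentence);  // constituent (phrase-structure) tree
            tree.pennPrint();                    // prints (ROOT (S (NP (DT The) ...
        }
    }

The same distribution can also convert that constituent tree into typed dependencies (via its GrammaticalStructure classes), which is one way to obtain the dependency structure discussed above.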

Answered by aab

Generally you'd do these two tasks in the opposite order:

  1. Do part-of-speech tagging
  2. Run a parser using the POS tags as input

OpenNLP's documentation isn't that thorough, and some of it has become hard to find since the switch to Apache. Some (potentially slightly out-of-date) tutorials are available in the old SourceForge wiki.

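Since the documentation is thin, here is a rough sketch of step 1 (POS tagging) with the OpenNLP API. It assumes the pre-trained en-token.bin and en-pos-maxent.bin models from the OpenNLP models download sit in the working directory, and the class names reflect the 1.5-era opennlp.tools API, so check both against the version you install.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class OpenNlpTaggingSketch {
        public static void main(String[] args) throws Exception {
            // Pre-trained model files are downloaded separately from the OpenNLP
            // models page; adjust the paths to wherever you saved them.
            try (InputStream tokStream = new FileInputStream("en-token.bin");
                 InputStream posStream = new FileInputStream("en-pos-maxent.bin")) {

                TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokStream));
                POSTaggerME tagger = new POSTaggerME(new POSModel(posStream));

                String[] tokens = tokenizer.tokenize("This is a simple English sentence.");
                String[] tags = tagger.tag(tokens);  // step 1: one Penn Treebank tag per token

                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + "/" + tags[i]);
                }
                // Step 2 would feed these tokens (and their tags) into a parser.
            }
        }
    }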

You might want to take a look at the Stanford NLP tools, in particular the Stanford POS Tagger and the Stanford Parser. Both downloads include pre-trained model files, demo files in the top-level directory that show how to get started with the API, and short shell scripts that show how to use the tools from the command line.

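As a rough illustration of how little code the Stanford POS Tagger needs (this mirrors the bundled demo as I recall it; the model filename comes from the tagger download's models/ directory and may differ between releases, so treat both as assumptions):

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class StanfordTaggerSketch {
        public static void main(String[] args) throws Exception {
            // One of the pre-trained model files in the tagger download's models/
            // directory; the exact filename varies between releases.
            MaxentTagger tagger =
                    new MaxentTagger("models/english-left3words-distsim.tagger");

            String tagged = tagger.tagString("This is a simple English sentence.");
            System.out.println(tagged);  // e.g. This_DT is_VBZ a_DT simple_JJ ...
        }
    }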

LingPipe might be another good toolkit to check out. A quick search here will lead you to a number of similar questions with links to other alternatives, too!

Answered by Daniel

See Illinois-Curator: http://cogcomp.cs.illinois.edu/page/software_view/Curator

Demo: http://cogcomp.cs.illinois.edu/curator/demo/

It gives you almost everything in one place.

Answered by Robert Bossy

The most popular frameworks are:

  • GATE: easy to use and fairly quick to start with
  • UIMA: steeper learning curve, but more efficient and more generic