Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/2293636/

Time: 2020-08-13 05:45:59  Source: igfitidea

What is a good Java library for Parts-Of-Speech tagging?

Tags: java, nlp

Asked by Glenn

I'm looking for a good open source POS tagger in Java. Here's what I have come up with so far.

Anybody got any recommendations?

Answered by Shashikant Kore

I have used OpenNLP with good results. You can also check out MorphAdorner.

Answered by João Silva

I've used both LingPipe and Stanford's POS Tagger. The latter is a state-of-the-art POS tagger but, in my experience, it is too slow (although they do provide less accurate models that are reasonably fast). Of course, it always depends on what you are trying to achieve, and there will always be a trade-off between speed and accuracy.

I also once used an LBJ-based NER package and, although it was pretty accurate, the source code was a complete mess. Both LingPipe's and Stanford's source code are very clean and well documented.

You can also take a look at LTAG-spinal. I haven't used it yet, but from the algorithm description, and from the listed accuracy, it sure seems better than the alternatives you have so far.

Hope it helps.

Answered by hashable

Are you looking to tag POS in a specific domain? Most general-purpose taggers are trained on newswire text. Typically they don't perform well when you use them in specific domains (such as biomedical text). There are other taggers trained specifically for such domains, such as dTagger (Java) for biomedical text.

For newswire text, Adwait Ratnaparkhi's MXPOST is very good and is the one I would recommend.

Other Java implementations include:

  1. MontyLingua
  2. Berkeley Parser (not really a POS tagger, but full-blown parsers typically include one. Search for "Java syntactic parsers" and you will find many.)
  3. QTag
  4. LBJ

OpenNLP and LingPipe, as mentioned by the other posters, are also pretty decent.

Information on the state of the art in POS tagging can be found here. As you can see, LTAG-Spinal (also mentioned by another poster) ranks best as of now, but the variation across the various taggers is small. I have not used LTAG myself.

Also note that the baseline performance for POS tagging is about 90%. The baseline means (a) tagging every known word with its most frequent POS tag from a lexicon, and (b) tagging every unknown word as a noun.

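To make the baseline concrete, here is a minimal sketch of a most-frequent-tag tagger in plain Java. It is not taken from any of the libraries mentioned in this thread; the class name `BaselineTagger`, the tiny training data, and the choice of the Penn Treebank tag `NN` for unknown words are all assumptions made for illustration. A real lexicon would be built from a tagged corpus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the most-frequent-tag baseline.
final class BaselineTagger {
    // Assumption: unknown words are tagged as singular nouns (Penn Treebank "NN").
    static final String UNKNOWN_TAG = "NN";

    // Builds a lexicon mapping each word to its most frequent tag,
    // given parallel lists of words and their gold tags.
    static Map<String, String> train(List<String> words, List<String> tags) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            counts.computeIfAbsent(words.get(i), w -> new HashMap<>())
                  .merge(tags.get(i), 1, Integer::sum);
        }
        Map<String, String> lexicon = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            String best = null;
            int bestCount = -1;
            // Ties are broken arbitrarily (by map iteration order).
            for (Map.Entry<String, Integer> tagCount : entry.getValue().entrySet()) {
                if (tagCount.getValue() > bestCount) {
                    best = tagCount.getKey();
                    bestCount = tagCount.getValue();
                }
            }
            lexicon.put(entry.getKey(), best);
        }
        return lexicon;
    }

    // (a) known words get their most frequent tag, (b) unknown words get UNKNOWN_TAG.
    static List<String> tag(Map<String, String> lexicon, List<String> sentence) {
        List<String> result = new ArrayList<>();
        for (String word : sentence) {
            result.add(lexicon.getOrDefault(word, UNKNOWN_TAG));
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy training data: "can" appears twice as a modal and once as a noun.
        List<String> words = Arrays.asList("the", "can", "can", "can", "dog", "runs");
        List<String> tags = Arrays.asList("DT", "MD", "MD", "NN", "NN", "VBZ");
        Map<String, String> lexicon = train(words, tags);

        // "cat" and "run" are not in the lexicon, so they fall back to NN.
        System.out.println(tag(lexicon, Arrays.asList("the", "cat", "can", "run")));
        // prints [DT, NN, MD, NN]
    }
}
```

Note that "run" is mis-tagged as a noun here, which is exactly the kind of error the remaining 10% or so is made of. Any tagger you evaluate should be compared against this sort of baseline rather than against zero.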