Lemmatization java

Disclaimer: this page is drawn from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1578062/


Lemmatization java

java nlp

Asked by Ilija

I am looking for a lemmatisation implementation for English in Java. I have already found a few, but I need something that does not need too much memory to run (1 GB tops). Thanks. I do not need a stemmer.

Answered by Zed

Check out Lucene Snowball.
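
For reference, here is a minimal sketch of calling the Snowball stemmer that ships with Lucene (this assumes the lucene-analyzers-common jar, which contains the org.tartarus.snowball classes, is on the classpath). Note that Snowball is a stemmer rather than a true lemmatizer, so it may not satisfy the "no stemmer" requirement above:

import org.tartarus.snowball.ext.EnglishStemmer;

public class SnowballExample {
    public static void main(String[] args) {
        // setCurrent/stem/getCurrent is the standard SnowballProgram API
        EnglishStemmer stemmer = new EnglishStemmer();
        for (String word : new String[] {"walked", "plants", "abilities"}) {
            stemmer.setCurrent(word);
            stemmer.stem();
            System.out.println(word + " -> " + stemmer.getCurrent());
        }
    }
}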

Answered by user187457

There is a JNI binding for Hunspell, which is the spell checker used in OpenOffice and Firefox. http://hunspell.sourceforge.net/
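
The exact Java API depends on which JNI/JNA binding you choose, so the HunspellBridge class in this sketch is purely hypothetical; it only illustrates the typical flow (load the .aff/.dic dictionary files, then ask for stems, which Hunspell derives from morphological analysis rather than plain suffix stripping):

// Hypothetical sketch: "HunspellBridge" stands in for whichever JNI/JNA
// binding you pick; real bindings differ in class and method names.
public class HunspellLemmaSketch {
    public static void main(String[] args) {
        // Dictionary file paths are placeholders
        HunspellBridge speller = HunspellBridge.create("en_US.aff", "en_US.dic");
        // Assumes stem() returns a List<String> of candidate base forms
        for (String stem : speller.stem("walked")) {
            System.out.println(stem); // expected: "walk"
        }
    }
}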

Answered by Chris

The Stanford CoreNLP Java library contains a lemmatizer that is a little resource-intensive, but I have run it on my laptop with <512 MB of RAM.

To use it:

  1. Download the jar files;
  2. Create a new project in your editor of choice/make an ant script that includes all of the jar files contained in the archive you just downloaded;
  3. Create a new Java class as shown below (based upon the snippet from Stanford's site).
import java.util.Properties;
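// Note: the CoreNLP imports (StanfordCoreNLP, Annotation, CoreMap, CoreLabel,
// and the *Annotation classes) are omitted in this snippet; the corrected
// listing in the next answer includes them.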

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        // StanfordCoreNLP loads a lot of models, so you probably
        // only want to do this once per execution
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }

        return lemmas;
    }
}
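
A minimal usage sketch (assuming the CoreNLP jars and models from the download above are on the classpath; the exact lemmas depend on the model version):

StanfordLemmatizer lemmatizer = new StanfordLemmatizer();
// Prints something like [how, be, you, ?]
System.out.println(lemmatizer.lemmatize("How are you?"));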

Answered by Tihamer

Chris's answer regarding the Stanford Lemmatizer is great! Absolutely beautiful. He even included a pointer to the jar files, so I didn't have to google for it.

But one of his lines of code had a syntax error (he somehow switched the closing parenthesis and semicolon at the end of the line that begins with "lemmas.add..."), and he forgot to include the imports.

As far as the NoSuchMethodError goes, it is usually caused by the method in question not being public static; but if you look at the code itself (at http://grepcode.com/file/repo1.maven.org/maven2/com.guokr/stan-cn-nlp/0.0.2/edu/stanford/nlp/util/Generics.java?av=h) that is not the problem. I suspect that the problem is somewhere in the build path (I'm using Eclipse Kepler, so it was no problem configuring the 33 jar files that I use in my project).

Below is my minor correction of Chris's code, along with an example (my apologies to Evanescence for butchering their perfect lyrics):

import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        /*
         * This is a pipeline that takes in a string and returns various analyzed linguistic forms. 
         * The String is tokenized via a tokenizer (such as PTBTokenizerAnnotator), 
         * and then other sequence model style annotation can be used to add things like lemmas, 
         * POS tags, and named entities. These are returned as a list of CoreLabels. 
         * Other analysis components build and store parse trees, dependency graphs, etc. 
         * 
         * This class is designed to apply multiple Annotators to an Annotation. 
         * The idea is that you first build up the pipeline by adding Annotators, 
         * and then you take the objects you wish to annotate and pass them in and 
         * get in return a fully annotated object.
         * 
         *  StanfordCoreNLP loads a lot of models, so you probably
         *  only want to do this once per execution
         */
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();
        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);
        // run all Annotators on this text
        this.pipeline.annotate(document);
        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }
        return lemmas;
    }


    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "How could you be seeing into my eyes like open doors? \n"+
                "You led me down into my core where I've became so numb \n"+
                "Without a soul my spirit's sleeping somewhere cold \n"+
                "Until you find it there and led it back home \n"+
                "You woke me up inside \n"+
                "Called my name and saved me from the dark \n"+
                "You have bidden my blood and it ran \n"+
                "Before I would become undone \n"+
                "You saved me from the nothing I've almost become \n"+
                "You were bringing me to life \n"+
                "Now that I knew what I'm without \n"+
                "You can've just left me \n"+
                "You breathed into me and made me real \n"+
                "Frozen inside without your touch \n"+
                "Without your love, darling \n"+
                "Only you are the life among the dead \n"+
                "I've been living a lie, there's nothing inside \n"+
                "You were bringing me to life.";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }

}

Here are my results (I was very impressed; it caught "'s" as "is" (sometimes), and did almost everything else perfectly):

Starting Stanford Lemmatizer

Adding annotator tokenize

Adding annotator ssplit

Adding annotator pos

Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].

Adding annotator lemma

[how, could, you, be, see, into, my, eye, like, open, door, ?, you, lead, I, down, into, my, core, where, I, have, become, so, numb, without, a, soul, my, spirit, 's, sleep, somewhere, cold, until, you, find, it, there, and, lead, it, back, home, you, wake, I, up, inside, call, my, name, and, save, I, from, the, dark, you, have, bid, my, blood, and, it, run, before, I, would, become, undo, you, save, I, from, the, nothing, I, have, almost, become, you, be, bring, I, to, life, now, that, I, know, what, I, be, without, you, can, have, just, leave, I, you, breathe, into, I, and, make, I, real, frozen, inside, without, you, touch, without, you, love, ,, darling, only, you, be, the, life, among, the, dead, I, have, be, live, a, lie, ,, there, be, nothing, inside, you, be, bring, I, to, life, .]

Answered by Joseph Shih

You can try the free Lemmatizer API here: http://twinword.com/lemmatizer.php

Scroll down to find the Lemmatizer endpoint.

This will allow you to reduce "dogs" to "dog" and "abilities" to "ability".

If you pass in a POST or GET parameter called "text" with a string like "walked plants":

// These code snippets use an open-source library. http://unirest.io/java
import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.JsonNode;
import com.mashape.unirest.http.Unirest;

// Send the text to the lemmatizer endpoint. Note that asJson()
// throws UnirestException, so call this from a method that declares or handles it.
HttpResponse<JsonNode> response = Unirest.post("[ENDPOINT URL]")
        .header("X-Mashape-Key", "[API KEY]")
        .header("Content-Type", "application/x-www-form-urlencoded")
        .header("Accept", "application/json")
        .field("text", "walked plants")
        .asJson();

You get a response like this:

你会得到这样的回应:

{
  "lemma": {
    "plant": 1,
    "walk": 1
  },
  "result_code": "200",
  "result_msg": "Success"
}
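
To pull the lemmas out of that response with the same library, something like the following sketch should work (getBody().getObject() returns an org.json JSONObject; error handling omitted):

// Iterate over the "lemma" object and print each lemma with its count
org.json.JSONObject lemmaMap = response.getBody().getObject().getJSONObject("lemma");
for (String lemma : lemmaMap.keySet()) {
    System.out.println(lemma + " -> " + lemmaMap.getInt(lemma)); // e.g. "walk -> 1"
}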