Lemmatization java

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/1578062/
Asked by Ilija
I am looking for a lemmatisation implementation for English in Java. I have found a few already, but I need something that does not need too much memory to run (1 GB tops). Thanks. I do not need a stemmer.
Answered by Zed
Check out Lucene Snowball.
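For what it's worth, here is a minimal sketch of driving the Snowball English stemmer that ships with Lucene (the org.tartarus.snowball.ext.EnglishStemmer class; it assumes the Lucene snowball/analyzers jar is on the classpath). Note that Snowball is a stemmer rather than a true lemmatizer, so it produces stems, not dictionary forms:

import org.tartarus.snowball.ext.EnglishStemmer;

public class SnowballExample {
    public static void main(String[] args) {
        EnglishStemmer stemmer = new EnglishStemmer();
        for (String word : new String[] { "running", "abilities", "doors" }) {
            stemmer.setCurrent(word); // load the surface form
            stemmer.stem();           // apply the Snowball algorithm in place
            System.out.println(word + " -> " + stemmer.getCurrent());
        }
    }
}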
Answered by user187457
There is a JNI binding to Hunspell, which is the spell checker used in OpenOffice and Firefox. http://hunspell.sourceforge.net/
Answered by Chris
The Stanford CoreNLP Java library contains a lemmatizer that is a little resource-intensive, but I have run it on my laptop with <512 MB of RAM.
To use it:
- Download the jar files;
- Create a new project in your editor of choice/make an ant script that includes all of the jar files contained in the archive you just downloaded;
- Create a new Java class as shown below (based upon the snippet from Stanford's site).
import java.util.Properties;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        // StanfordCoreNLP loads a lot of models, so you probably
        // only want to do this once per execution
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }

        return lemmas;
    }
}
Answered by Tihamer
Chris's answer regarding the Stanford Lemmatizer is great! Absolutely beautiful. He even included a pointer to the jar files, so I didn't have to google for it.
But one of his lines of code had a syntax error (he somehow switched the closing parenthesis and semicolon at the end of the line that begins with "lemmas.add..."), and he forgot to include the imports.
As far as the NoSuchMethodError goes, it is usually caused by a method not being made public static, but if you look at the code itself (at http://grepcode.com/file/repo1.maven.org/maven2/com.guokr/stan-cn-nlp/0.0.2/edu/stanford/nlp/util/Generics.java?av=h) that is not the problem. I suspect that the problem is somewhere in the build path (I'm using Eclipse Kepler, so it was no problem configuring the 33 jar files that I use in my project).
Below is my minor correction of Chris's code, along with an example (my apologies to Evanescence for butchering their perfect lyrics):
import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        /*
         * This is a pipeline that takes in a string and returns various analyzed linguistic forms.
         * The String is tokenized via a tokenizer (such as PTBTokenizerAnnotator),
         * and then other sequence model style annotation can be used to add things like lemmas,
         * POS tags, and named entities. These are returned as a list of CoreLabels.
         * Other analysis components build and store parse trees, dependency graphs, etc.
         *
         * This class is designed to apply multiple Annotators to an Annotation.
         * The idea is that you first build up the pipeline by adding Annotators,
         * and then you take the objects you wish to annotate and pass them in and
         * get in return a fully annotated object.
         *
         * StanfordCoreNLP loads a lot of models, so you probably
         * only want to do this once per execution.
         */
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();

        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }

        return lemmas;
    }

    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "How could you be seeing into my eyes like open doors? \n" +
                "You led me down into my core where I've became so numb \n" +
                "Without a soul my spirit's sleeping somewhere cold \n" +
                "Until you find it there and led it back home \n" +
                "You woke me up inside \n" +
                "Called my name and saved me from the dark \n" +
                "You have bidden my blood and it ran \n" +
                "Before I would become undone \n" +
                "You saved me from the nothing I've almost become \n" +
                "You were bringing me to life \n" +
                "Now that I knew what I'm without \n" +
                "You can've just left me \n" +
                "You breathed into me and made me real \n" +
                "Frozen inside without your touch \n" +
                "Without your love, darling \n" +
                "Only you are the life among the dead \n" +
                "I've been living a lie, there's nothing inside \n" +
                "You were bringing me to life.";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }
}
Here are my results (I was very impressed; it caught "'s" as "is" (sometimes), and did almost everything else perfectly):
Starting Stanford Lemmatizer
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].
Adding annotator lemma
[how, could, you, be, see, into, my, eye, like, open, door, ?, you, lead, I, down, into, my, core, where, I, have, become, so, numb, without, a, soul, my, spirit, 's, sleep, somewhere, cold, until, you, find, it, there, and, lead, it, back, home, you, wake, I, up, inside, call, my, name, and, save, I, from, the, dark, you, have, bid, my, blood, and, it, run, before, I, would, become, undo, you, save, I, from, the, nothing, I, have, almost, become, you, be, bring, I, to, life, now, that, I, know, what, I, be, without, you, can, have, just, leave, I, you, breathe, into, I, and, make, I, real, frozen, inside, without, you, touch, without, you, love, ,, darling, only, you, be, the, life, among, the, dead, I, have, be, live, a, lie, ,, there, be, nothing, inside, you, be, bring, I, to, life, .]
Answered by Joseph Shih
You can try the free Lemmatizer API here: http://twinword.com/lemmatizer.php
Scroll down to find the Lemmatizer endpoint.
This will allow you to turn "dogs" into "dog" and "abilities" into "ability".
If you pass in a POST or GET parameter called "text" with a string like "walked plants":
// These code snippets use an open-source library. http://unirest.io/java
import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.JsonNode;
import com.mashape.unirest.http.Unirest;

HttpResponse<JsonNode> response = Unirest.post("[ENDPOINT URL]")
        .header("X-Mashape-Key", "[API KEY]")
        .header("Content-Type", "application/x-www-form-urlencoded")
        .header("Accept", "application/json")
        .field("text", "walked plants")
        .asJson();
You get a response like this:
{
    "lemma": {
        "plant": 1,
        "walk": 1
    },
    "result_code": "200",
    "result_msg": "Success"
}
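For completeness, here is a rough sketch (mine, not from the API docs) of reading the lemma counts out of that JSON, using the same Unirest and org.json types as the request snippet above:

import java.util.Iterator;
import org.json.JSONObject;

// "response" is the HttpResponse<JsonNode> returned by the request above
JSONObject body = response.getBody().getObject();
JSONObject lemmaMap = body.getJSONObject("lemma");

// Print each lemma and how many times it occurred in the input text
Iterator<String> lemmaKeys = lemmaMap.keys();
while (lemmaKeys.hasNext()) {
    String lemma = lemmaKeys.next();
    System.out.println(lemma + ": " + lemmaMap.getInt(lemma));
}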