Extracting nouns from text (Java)

Question

Asked by Phil

Does anyone know the easiest way to extract only nouns from a body of text?

I've heard about the TreeTagger tool and gave it a shot, but couldn't get it to work for some reason.

Any suggestions?

Thanks, Phil

EDIT:

import org.annolab.tt4j.*;

TreeTaggerWrapper tt = new TreeTaggerWrapper();
try {
    tt.setModel("/Nouns/english.par");
    tt.setHandler(new TokenHandler() {
        void token(String token, String pos, String lemma) {
            System.out.println(token + "\t" + pos + "\t" + lemma);
        }
    });
    tt.process(words); // words = list of words
} finally {
    tt.destroy();
}

That is my code; the language is English. I get this error: The type new TokenHandler(){} must implement the inherited abstract method TokenHandler.token. Am I doing something wrong?

Answer 1

Answered by peter.murray.rust

First you will have to tokenize your text. This may seem trivial (splitting on whitespace may work for you), but doing it rigorously is harder. Then you have to decide what counts as a noun. Does "the car park" contain one noun (car park), two nouns (car, park), or one noun (park) and one adjective (car)? This is a hard problem, but again you may be able to get by without solving it.
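
To make the tokenization point concrete, here is a minimal sketch that compares a plain whitespace split with a slightly more careful regex tokenizer. The class name SimpleTokenizer and the regex are illustrative assumptions, not part of any toolkit mentioned in this thread, and a real tokenizer handles many more cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
public class SimpleTokenizer {
    // A word is a run of letters/digits, optionally joined by internal
    // apostrophes or hyphens; any other non-space character is its own token.
    private static final Pattern TOKEN = Pattern.compile(
            "[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*|[^\\s\\p{L}\\p{N}]");

    // Naive approach: split on runs of whitespace. Punctuation stays glued to words.
    public static List<String> whitespaceSplit(String text) {
        List<String> out = new ArrayList<String>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    // Regex approach: punctuation is separated from the neighbouring word.
    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "I parked in the car park, didn't I?";
        System.out.println(whitespaceSplit(s)); // "park," and "I?" keep their punctuation
        System.out.println(tokenize(s));        // "park" and "," are separate tokens
    }
}
```

Neither version decides whether "car park" is one noun or two; that decision is left to the tagger or to your own rules.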

Does "I saw the xyzzy" identify a noun not in a dictionary? The word "the" probably identifies xyzzy as a noun.

Where are the nouns in "time flies like an arrow"? Compare with "fruit flies like a banana" (thanks to Groucho Marx).

We use the Brown tagger (Java) (http://en.wikipedia.org/wiki/Brown_Corpus) in the OpenNLP toolkit (opennlp.tools.lang.english.PosTagger; opennlp.tools.postag.POSDictionary on http://opennlp.sourceforge.net/) to find nouns in normal English, and I'd recommend starting with that - it does most of your thinking for you. Otherwise look at any of the POS taggers listed at http://en.wikipedia.org/wiki/POS_tagger or http://www-nlp.stanford.edu/links/statnlp.html#Taggers.

In part-of-speech tagging by computer, it is typical to distinguish between 50 and 150 separate parts of speech for English: for example, NN for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns (see the POS tags used in the Brown Corpus).
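
Once a tagger has assigned a tag to every token, extracting the nouns is a simple filter on the tag. The sketch below assumes the tagger output is available as two parallel lists; the class NounFilter is hypothetical, and the tag set is an assumption: NN, NNS and NP come from the text above, while NPS, NNP and NNPS are added because some tagsets use them for proper nouns. Check the tagset of the model you actually use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only.
public class NounFilter {
    // Assumed noun tags; adjust to the tagset of your tagger model.
    private static final Set<String> NOUN_TAGS = new HashSet<String>(
            Arrays.asList("NN", "NNS", "NP", "NPS", "NNP", "NNPS"));

    public static boolean isNounTag(String pos) {
        return pos != null && NOUN_TAGS.contains(pos);
    }

    // tokens.get(i) is assumed to carry the tag tags.get(i).
    public static List<String> nouns(List<String> tokens, List<String> tags) {
        if (tokens.size() != tags.size()) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            if (isNounTag(tags.get(i))) {
                out.add(tokens.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tags written by hand for the example, not produced by a tagger.
        List<String> tokens = Arrays.asList("The", "cars", "left", "London");
        List<String> tags = Arrays.asList("DT", "NNS", "VBD", "NP");
        System.out.println(nouns(tokens, tags)); // prints [cars, London]
    }
}
```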

There is a very full list of NLP toolkits at http://en.wikipedia.org/wiki/Natural_language_processing_toolkits. I would strongly suggest you use one of those rather than trying to match against WordNet or other collections.

Answer 2

Answered by teabot

Check out LingPipe. It can supposedly pick out named entities from English text. But I must confess that NLP isn't my area of expertise.

Answer 3

Answered by Maximilian Mayerl

Based on your edit:

The error says that you must implement the abstract method token. Your anonymous inner class does define a method named token, so the likely cause is that its signature doesn't match the signature of the abstract method declared in TokenHandler.
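
A likely form of that mismatch, judging by the working code in Answer 4, is that TokenHandler takes a type parameter and that the implementing method must be public. The sketch below uses a hypothetical stand-in interface, Handler, with the same shape, so it compiles without tt4j; it illustrates the Java rule and is not the tt4j API itself.

```java
import java.util.ArrayList;
import java.util.List;

public class HandlerSignatureDemo {
    // Hypothetical stand-in for a generic token callback; NOT the tt4j type.
    interface Handler<O> {
        void token(O token, String pos, String lemma);
    }

    public static List<String> collect() {
        final List<String> seen = new ArrayList<String>();
        // Writing `new Handler() { void token(String t, String p, String l) {...} }`
        // does not compile: through the raw type the abstract method is
        // token(Object, String, String), so a token(String, String, String)
        // method does not implement it. Interface methods are also implicitly
        // public, so the implementation has to be declared public.
        Handler<String> handler = new Handler<String>() {
            @Override
            public void token(String token, String pos, String lemma) {
                seen.add(token + "\t" + pos + "\t" + lemma);
            }
        };
        handler.token("cars", "NNS", "car");
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collect());
    }
}
```

Adding @Override is a cheap safeguard: the compiler then rejects the method immediately if it does not actually implement the interface method.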

Answer 4

Answered by khadre

The following code works for me with TreeTagger:

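import static java.util.Arrays.asList;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.annolab.tt4j.TokenHandler;
import org.annolab.tt4j.TreeTaggerException;
import org.annolab.tt4j.TreeTaggerWrapper;

// Method of a class that also holds a `tokenizer` field (not shown here).
public List<String> tag(String str) {
    final List<String> tagLemme = new ArrayList<String>();
    String[] tokens = tokenizer.tokenize(str);
    // Tell tt4j where the TreeTagger installation lives.
    System.setProperty("treetagger.home", "parametresTreeTagger/TreeTagger");
    TreeTaggerWrapper<String> tt = new TreeTaggerWrapper<String>();
    try {
        tt.setModel("parametresTreeTagger/english/english.par");
        tt.setHandler(new TokenHandler<String>() {
            // Public, and the first parameter matches the type argument (String).
            public void token(String token, String pos, String lemma) {
                tagLemme.add(token + "_" + pos + "_" + lemma);
            }
        });
        tt.process(asList(tokens));
    } catch (IOException e) {
        e.printStackTrace();
    } catch (TreeTaggerException e) {
        e.printStackTrace();
    } finally {
        // Always release the underlying TreeTagger process.
        tt.destroy();
    }
    return tagLemme;
}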

Answer 5

Answered by High Performance Mark

The easiest way would probably be to compare each word in the text with a dictionary of nouns. After that you're probably going to have to do some elementary parsing and accept approximate correctness in the results. There are lots of online references on parsing natural languages.
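
A minimal sketch of the dictionary lookup might look like the following. The word list is a tiny hypothetical stand-in for a real noun dictionary, and the class name DictionaryNounLookup is made up for the example. It also shows why the results are only approximately correct: a word such as "flies" is reported as a noun even where it is used as a verb.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only.
public class DictionaryNounLookup {
    // Tiny hand-made noun list; a real dictionary would be loaded from a file.
    private static final Set<String> NOUNS = new HashSet<String>(
            Arrays.asList("time", "flies", "arrow", "fruit", "banana"));

    // Returns every word that appears in the noun list, in text order.
    public static List<String> candidateNouns(String text) {
        List<String> out = new ArrayList<String>();
        // Lower-case the text and split on anything that is not a-z.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (NOUNS.contains(word)) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "flies" is a verb in this sentence, but the lookup cannot tell.
        System.out.println(candidateNouns("Time flies like an arrow."));
        // prints [time, flies, arrow]
    }
}
```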

Answer 6

Answered by torbengee

Find a dictionary web site with an API (e.g. WS, RESTful) that you can run search queries against.

The results should come in an easily consumable format (e.g. XML, JSON) and should of course include the word's lexical category.

Answer 7

Answered by Scharrels

Have a look at the WordNet lexical database. You could try matching each word against it and checking whether it is a noun.

I doubt that you will get 100% precision, though; the database doesn't have a match for every possible word in the English language, but at least it's a start.
