Original source: http://stackoverflow.com/questions/1889675/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
Extract Nouns from Text (Java)
Asked by Phil
Does anyone know the easiest way to extract only nouns from a body of text?
I've heard about the TreeTagger tool and gave it a shot, but couldn't get it to work for some reason.
Any suggestions?
Thanks Phil
EDIT:
import org.annolab.tt4j.*;

TreeTaggerWrapper tt = new TreeTaggerWrapper();
try {
    tt.setModel("/Nouns/english.par");
    tt.setHandler(new TokenHandler() {
        void token(String token, String pos, String lemma) {
            System.out.println(token + "\t" + pos + "\t" + lemma);
        }
    });
    tt.process(words); // words = list of words
} finally {
    tt.destroy();
}
That is my code; the language is English. I was getting the error: The type new TokenHandler(){} must implement the inherited abstract method TokenHandler.token. Am I doing something wrong?
Answered by peter.murray.rust
First you will have to tokenize your text. This may seem trivial (splitting on whitespace may work for you), but formally it is harder. Then you have to decide what a noun is. Does "the car park" contain one noun (car park), two nouns (car, park), or one noun (park) and one adjective (car)? This is a hard problem, but again you may be able to get by without solving it.
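As a rough illustration of the "split at any whitespace" shortcut mentioned above, here is a minimal sketch. The class name SimpleTokenizer and the punctuation-stripping rule are assumptions made for this example; a real tokenizer from an NLP toolkit handles many more cases (abbreviations, hyphenation, multi-word nouns such as "car park").

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of any toolkit named in this thread.
class SimpleTokenizer {

    // Splits on runs of whitespace, then strips leading and trailing punctuation from each piece.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            String word = piece.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I saw the xyzzy, near the car park."));
    }
}
```

Note that this discards punctuation entirely and treats "car" and "park" as separate tokens, which is exactly the kind of decision the answer warns about.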
Does "I saw the xyzzy" identify a noun not in a dictionary? The word "the" probably identifies xyzzy as a noun.
Where are the nouns in "time flies like an arrow"? Compare with "fruit flies like a banana" (thanks to Groucho Marx).
We use the Brown tagger (Java) (http://en.wikipedia.org/wiki/Brown_Corpus) in the OpenNLP toolkit (opennlp.tools.lang.english.PosTagger; opennlp.tools.postag.POSDictionary on http://opennlp.sourceforge.net/) to find nouns in normal English, and I'd recommend starting with that: it does most of your thinking for you. Otherwise, look at any of the POS taggers (http://en.wikipedia.org/wiki/POS_tagger) or (http://www-nlp.stanford.edu/links/statnlp.html#Taggers).
In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English, for example NN for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns (see the POS tags used in the Brown Corpus).
There is a very full list of NLP toolkits at http://en.wikipedia.org/wiki/Natural_language_processing_toolkits. I would strongly suggest you use one of those rather than trying to match against WordNet or other collections.
Answered by teabot
Answered by Maximilian Mayerl
Based on your edit:
The error says that you must override the abstract method token, and you have a definition for token in your anonymous inner class, but maybe the signature of your token-override doesn't match the signature of the abstract method defined in TokenHandler?
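To make the signature point concrete, here is a small sketch that uses a stand-in interface rather than tt4j itself. It assumes the handler interface is generic, which the TokenHandler<String> usage in another answer on this page suggests. Under that assumption, a raw `new TokenHandler() { ... }` leaves the abstract method with an Object first parameter, so a method declared as token(String, String, String) does not override it. Interface methods are also implicitly public, so the override has to be declared public.

```java
import java.util.ArrayList;
import java.util.List;

class HandlerSignatureDemo {

    // Stand-in for a generic callback interface; this is NOT the tt4j class,
    // only an assumed shape used for illustration.
    interface TokenHandler<O> {
        void token(O token, String pos, String lemma);
    }

    static List<String> collect() {
        final List<String> seen = new ArrayList<>();
        // With the type argument supplied, the abstract method is
        // token(String, String, String), and a public method with that
        // signature overrides it.
        TokenHandler<String> handler = new TokenHandler<String>() {
            @Override
            public void token(String token, String pos, String lemma) {
                seen.add(token + "\t" + pos + "\t" + lemma);
            }
        };
        handler.token("cars", "NNS", "car");
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collect());
    }
}
```

Adding @Override is a cheap way to have the compiler report a mismatched signature at the method itself rather than at the class declaration.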
Answered by khadre
The following code works for me with TreeTagger:
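import static java.util.Arrays.asList; // assumed source of asList(...) below

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.annolab.tt4j.TokenHandler;
import org.annolab.tt4j.TreeTaggerException;
import org.annolab.tt4j.TreeTaggerWrapper;

// `tokenizer` is a field of the enclosing class; it is not shown in the original answer.
public List<String> tag(String str) {
    final List<String> tagLemme = new ArrayList<String>();
    String[] tokens = tokenizer.tokenize(str);
    // Point tt4j at the local TreeTagger installation.
    System.setProperty("treetagger.home", "parametresTreeTagger/TreeTagger");
    TreeTaggerWrapper<String> tt = new TreeTaggerWrapper<String>();
    try {
        tt.setModel("parametresTreeTagger/english/english.par");
        tt.setHandler(new TokenHandler<String>() {
            // Called once per token; collects results as token_pos_lemma.
            public void token(String token, String pos, String lemma) {
                tagLemme.add(token + "_" + pos + "_" + lemma);
                //System.out.println(token + "_" + pos + "_" + lemma);
            }
        });
        tt.process(asList(tokens));
    } catch (IOException e) {
        e.printStackTrace();
    } catch (TreeTaggerException e) {
        e.printStackTrace();
    } finally {
        // Always release the underlying tagger process.
        tt.destroy();
    }
    return tagLemme;
}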
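Since the question asks for nouns only, the token_pos_lemma strings returned above still need filtering. The sketch below is one way that might look. It assumes noun tags begin with NN or NP (as with the NN, NNS and NP tags mentioned earlier in this thread), and the class name NounFilter is made up for the example. Check the tagset of the model you actually load, and note that a token containing an underscore would be skipped by this simple split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of tt4j or TreeTagger.
class NounFilter {

    // Assumption: common-noun tags start with "NN" and proper-noun tags with "NP".
    static boolean isNounTag(String pos) {
        return pos.startsWith("NN") || pos.startsWith("NP");
    }

    // Takes entries shaped like "token_pos_lemma" and returns the tokens tagged as nouns.
    static List<String> nouns(List<String> tagged) {
        List<String> result = new ArrayList<>();
        for (String entry : tagged) {
            String[] parts = entry.split("_");
            if (parts.length == 3 && isNounTag(parts[1])) {
                result.add(parts[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tagged = Arrays.asList("The_DT_the", "cars_NNS_car", "park_VVP_park", "London_NP_London");
        System.out.println(nouns(tagged));
    }
}
```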
Answered by High Performance Mark
The easiest way would probably be to compare each word in the text with a dictionary of nouns. After that you're probably going to have to do some elementary parsing and accept approximate correctness in the results. There are lots of online references on parsing natural languages.
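A minimal sketch of the dictionary comparison, assuming you already have a set of known nouns in lower case. The three-word dictionary and the class name DictionaryNounMatcher are placeholders invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only.
class DictionaryNounMatcher {

    // Keeps every word whose lower-cased form appears in the noun dictionary.
    static List<String> nouns(List<String> words, Set<String> nounDictionary) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (nounDictionary.contains(word.toLowerCase(Locale.ROOT))) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder dictionary; a real one would hold many thousands of entries.
        Set<String> dictionary = new HashSet<>(Arrays.asList("time", "fly", "arrow"));
        List<String> words = Arrays.asList("Time", "flies", "like", "an", "arrow");
        System.out.println(nouns(words, dictionary));
    }
}
```

With this input, "flies" is missed because only the singular "fly" is in the dictionary, and a plain lookup cannot tell whether "flies" is a noun or a verb in context. That is the approximate correctness the answer refers to.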
Answered by torbengee
Find a dictionary web site with an API (e.g. WS, RESTful) that you can run search queries against.
The results should come in an easily consumable format (e.g. XML, JSON) and should of course include the word's lexical category.
Answered by Scharrels
Have a look at the WordNet database, which is a lexical database. You could try matching each word against it and check if it's a noun.
I doubt that you will have 100% precision, though; the database doesn't have a match for every possible word in the English language, but at least it's a start.

