Original source: http://stackoverflow.com/questions/5404243/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
extracting English verbs from a given text
Asked by jarandaf
I need to extract all English verbs from a given text and I was wondering how I could do it... At first glance, my idea is to use regular expressions because all English verb tenses follow patterns but maybe there is another way to do it. What I've thought is simply:
- Create a pattern for every verb tense. I have to distinguish between regular verbs (http://en.wikipedia.org/wiki/English_verbs) and irregular verbs (http://www.chompchomp.com/rules/irregularrules01.htm) in some way.
- Iterate over these patterns and split the text using them (the last word of each substring is supposed to be the verb that gives complete meaning to the sentence, which I need for other purposes -> nominalization)
What do you think? I guess this isn't an efficient way to do it but I can't imagine another one.
Thank you in advance!
PS:
- I have two dictionaries, one for all English Verbs and the other one for all English nouns
- The main problem is that the project is about verb nominalization (it's just a uni project), so all the "effort" is supposed to go into that part. Specifically, I follow this model: acl.ldc.upenn.edu/P/P00/P00-1037.pdf. Given a text, the project has to find all the verbs in it and propose multiple nominalizations for each verb. So the first step (finding verbs) should be as simple as possible... but I can't use any parser, it's not allowed.
Answer by dmcer
Part of Speech tagger
Identifying and then extracting all the verbs within a text is very easy using a Part-of-Speech (POS) tagger. Such taggers label all of the words in a text with part-of-speech tags that indicate whether they are verbs, nouns, adjectives, adverbs, etc. Modern POS taggers are very accurate. For example, Toutanova et al. 2003 reports that Stanford's open source POS tagger assigns the correct tag 97.24% of the time on newswire data.
Performing POS tagging
Java: If you're using Java, a good package for POS tagging is the Stanford Log-linear Part-Of-Speech Tagger. Matthew Jockers put together a great tutorial on using this tagger.
Python: If you prefer Python, you can make use of the POS tagger included in the Natural Language Toolkit (nltk). A code snippet demonstrating how to perform POS tagging using this package is given below:
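import nltk

# Requires NLTK's tokenizer and tagger data, fetched once via nltk.download()
# (the exact resource names depend on the NLTK version).
text = "I am very happy to be here today"
tokens = nltk.word_tokenize(text)           # split the text into word tokens
pos_tagged_tokens = nltk.pos_tag(tokens)    # list of (word, tag) tuples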
The resulting POS-tagged tokens will be a list of tuples, where the first entry in each tuple is the tagged word and the second entry is the word's POS tag. For the code snippet above, pos_tagged_tokens will be set to:
[('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('happy', 'JJ'), ('to', 'TO'),
('be', 'VB'), ('here', 'RB'), ('today', 'NN')]
Understanding the Tag Set
Both the Stanford POS tagger and NLTK use the Penn Treebank tag set. If you're just interested in extracting the verbs, pull out all words that have a POS tag that starts with a "V" (e.g., VB, VBD, VBG, VBN, VBP, and VBZ).
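As a minimal sketch of the filtering step described above, the following uses the tagged output from the NLTK example as hard-coded input, so it runs without NLTK installed. The helper name extract_verbs is made up for illustration and is not part of NLTK.

```python
# Tagged tokens copied from the NLTK example output shown earlier.
pos_tagged_tokens = [
    ("I", "PRP"), ("am", "VBP"), ("very", "RB"), ("happy", "JJ"),
    ("to", "TO"), ("be", "VB"), ("here", "RB"), ("today", "NN"),
]


def extract_verbs(tagged_tokens):
    """Return the words whose Penn Treebank tag starts with 'V'.

    Hypothetical helper for illustration; it expects (word, tag) tuples.
    """
    return [word for word, tag in tagged_tokens if tag.startswith("V")]


verbs = extract_verbs(pos_tagged_tokens)
print(verbs)  # ['am', 'be']
```

Note that modal verbs such as "can" or "will" carry the tag MD in the Penn Treebank tag set, so a filter on "V" alone would skip them.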
Answer by Sean Patrick Floyd
Parsing natural language with regex is impossible. Forget it.
As a drastic example: How would you find the verbs (marked with asterisks) in this sentence?
Buffalo buffalo Buffalo buffalo buffalo* buffalo* Buffalo buffalo
While you'll hardly come across extreme cases like this, there are dozens of verbs that could also be nouns, adjectives etc if you just look at the word.
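To make the ambiguity concrete, here is a rough sketch of what a pure dictionary lookup would report for the sentence above. The two small sets are hypothetical stand-ins for the asker's verb and noun dictionaries, and classify is a made-up helper, so treat this as an illustration rather than a real lexicon.

```python
# Hypothetical miniature stand-ins for the asker's two dictionaries.
verb_dictionary = {"buffalo", "run", "train", "be", "become"}
noun_dictionary = {"buffalo", "run", "train", "city", "dictionary"}


def classify(word):
    """Label a word by which dictionaries contain it (case-insensitive)."""
    w = word.lower()
    in_verbs = w in verb_dictionary
    in_nouns = w in noun_dictionary
    if in_verbs and in_nouns:
        return "ambiguous"
    if in_verbs:
        return "verb"
    if in_nouns:
        return "noun"
    return "unknown"


sentence = "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo"
labels = [classify(token) for token in sentence.split()]
print(labels)
```

Every token comes back as "ambiguous", because looking at a word in isolation cannot tell which occurrences act as verbs. That decision needs the surrounding context, which is what a tagger or parser supplies.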
You need a natural language parser like Stanford NLP. I have never used one, so I don't know how good your results are going to be, but better than with Regex, I can tell you that.
Answer by myro
Although this comes a year late, I found a very useful tool from Northwestern University called MorphAdorner.
It handles all kinds of tasks, e.g. lemmatization, language recognition, name recognition, parsing, sentence splitting, etc.
It is convenient and easy to use.
Answer by Noam Weiss
This is actually a very hard task in NLP (Natural Language Processing). Regular expressions on their own won't be enough. Take, for example, the word "training": it can be used as either a verb or a noun ("I'm going to the training session"). Obviously, a regular expression won't be able to tell the difference between the two. There are other problems as well: "-ed" is a common ending for past tense verbs, but it will fail you in the case of "disgusted".
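The point about suffixes can be checked with a small experiment. The pattern below is a deliberately naive assumption (any word ending in "-ing" or "-ed" counts as a verb form), and apart from the "training session" sentence the example sentences are invented for illustration.

```python
import re

# Naive assumption: any word ending in "-ing" or "-ed" is a verb form.
NAIVE_VERB_PATTERN = re.compile(r"\b\w+(?:ing|ed)\b", re.IGNORECASE)


def find_candidates(sentence):
    """Return every word in the sentence that the naive suffix pattern accepts."""
    return NAIVE_VERB_PATTERN.findall(sentence)


sentences = [
    "I am training for a marathon",       # "training" is a verb here
    "I'm going to the training session",  # "training" modifies a noun here
    "He gave me a disgusted look",        # "disgusted" is an adjective here
]

for sentence in sentences:
    print(f"{sentence!r} -> {find_candidates(sentence)}")
```

The pattern returns "training" in both of the first two sentences and "disgusted" in the third, while missing the irregular verb "gave" entirely. It matches spelling, not grammatical role, which is why the suffix alone is not a reliable signal.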
There are some techniques that can give you a good (not perfect, but good) indication of whether a given word is a verb, but they can also be quite expensive computationally.
So the first question you should ask yourself (in my opinion) is what quality of answer you need versus how much processing time you can afford.