Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Original question: http://stackoverflow.com/questions/190775/
Stemming algorithm that produces real words
Asked by Dave
I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities
I've used an implementation of Porter Stemmer algorithm (I'm writing in PHP by the way):
http://tartarus.org/~martin/PorterStemmer/php.txt
This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".
I've tried "Snowball" (suggested within another Stack Overflow thread).
http://snowball.tartarus.org/demo.php
For my example (community / communities), Snowball stems to "communiti".
Question
Are there any other stemming algorithms that will do this? Has anyone else solved this problem?
My current thinking is that I could use a stemming algorithm to avoid duplicates and then pick the shortest word I encounter to be the actual word to display.
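The "stem to group, then show the shortest word" idea could be sketched roughly as follows. This is only an illustration: shortestWordPerStem is a made-up name, and the $toyStem closure is a crude stand-in (not the Porter or Snowball algorithm) used so the example runs on its own. In practice you would pass in whatever stemmer you already use.

```php
<?php
// Hypothetical helper: group words by stem and keep the shortest word seen per stem.
// $stem is any callable mapping a lowercase word to its stem (assumption: you supply it).
function shortestWordPerStem(array $words, callable $stem): array
{
    $display = [];
    foreach ($words as $word) {
        $key = $stem(strtolower($word));
        // Keep the first word for a stem, or replace it if a shorter one turns up.
        if (!isset($display[$key]) || strlen($word) < strlen($display[$key])) {
            $display[$key] = $word;
        }
    }
    return array_values($display);
}

// Toy stand-in stemmer for demonstration only: strips a trailing "ies", "y" or "s".
$toyStem = function (string $word): string {
    return preg_replace('/(ies|y|s)$/', '', $word);
};

$tags = shortestWordPerStem(['Community', 'Communities', 'tags', 'tag'], $toyStem);
echo implode(', ', $tags), "\n"; // Community, tag
```

The shortest form is not always the one you would want to display, which is what the answers below try to address.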
Accepted answer by Dave Sherohman
The core issue here is that stemming algorithms operate on a phonetic basis, purely based on the language's spelling rules, with no actual understanding of the language they're working with. To produce real words, you'll probably have to merge the stemmer's output with some form of lookup function to convert the stems back to real words. I can basically see two potential ways to do this:
1. Locate or create a large dictionary which maps each possible stem back to an actual word. (e.g., communiti -> community)
2. Create a function which compares each stem to a list of the words that were reduced to that stem and attempts to determine which is most similar. (e.g., comparing "communiti" against "community" and "communities" in such a way that "community" will be recognized as the more similar option)
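One possible reading of the second option is to rank the candidate words by edit distance to the stem. The sketch below uses PHP's built-in levenshtein() for that; closestWord is a hypothetical helper name, and edit distance is just one plausible similarity measure, not something the answer prescribes.

```php
<?php
// Hypothetical helper: return the candidate word with the smallest
// Levenshtein (edit) distance to the stem, or null if there are no candidates.
function closestWord(string $stem, array $candidates): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($candidates as $word) {
        $distance = levenshtein($stem, strtolower($word));
        if ($distance < $bestDistance) { // on a tie, the earlier candidate wins
            $best = $word;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// "community" is 1 edit away from "communiti"; "communities" is 2 edits away.
echo closestWord('communiti', ['communities', 'community']), "\n"; // community
```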
Personally, I think the way I would do it would be a dynamic form of #1, building up a custom dictionary database by recording every word examined along with what it stemmed to and then assuming that the most common word is the one that should be used. (e.g., If my body of source text uses "communities" more often than "community", then map communiti -> communities.) A dictionary-based approach will be more accurate in general and building it based on the stemmer input will provide results customized to your texts, with the primary drawback being the space required, which is generally not an issue these days.
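A minimal sketch of that dynamic dictionary, assuming the words are kept in memory rather than in a database: count how often each word form maps to each stem, then map the stem to its most frequent form. buildStemDictionary is an invented name, and $toyStem is again a crude stand-in for a real stemmer, used only so the example is self-contained (it needs PHP 7.3+ for array_key_first).

```php
<?php
// Hypothetical sketch: record every word with the stem it reduces to,
// then map each stem to the word form that occurred most often.
// $stem is any callable mapping a lowercase word to its stem (assumption: you supply it).
function buildStemDictionary(array $words, callable $stem): array
{
    $counts = [];
    foreach ($words as $word) {
        $word = strtolower($word);
        $key = $stem($word);
        $counts[$key][$word] = ($counts[$key][$word] ?? 0) + 1;
    }

    $dictionary = [];
    foreach ($counts as $key => $forms) {
        arsort($forms); // highest count first; tie order is not guaranteed
        $dictionary[$key] = array_key_first($forms);
    }
    return $dictionary;
}

// Toy stand-in stemmer for demonstration only: strips a trailing "ies", "y" or "s".
$toyStem = function (string $word): string {
    return preg_replace('/(ies|y|s)$/', '', $word);
};

$text = ['communities', 'community', 'Communities', 'tag', 'tags', 'tags'];
$dictionary = buildStemDictionary($text, $toyStem);
foreach ($dictionary as $key => $word) {
    echo "$key -> $word\n"; // communit -> communities, then tag -> tags
}
```

Because the counts come from your own texts, the displayed form follows your corpus: here "communities" wins simply because it appears more often than "community".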
Answer by Kaarel
If I understand correctly, then what you need is not a stemmer but a lemmatizer. A lemmatizer is a tool with knowledge about endings like -ies, -ed, etc., and about exceptional wordforms like "written". It maps the input wordform to its lemma, which is guaranteed to be a "real" word.
There are many lemmatizers for English; I've only used morpha, though.
Morpha is just a big lex file which you can compile into an executable.
Usage example:
$ cat test.txt
Community
Communities
$ cat test.txt | ./morpha -uc
Community
Community
You can get morpha from http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html
Answer by Aerodynamika
Hey, I don't know if this is perhaps too late, but there is only one PHP stemming script that produces real words: http://phpmorphy.sourceforge.net/ (it took me ages to find it). All other stemmers have to be compiled, and even after that they only work according to the Porter algorithm, which produces stems, not lemmas (e.g., community -> communiti). phpMorphy works perfectly well, is easy to install and initialize, and has English, Russian, German, Ukrainian and Estonian dictionaries. It also comes with a script that you can use to compile other dictionaries. The documentation is in Russian, but put it through Google Translate and it should be easy to follow.

