Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/4543008/

Efficient Context-Free Grammar parser, preferably Python-friendly

Tags: python, parsing, nlp, grammar, nltk

Asked by Max Shawabkeh

I am in need of parsing a small subset of English for one of my projects, described as a context-free grammar with (1-level) feature structures (example), and I need to do it efficiently.

Right now I'm using NLTK's parser, which produces the right output but is very slow. For my grammar of ~450 fairly ambiguous non-lexicon rules and half a million lexical entries, parsing simple sentences can take anywhere from 2 to 30 seconds, depending, it seems, on the number of resulting trees. Lexical entries have little to no effect on performance.
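
For reference, a minimal sketch of the kind of setup described above, assuming NLTK 3's feature-grammar API; the toy grammar below is made up, standing in for the ~450 rules and large lexicon:

import nltk
from nltk.grammar import FeatureGrammar

# Toy grammar with one-level feature structures (NUM agreement),
# written in the .fcfg notation that NLTK's feature grammars use.
grammar = FeatureGrammar.fromstring(r"""
% start S
S -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n] -> N[NUM=?n]
VP[NUM=?n] -> IV[NUM=?n]
N[NUM=sg] -> 'dog'
N[NUM=pl] -> 'dogs'
IV[NUM=sg] -> 'walks'
IV[NUM=pl] -> 'walk'
""")

parser = nltk.parse.FeatureEarleyChartParser(grammar)
for tree in parser.parse("dogs walk".split()):
    print(tree)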

Another problem is that loading the (25MB) grammar+lexicon at startup can take up to a minute.

From what I can find in the literature, the running time of the algorithms used to parse such a grammar (Earley or CKY) should be linear in the size of the grammar and cubic in the length of the input token list. My experience with NLTK indicates that ambiguity is what hurts performance most, not the absolute size of the grammar.

So now I'm looking for a CFG parser to replace NLTK. I've been considering PLY, but I can't tell whether it supports feature structures in CFGs, which are required in my case, and the examples I've seen seem to be doing a lot of procedural parsing rather than just specifying a grammar. Can anybody show me an example of PLY both supporting feature structs and using a declarative grammar?
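
For comparison, here is a hedged sketch of how such a grammar might look in PLY: the rules themselves are declared in docstrings, but PLY has no built-in feature structures, so agreement checks have to be hand-coded in the rule actions. The toy lexicon below is made up:

import ply.lex as lex
import ply.yacc as yacc

tokens = ('NOUN', 'VERB', 'WORD')

# Toy lexicon mapping each word to a token type and a feature dict.
LEXICON = {
    'dog':   ('NOUN', {'NUM': 'sg'}),
    'dogs':  ('NOUN', {'NUM': 'pl'}),
    'walks': ('VERB', {'NUM': 'sg'}),
    'walk':  ('VERB', {'NUM': 'pl'}),
}

def t_WORD(t):
    r'[a-zA-Z]+'
    t.type, feats = LEXICON[t.value]   # look up the part of speech
    t.value = (t.value, feats)         # carry the features on the token
    return t

t_ignore = ' '

def t_error(t):
    raise SyntaxError('unknown character %r' % t.value[0])

def p_sentence(p):
    'sentence : NOUN VERB'
    (noun, nf), (verb, vf) = p[1], p[2]
    if nf['NUM'] != vf['NUM']:         # manual "unification" of features
        raise SyntaxError('number agreement failure')
    p[0] = ('S', noun, verb)

def p_error(p):
    raise SyntaxError('parse error at %r' % (p,))

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('dogs walk'))       # ('S', 'dogs', 'walk')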

I'm also fine with any other parser that can do what I need efficiently. A Python interface is preferable but not absolutely necessary.

Answered by duffymo

I think ANTLR is the best parser-generator that I know of for Java. I don't know if Jython would provide you a good way for Python and Java to interact.

Answered by Apalala

By all means take a look at Pyparsing. It's the most Pythonic implementation of parsing I've come across, and it's a great design from a purely academic standpoint.

I used both ANTLR and JavaCC to teach translator and compiler theory at a local university. They're both good and mature, but I wouldn't use them in a Python project.

That said, unlike programming languages, natural languages are much more about the semantics than about the syntax, so you could be much better off skipping the learning curves of existing parsing tools, going with a home-brewed (top-down, backtracking, unlimited lookahead) lexical analyzer and parser, and spending the bulk of your time writing the code that figures out what a parsed, but ambiguous, natural-language sentence means.
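
As a rough, made-up illustration of that home-brewed approach: a top-down, backtracking parser over a token list that yields every analysis, leaving disambiguation to the semantic layer:

# Toy grammar: nonterminals map to lists of alternative productions.
GRAMMAR = {
    'S':  [['NP', 'VP']],
    'NP': [['Kim'], ['the', 'N']],
    'N':  [['dog'], ['girl']],
    'VP': [['walks'], ['sees', 'NP']],
}

def parse(symbol, tokens, pos):
    """Yield every end position at which `symbol` can match tokens[pos:]."""
    if symbol not in GRAMMAR:                   # terminal symbol
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:          # backtrack over alternatives
        ends = [pos]
        for rhs in production:
            ends = [e2 for e in ends for e2 in parse(rhs, tokens, e)]
        yield from ends

tokens = 'the dog sees Kim'.split()
print(any(end == len(tokens) for end in parse('S', tokens, 0)))  # True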

Answered by Binary Phile

If it can be expressed as a PEG language (I don't think all CFGs can, but supposedly many can), then you might use pyPEG, which is supposed to be linear-time when using a packrat parsing implementation (although potentially prohibitive in memory usage).
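
Not pyPEG itself (its actual API may differ), but the packrat idea in miniature, as a hedged sketch: memoizing every (rule, position) pair means each is computed at most once, which is what makes the running time linear at the cost of memory proportional to the input length:

from functools import lru_cache

TEXT = "aaab"

@lru_cache(maxsize=None)
def parse_As(pos):
    """A <- 'a' A / 'a' -- returns the end position, or None on failure."""
    if pos < len(TEXT) and TEXT[pos] == 'a':
        end = parse_As(pos + 1)                    # first alternative: 'a' A
        return end if end is not None else pos + 1 # fall back to just 'a'
    return None

print(parse_As(0))   # -> 3: the run of a's spans TEXT[0:3]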

I don't have any experience with it as I am just starting to research parsing and compilation again after a long time away from it, but I am reading some good buzz about this relatively up-to-date technique. YMMV.

Answered by TryPyPy

Try running it on PyPy, it might be a lot faster.

Answered by PaulMcG

I've used pyparsing for limited vocabulary command parsing, but here is a little framework on top of pyparsing that addresses your posted example:

from pyparsing import *

# Forward-declare an expression for each part of speech; the vocabulary is
# filled in later by the make* helpers below.
transVerb, transVerbPlural, transVerbPast, transVerbProg = (Forward() for i in range(4))
intransVerb, intransVerbPlural, intransVerbPast, intransVerbProg = (Forward() for i in range(4))
singNoun, pluralNoun, properNoun = (Forward() for i in range(3))
singArticle, pluralArticle = (Forward() for i in range(2))
verbProg = transVerbProg | intransVerbProg
verbPlural = transVerbPlural | intransVerbPlural

# Seed each Forward with an empty alternation that words get appended to.
for expr in (transVerb, transVerbPlural, transVerbPast, transVerbProg,
            intransVerb, intransVerbPlural, intransVerbPast, intransVerbProg,
            singNoun, pluralNoun, properNoun, singArticle, pluralArticle):
    expr << MatchFirst([])

def appendExpr(e1, s):
    # build a word pattern that accepts an upper- or lowercase first letter
    c1 = s[0]
    e2 = Regex(r"[%s%s]%s\b" % (c1.upper(), c1.lower(), s[1:]))
    e1.expr.exprs.append(e2)

def makeVerb(s, transitive):
    v_pl, v_sg, v_past, v_prog = s.split()
    if transitive:
        appendExpr(transVerb, v_sg)
        appendExpr(transVerbPlural, v_pl)
        appendExpr(transVerbPast, v_past)
        appendExpr(transVerbProg, v_prog)
    else:
        appendExpr(intransVerb, v_sg)
        appendExpr(intransVerbPlural, v_pl)
        appendExpr(intransVerbPast, v_past)
        appendExpr(intransVerbProg, v_prog)

def makeNoun(s, proper=False):
    if proper:
        appendExpr(properNoun, s)
    else:
        n_sg,n_pl = (s.split() + [s+"s"])[:2]
        appendExpr(singNoun, n_sg)
        appendExpr(pluralNoun, n_pl)

def makeArticle(s, plural=False):
    for ss in s.split():
        if not plural:
            appendExpr(singArticle, ss)
        else:
            appendExpr(pluralArticle, ss)

makeVerb("disappear disappears disappeared disappearing", transitive=False)
makeVerb("walk walks walked walking", transitive=False)
makeVerb("see sees saw seeing", transitive=True)
makeVerb("like likes liked liking", transitive=True)

makeNoun("dog")
makeNoun("girl")
makeNoun("car")
makeNoun("child children")
makeNoun("Kim", proper=True)
makeNoun("Jody", proper=True)

makeArticle("a the")
makeArticle("this every")
makeArticle("the these all some several", plural=True)

# Sentence patterns: a subject NP followed by either an intransitive verb or
# a transitive verb plus its object.
transObject = (singArticle + singNoun | properNoun | Optional(pluralArticle) + pluralNoun | verbProg | "to" + verbPlural)
sgSentence = (singArticle + singNoun | properNoun) + (intransVerb | intransVerbPast | (transVerb | transVerbPast) + transObject)
plSentence = (Optional(pluralArticle) + pluralNoun) + (intransVerbPlural | intransVerbPast | (transVerbPlural | transVerbPast) + transObject)

sentence = sgSentence | plSentence


def test(s):
    print(s)
    try:
        print(sentence.parseString(s).asList())
    except ParseException as pe:
        print(pe)

test("Kim likes cars")
test("The girl saw the dog")
test("The dog saw Jody")
test("Kim likes walking")
test("Every girl likes dogs")
test("All dogs like children")
test("Jody likes to walk")
test("Dogs like walking")
test("All dogs like walking")
test("Every child likes Jody")

Prints:

Kim likes cars
['Kim', 'likes', 'cars']
The girl saw the dog
['The', 'girl', 'saw', 'the', 'dog']
The dog saw Jody
['The', 'dog', 'saw', 'Jody']
Kim likes walking
['Kim', 'likes', 'walking']
Every girl likes dogs
['Every', 'girl', 'likes', 'dogs']
All dogs like children
['All', 'dogs', 'like', 'children']
Jody likes to walk
['Jody', 'likes', 'to', 'walk']
Dogs like walking
['Dogs', 'like', 'walking']
All dogs like walking
['All', 'dogs', 'like', 'walking']
Every child likes Jody
['Every', 'child', 'likes', 'Jody']

This is likely to get slow as you expand the vocabulary. Half a million entries? I thought that a reasonable functional vocabulary was on the order of 5-6 thousand words. And you will be pretty limited in the sentence structures that you can handle - natural language is what NLTK is for.

Answered by Apalala

Tooling aside...

You may remember from theory that there are infinitely many grammars that define the same language. There are criteria for categorizing grammars and determining which is the "canonical" or "minimal" one for a given language, but in the end, the "best" grammar is the one that's most convenient for the task and tools at hand (remember the transformations of CFGs into LL and LR grammars?).

Then, you probably don't need a huge lexicon to parse a sentence in English. There's a lot to be known about a word in languages like German or Latin (or even Spanish), but not in the often arbitrary and ambiguous English. You should be able to get away with a small lexicon that contains only the key words necessary to arrive at the structure of a sentence. At any rate, the grammar you choose, no matter its size, can be cached in a way that the tooling can directly use it (i.e., you can skip parsing the grammar).
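
A hedged sketch of that caching idea against the asker's load-time problem, assuming the grammar object (NLTK's FeatureGrammar here; the file names are hypothetical) pickles cleanly:

import os
import pickle

from nltk.grammar import FeatureGrammar

GRAMMAR_SRC = 'english.fcfg'            # hypothetical file names
GRAMMAR_CACHE = 'english.fcfg.pickle'

def load_grammar():
    # later runs deserialize the cached object and skip the slow text parse
    if os.path.exists(GRAMMAR_CACHE):
        with open(GRAMMAR_CACHE, 'rb') as f:
            return pickle.load(f)
    with open(GRAMMAR_SRC) as f:
        grammar = FeatureGrammar.fromstring(f.read())
    with open(GRAMMAR_CACHE, 'wb') as f:
        pickle.dump(grammar, f, pickle.HIGHEST_PROTOCOL)
    return grammar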

Given that, it could be a good idea to take a look at a simpler parser already worked on by someone else. There must be thousands of those in the literature. Studying different approaches will let you evaluate your own, and may lead you to adopt someone else's.

Finally, as I already mentioned, interpreting natural languages is much more about artificial intelligence than about parsing. Because structure determines meaning and meaning determines structure, you have to play with both at the same time. An approach I've seen in the literature since the '80s is to let different specialized agents take shots at solving the problem against a "blackboard". With that approach, syntactic and semantic analysis happen concurrently.
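
A minimal sketch of that blackboard pattern (the agents and facts are stand-ins, not a real analyzer): knowledge sources repeatedly inspect a shared store and post new facts until a pass makes no progress:

def syntax_agent(board):
    # posts a (stand-in) parse once tokens are available
    if 'tokens' in board and 'parse' not in board:
        board['parse'] = ('S', board['tokens'])
        return True
    return False

def semantics_agent(board):
    # posts a (stand-in) meaning once a parse is available
    if 'parse' in board and 'meaning' not in board:
        board['meaning'] = 'LIKES(kim, cars)'
        return True
    return False

board = {'tokens': ['Kim', 'likes', 'cars']}
agents = [syntax_agent, semantics_agent]
while any(agent(board) for agent in agents):
    pass   # keep cycling until no agent can contribute anything new
print(board['meaning'])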

Answered by Jerry

Somewhat late on this, but here are two more options for you:

Spark is an Earley parser written in Python.

Elkhound is a GLR parser written in C++. Elkhound uses a Bison-like syntax.

Answered by Andreas

I would recommend using bitpar, a very efficient PCFG parser written in C++. I've written a shell-based Python wrapper for it, see https://github.com/andreasvc/eodop/blob/master/bitpar.py
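
Not the linked wrapper itself, just an illustration of the shell-based approach: feed the sentence to the external binary over a pipe and read trees back. The command line and the one-word-per-line input convention below are assumptions for the sketch, not bitpar's documented CLI:

import subprocess

def parse_with_bitpar(words, cmd=('bitpar', 'grammar', 'lexicon')):
    # one word per line, blank line to end the sentence (assumed convention)
    sentence = '\n'.join(words) + '\n\n'
    proc = subprocess.run(cmd, input=sentence,
                          capture_output=True, text=True, check=True)
    return proc.stdout   # raw tree output; a real wrapper would parse this

print(parse_with_bitpar(['Kim', 'likes', 'cars']))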
