Python - lexical analysis and tokenization

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/2358890/


Tags: python, transform, lexical-analysis

Asked by Philip Reynolds

I'm looking to speed along my discovery process here quite a bit, as this is my first venture into the world of lexical analysis. Maybe this is even the wrong path. First, I'll describe my problem:


I've got very large properties files (in the order of 1,000 properties), which when distilled, are really just about 15 important properties and the rest can be generated or rarely ever change.


So, for example:


general {
  name = myname
  ip = 127.0.0.1
}

component1 {
   key = value
   foo = bar
}

This is the type of format I want to create to tokenize something like:


property.${general.name}blah.home.directory = /blah
property.${general.name}.ip = ${general.ip}
property.${component1}.ip = ${general.ip}
property.${component1}.foo = ${component1.foo}

into


property.mynameblah.home.directory = /blah
property.myname.ip = 127.0.0.1
property.component1.ip = 127.0.0.1
property.component1.foo = bar

Lexical analysis and tokenization sounds like my best route, but this is a very simple form of it. It's a simple grammar, a simple substitution, and I'd like to make sure that I'm not bringing a sledgehammer to knock in a nail.

I could create my own lexer and tokenizer, or use ANTLR, but I don't like re-inventing the wheel, and ANTLR sounds like overkill.

I'm not familiar with compiler techniques, so pointers in the right direction & code would be most appreciated.


Note: I can change the input format.


Accepted answer by Matt Anderson

There's an excellent article on Using Regular Expressions for Lexical Analysis at effbot.org.

Adapting the tokenizer to your problem:


import re

token_pattern = r"""
(?P<identifier>[a-zA-Z_][a-zA-Z0-9_]*)
|(?P<integer>[0-9]+)
|(?P<dot>\.)
|(?P<open_variable>[$][{])
|(?P<open_curly>[{])
|(?P<close_curly>[}])
|(?P<newline>\n)
|(?P<whitespace>\s+)
|(?P<equals>[=])
|(?P<slash>[/])
"""

token_re = re.compile(token_pattern, re.VERBOSE)

class TokenizerException(Exception): pass

def tokenize(text):
    pos = 0
    while True:
        # Try to match one token at the current position.
        m = token_re.match(text, pos)
        if not m: break
        pos = m.end()
        # lastgroup names the (?P<...>) group that matched.
        tokname = m.lastgroup
        tokvalue = m.group(tokname)
        yield tokname, tokvalue
    if pos != len(text):
        raise TokenizerException('tokenizer stopped at pos %r of %r' % (
            pos, len(text)))

To test it, we do:


stuff = r'property.${general.name}.ip = ${general.ip}'
stuff2 = r'''
general {
  name = myname
  ip = 127.0.0.1
}
'''

print(' stuff '.center(60, '='))
for tok in tokenize(stuff):
    print(tok)

print(' stuff2 '.center(60, '='))
for tok in tokenize(stuff2):
    print(tok)

which prints:

========================== stuff ===========================
('identifier', 'property')
('dot', '.')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'name')
('close_curly', '}')
('dot', '.')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'ip')
('close_curly', '}')
========================== stuff2 ==========================
('newline', '\n')
('identifier', 'general')
('whitespace', ' ')
('open_curly', '{')
('newline', '\n')
('whitespace', '  ')
('identifier', 'name')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('identifier', 'myname')
('newline', '\n')
('whitespace', '  ')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('integer', '127')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '1')
('newline', '\n')
('close_curly', '}')
('newline', '\n')
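
A possible next step, sketched here as an assumption rather than part of the original answer: consume the token stream, buffering the name between open_variable and close_curly tokens and looking it up in a table of known values:

def expand(text, values):
    # Rebuild the text, replacing each ${dotted.name} reference with
    # its value. `values` maps dotted names like 'general.ip' to strings.
    out, name = [], None
    for tokname, tokvalue in tokenize(text):
        if tokname == 'open_variable':
            name = []                      # start collecting a reference
        elif name is None:
            out.append(tokvalue)           # ordinary text, copy through
        elif tokname == 'close_curly':
            out.append(values[''.join(name)])
            name = None                    # reference complete
        else:
            name.append(tokvalue)          # identifier/dot inside ${...}
    return ''.join(out)

print(expand(r'property.${general.name}.ip = ${general.ip}',
             {'general.name': 'myname', 'general.ip': '127.0.0.1'}))
# property.myname.ip = 127.0.0.1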

Answer by Kaleb Pederson

A simple DFA works well for this. You only need a few states:

  1. Looking for ${
  2. Seen ${, looking for at least one valid character forming the name
  3. Seen at least one valid name character, looking for more name characters or }
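
A minimal sketch of that scanner, assuming names may contain letters, digits, dots, and underscores:

def find_variables(text):
    # Three-state scan: 0 = looking for "${"; 1 = just saw "${",
    # need at least one name character; 2 = inside a name, looking
    # for more name characters or "}".
    state, name, i = 0, '', 0
    while i < len(text):
        ch = text[i]
        if state == 0:
            if text[i:i + 2] == '${':
                state, i = 1, i + 1  # also skip the "{"
        elif state == 1:
            if ch.isalnum() or ch in '._':
                state, name = 2, ch
            else:
                state = 0  # "${" not followed by a valid name
        else:  # state == 2
            if ch == '}':
                yield name
                state = 0
            elif ch.isalnum() or ch in '._':
                name += ch
            else:
                state = 0  # malformed reference, abandon it
        i += 1

print(list(find_variables('property.${general.name}.ip = ${general.ip}')))
# ['general.name', 'general.ip']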

If the properties file is order agnostic, you might want a two pass processor to verify that each name resolves correctly.


Of course, you then need to write the substitution code, but once you have a list of all the names used, the simplest possible implementation is a find/replace on ${name} with its corresponding value.

Answer by zdav

For a format as simple as yours, a full-on parser/lexer would be way overkill. A combination of regexes and string manipulation should do the trick.
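
A minimal sketch of that approach; the props dict and its contents are made up for illustration:

import re

props = {
    'general.name': 'myname',
    'general.ip': '127.0.0.1',
    'component1.foo': 'bar',
}

def substitute(line, props):
    # Replace each ${...} reference with its value; leave unknown
    # names untouched so they are easy to spot in the output.
    return re.sub(r'\$\{([^}]+)\}',
                  lambda m: props.get(m.group(1), m.group(0)),
                  line)

print(substitute('property.${general.name}.ip = ${general.ip}', props))
# property.myname.ip = 127.0.0.1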

Another idea is to change the file to something like json or xml and use an existing package.


Answer by Dmitry Kochkin

The syntax you provide seems similar to the Mako templates engine. I think you could give it a try; it has a rather simple API.
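
A minimal sketch, assuming the known properties can be exposed as Python objects (the SimpleNamespace here is just for illustration):

from types import SimpleNamespace
from mako.template import Template

general = SimpleNamespace(name='myname', ip='127.0.0.1')
template = Template('property.${general.name}.ip = ${general.ip}')
print(template.render(general=general))
# property.myname.ip = 127.0.0.1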

Answer by danben

If you can change the format of the input files, then you could use a parser for an existing format, such as JSON.


However, from your problem statement it sounds like that isn't the case. So if you want to create a custom lexer and parser, use PLY (Python Lex/Yacc). It is easy to use and works the same as lex/yacc.

Here is a link to an example of a calculator built using PLY. Note that everything starting with t_ is a lexer rule - defining a valid token - and everything starting with p_ is a parser rule that defines a production of the grammar.
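
A rough sketch of the lexer half in PLY, mirroring the token set from the accepted answer (the token names and patterns are assumptions, not taken from the linked example):

import ply.lex as lex

tokens = ('IDENTIFIER', 'INTEGER', 'DOT', 'OPEN_VARIABLE',
          'OPEN_CURLY', 'CLOSE_CURLY', 'EQUALS', 'SLASH')

# Simple tokens as raw regex strings; PLY tries longer patterns first.
t_OPEN_VARIABLE = r'\$\{'
t_OPEN_CURLY = r'\{'
t_CLOSE_CURLY = r'\}'
t_EQUALS = r'='
t_SLASH = r'/'
t_DOT = r'\.'
t_IDENTIFIER = r'[a-zA-Z_][a-zA-Z0-9_]*'
t_INTEGER = r'[0-9]+'
t_ignore = ' \t\n'  # skip whitespace between tokens

def t_error(t):
    raise SyntaxError('illegal character %r' % t.value[0])

lexer = lex.lex()
lexer.input('property.${general.name}.ip = ${general.ip}')
for tok in lexer:
    print(tok.type, tok.value)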