Pythonic way to implement a tokenizer in Python

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/691148/

Pythonic way to implement a tokenizer

python, coding-style, tokenize

Asked by Peter

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice?

I've implemented a tokenizer before in C and in Java, so I'm fine with the theory; I'd just like to ensure I'm following Pythonic style and best practices.

Listing Token Types:

In Java, for example, I would have a list of fields like so:

public static final int TOKEN_INTEGER = 0;

But, obviously, there's no way (I think) to declare a constant variable in Python, so I could just replace this with normal variable declarations but that doesn't strike me as a great solution since the declarations could be altered.
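
One way to get something closer to true constants, assuming Python 3.4 or later, is the standard library's enum module; a minimal sketch (the TokenType name is just for illustration):

from enum import Enum

class TokenType(Enum):
    INTEGER = 0
    STRING = 1
    IDENTIFIER = 2

# Enum members cannot be rebound by accident:
# TokenType.INTEGER = 5  raises AttributeError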

Returning Tokens From The Tokenizer:

Is there a better alternative to simply returning a list of tuples, e.g.

[ (TOKEN_INTEGER, 17), (TOKEN_STRING, "Sixteen")]?

Cheers,

Pete

Answered by AKX

There's an undocumented class in the re module called re.Scanner. It's very straightforward to use for a tokenizer:

import re

scanner = re.Scanner([
    (r"[0-9]+",       lambda scanner, token: ("INTEGER", token)),
    (r"[a-z_]+",      lambda scanner, token: ("IDENTIFIER", token)),
    (r"[,.]+",        lambda scanner, token: ("PUNCTUATION", token)),
    (r"\s+", None),   # None == skip token.
])

results, remainder = scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print(results)

will result in

[('INTEGER', '45'),
 ('IDENTIFIER', 'pigeons'),
 ('PUNCTUATION', ','),
 ('INTEGER', '23'),
 ('IDENTIFIER', 'cows'),
 ('PUNCTUATION', ','),
 ('INTEGER', '11'),
 ('IDENTIFIER', 'spiders'),
 ('PUNCTUATION', '.')]

I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple hundred lines.

Answered by RossFabricant

Python takes a "we're all consenting adults" approach to information hiding. It's OK to use variables as though they were constants, and trust that users of your code won't do something stupid.
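
In concrete terms that usually means module-level ALL_CAPS names that are treated as read-only purely by convention; a minimal sketch:

# Nothing enforces immutability here; the naming convention alone
# signals that these values should not be reassigned.
TOKEN_INTEGER = 0
TOKEN_STRING = 1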

Answered by Ber

In many situations, especially when parsing long input streams, you may find it more useful to implement your tokenizer as a generator function. This way you can easily iterate over all the tokens without needing lots of memory to build the list of tokens first.

For generators, see the original proposal or other online docs.
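
As a rough sketch of what a generator-based tokenizer can look like (the token names and patterns below are invented for illustration):

import re

# Hypothetical token patterns, tried in order at the current position.
TOKEN_PATTERNS = [
    ("INTEGER", re.compile(r"[0-9]+")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("SKIP", re.compile(r"\s+")),
]

def tokenize_stream(text):
    """Yield (token_type, value) pairs lazily, one token at a time."""
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match:
                if name != "SKIP":
                    yield (name, match.group())
                pos = match.end()
                break
        else:
            raise ValueError("Unexpected character at position %d" % pos)

# The caller iterates lazily; no full token list is built up front.
for token in tokenize_stream("45 pigeons and 23 cows"):
    print(token)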

Answered by Peter

Thanks for your help, I've started to bring these ideas together, and I've come up with the following. Is there anything terribly wrong with this implementation (particularly I'm concerned about passing a file object to the tokenizer):

class Tokenizer(object):

    def __init__(self, file):
        self.file = file

    def __get_next_character(self):
        return self.file.read(1)

    def __peek_next_character(self):
        character = self.file.read(1)
        self.file.seek(self.file.tell() - 1, 0)
        return character

    def __read_number(self):
        value = ""
        while self.__peek_next_character().isdigit():
            value += self.__get_next_character()
        return value

    def next_token(self):
        character = self.__peek_next_character()

        if character.isdigit():
            return self.__read_number()
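
One way to ease the file-object concern is to buffer a single look-ahead character instead of seeking backwards, so the tokenizer also works on non-seekable streams; a minimal sketch of that variation (names are hypothetical):

class BufferedTokenizer(object):
    """Sketch: peek via a one-character buffer instead of seek()."""

    def __init__(self, stream):
        self.stream = stream
        self._peeked = None  # one character of look-ahead, or None

    def _get_next_character(self):
        if self._peeked is not None:
            character, self._peeked = self._peeked, None
            return character
        return self.stream.read(1)

    def _peek_next_character(self):
        if self._peeked is None:
            self._peeked = self.stream.read(1)
        return self._peeked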

Answered by S.Lott

"Is there a better alternative to just simply returning a list of tuples?"

Nope. It works really well.

Answered by Cybis

"Is there a better alternative to just simply returning a list of tuples?"

That's the approach used by the "tokenize" module for parsing Python source code. Returning a simple list of tuples can work very well.
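
For reference, a minimal sketch of pulling tokens out of the standard library's tokenize module in Python 3, which also hands back one record per token:

import io
import tokenize

source = "x = 17 + 25"
# generate_tokens() takes a readline callable over text and yields
# TokenInfo records with fields (type, string, start, end, line).
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tok.type, tok.string)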

Answered by Giulio Piancastelli

I have recently built a tokenizer, too, and went through some of the same issues.

Token types are declared as "constants", i.e. variables with ALL_CAPS names, at the module level. For example,

_INTEGER = 0x0007
_FLOAT = 0x0008
_VARIABLE = 0x0009

and so on. I have used an underscore in front of the name to point out that those fields are somehow "private" to the module, but I really don't know whether this is typical or advisable, or even how Pythonic it is. (Also, I'll probably ditch numbers in favour of strings, because during debugging they are much more readable.)

Tokens are returned as named tuples.

from collections import namedtuple
Token = namedtuple('Token', ['value', 'type'])
# so that e.g. somewhere in a function/method I can write...
t = Token(n, _INTEGER)
# ...and return it properly

I have used named tuples because the tokenizer's client code (e.g. the parser) seems a little clearer when using names (e.g. token.value) instead of indexes (e.g. token[0]).

Finally, I've noticed that sometimes, especially writing tests, I prefer to pass a string to the tokenizer instead of a file object. I call it a "reader", and have a specific method to open it and let the tokenizer access it through the same interface.

def open_reader(self, source):
    """
    Produces a file object from source.
    The source can be either a file object already, or a string.
    """
    if hasattr(source, 'read'):
        return source
    else:
        from io import StringIO
        return StringIO(source)

Answered by Ber

When I start something new in Python, I usually look first at some modules or libraries to use. There's a 90%+ chance that there is already something available.

For tokenizers and parsers this is certainly so. Have you looked at PyParsing?
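
For a taste of what that looks like, a minimal sketch with pyparsing (the grammar here is invented for illustration):

from pyparsing import Word, alphas, nums

integer = Word(nums)
identifier = Word(alphas + "_")
token = integer | identifier

# scanString yields (tokens, start, end) for each match in the input.
for tokens, start, end in token.scanString("45 pigeons, 23 cows"):
    print(tokens[0], start, end)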

Answered by ThomasH

I've implemented a tokenizer for a C-like programming language. What I did was to split up the creation of tokens into two layers:

  • a surface scanner: This one actually reads the text and uses regular expressions to split it up into only the most primitive tokens (operators, identifiers, numbers, ...); it yields tuples (tokenname, scannedstring, startpos, endpos). A rough sketch of this layer appears at the end of this answer.
  • a tokenizer: This consumes the tuples from the first layer, turning them into token objects (named tuples would do as well, I think). Its purpose is to detect some long-range dependencies in the token stream, particularly strings (with their opening and closing quotes) and comments (with their opening and closing lexemes; yes, I wanted to retain comments!) and coerce them into single tokens. The resulting stream of token objects is then returned to a consuming parser.

Both are generators. The benefits of this approach were:

  • Reading of the raw text is done only in the most primitive way, with simple regexps - fast and clean.
  • The second layer is already implemented as a primitive parser, to detect string literals and comments - re-use of parser technology.
  • You don't have to strain the surface scanner with complex detections.
  • But the real parser gets tokens on the semantic level of the language to be parsed (again strings, comments).

I feel quite happy with this layered approach.
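
A rough sketch of what the surface-scanner layer can look like (token names and patterns are invented for illustration):

import re

# Named groups double as token names; SKIP matches are dropped.
_SCANNER_RE = re.compile(r"""
    (?P<NUMBER>     [0-9]+)
  | (?P<IDENTIFIER> [A-Za-z_][A-Za-z0-9_]*)
  | (?P<OPERATOR>   [+\-*/=])
  | (?P<SKIP>       \s+)
""", re.VERBOSE)

def surface_scan(text):
    """Yield (tokenname, scannedstring, startpos, endpos) tuples."""
    for match in _SCANNER_RE.finditer(text):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group(), match.start(), match.end())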

Answered by Dexygen

I'd turn to the excellent Text Processing in Python by David Mertz.
