How do we translate this regular expression idiom from Perl to Python?

Date: 2020-03-06 14:36:41  Source: igfitidea

I switched from Perl to Python about a year ago and haven't looked back. I have found only one idiom that is easier to write in Perl than in Python:

if ($var =~ /foo(.+)/) {
  # do something with $1
} elsif ($var =~ /bar(.+)/) {
  # do something with $1
} elsif ($var =~ /baz(.+)/) {
  # do something with $1
}

The corresponding Python code is not as elegant, because the if statements keep getting nested:

m = re.search(r'foo(.+)', var)
if m:
  # do something with m.group(1)
else:
  m = re.search(r'bar(.+)', var)
  if m:
    # do something with m.group(1)
  else:
    m = re.search(r'baz(.+)', var)
    if m:
      # do something with m.group(1)

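Does anyone have an elegant way to reproduce this pattern in Python? I've seen dispatch tables of anonymous functions used, but those seem a bit unwieldy to me for a small number of regular expressions...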

Solutions

Using named groups and a dispatch table:

r = re.compile(r'(?P<cmd>foo|bar|baz)(?P<data>.+)')

def do_foo(data):
    ...

def do_bar(data):
    ...

def do_baz(data):
    ...

dispatch = {
    'foo': do_foo,
    'bar': do_bar,
    'baz': do_baz,
}

m = r.match(var)
if m:
    dispatch[m.group('cmd')](m.group('data'))

With a little introspection, you can auto-generate both the regexp and the dispatch table.
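As a rough sketch of that introspection idea, assuming the handlers follow the do_<name> naming convention used above and are collected in one namespace, the regexp and the dispatch table could be generated as follows. The `Handlers` class and the `build_dispatcher` and `run` helpers are hypothetical names made up for this illustration, not part of the original answer:

```python
import re


class Handlers:
    """Hypothetical container: every method named do_<cmd> becomes a command."""

    @staticmethod
    def do_foo(data):
        return "foo got " + data

    @staticmethod
    def do_bar(data):
        return "bar got " + data

    @staticmethod
    def do_baz(data):
        return "baz got " + data


def build_dispatcher(namespace, prefix="do_"):
    # Collect every callable attribute whose name starts with the prefix.
    dispatch = {
        name[len(prefix):]: getattr(namespace, name)
        for name in dir(namespace)
        if name.startswith(prefix) and callable(getattr(namespace, name))
    }
    # Longest commands first, so a command that is a prefix of another one
    # does not shadow it; re.escape guards against special characters.
    alternatives = "|".join(
        re.escape(cmd) for cmd in sorted(dispatch, key=len, reverse=True)
    )
    regex = re.compile(r"(?P<cmd>%s)(?P<data>.+)" % alternatives)
    return regex, dispatch


def run(var, regex, dispatch):
    # Same lookup as the hand-written version: match, then dispatch on 'cmd'.
    m = regex.match(var)
    if m:
        return dispatch[m.group("cmd")](m.group("data"))
    return None
```

Adding a new command then only requires adding a do_<name> method; the pattern and the table follow automatically.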

Or, something that doesn't use regular expressions at all:

prefix, data = var[:3], var[3:]
if prefix == 'foo':
    # do something with data
elif prefix == 'bar':
    # do something with data
elif prefix == 'baz':
    # do something with data
else:
    # do something with var

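Whether that is suitable depends on your actual problem. Don't forget that regular expressions are not the Swiss Army knife in Python that they are in Perl; Python has different constructs for string manipulation.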
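As one illustration of those other constructs, here is a minimal sketch that uses str.startswith with a prefix table instead of slicing a fixed three characters, so prefixes of different lengths would also work. The handler functions and the `dispatch_by_prefix` helper are placeholder names invented for this example:

```python
def handle_foo(data):
    return ("foo", data)


def handle_bar(data):
    return ("bar", data)


def handle_baz(data):
    return ("baz", data)


# Hypothetical prefix table; keys may have different lengths.
PREFIX_HANDLERS = {
    "foo": handle_foo,
    "bar": handle_bar,
    "baz": handle_baz,
}


def dispatch_by_prefix(var, handlers=PREFIX_HANDLERS):
    for prefix, handler in handlers.items():
        # Mirror the regex foo(.+): require at least one character of data
        # after the prefix.
        if var.startswith(prefix) and len(var) > len(prefix):
            return handler(var[len(prefix):])
    return None  # nothing matched; the caller decides what to do with var
```

Unlike re.search in the question, this only looks at the start of the string, which is the same assumption the slicing version makes.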

def find_first_match(string, *regexes):
    for regex, handler in regexes:
        m = re.search(regex, string)
        if m:
            handler(m)
            return
    else:
        raise ValueError

find_first_match(
    foo, 
    (r'foo(.+)', handle_foo), 
    (r'bar(.+)', handle_bar), 
    (r'baz(.+)', handle_baz))

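To speed this up, all the regexes could be merged into a single one internally, with the dispatcher built on the fly. Ideally, this would then be turned into a class.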
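A possible shape for such a class is sketched below. It is an assumption-laden illustration, not the answer author's code: the name `RegexDispatcher` is invented, handlers receive the captured groups as positional arguments rather than the match object used by find_first_match, and the individual patterns must not contain named groups or numeric backreferences, because wrapping them shifts the group numbers.

```python
import re


class RegexDispatcher:
    """Hypothetical class: merges (pattern, handler) pairs into one regex."""

    def __init__(self, *rules):
        self._rules = {}
        parts = []
        index = 1  # group number the next wrapper group will receive
        for n, (pattern, handler) in enumerate(rules):
            name = "rule%d" % n
            inner = re.compile(pattern).groups  # capture groups in pattern
            parts.append("(?P<%s>%s)" % (name, pattern))
            # Remember which group numbers belong to this rule's own groups.
            self._rules[name] = (handler, index + 1, index + 1 + inner)
            index += 1 + inner
        self._regex = re.compile("|".join(parts))

    def dispatch(self, text):
        m = self._regex.search(text)
        if m is None:
            raise ValueError("no pattern matched %r" % (text,))
        for name, (handler, first, stop) in self._rules.items():
            # Only the wrapper group of the alternative that matched is set.
            if m.group(name) is not None:
                groups = [m.group(i) for i in range(first, stop)]
                return handler(*groups)
```

One behavioural difference is worth keeping in mind: a merged alternation returns the leftmost match in the text, whereas trying the regexes one after another prefers the first pattern in the list wherever it matches. For prefix-style patterns such as those in the question, the two usually agree.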

Yeah, it's kind of annoying. Perhaps this will work for your case.

import re

class ReCheck(object):
    def __init__(self):
        self.result = None
    def check(self, pattern, text):
        self.result = re.search(pattern, text)
        return self.result

var = 'bar stuff'
m = ReCheck()
if m.check(r'foo(.+)', var):
    print(m.result.group(1))
elif m.check(r'bar(.+)', var):
    print(m.result.group(1))
elif m.check(r'baz(.+)', var):
    print(m.result.group(1))

EDIT: Brian correctly pointed out that my first attempt did not work. Unfortunately, this attempt is longer.

I'd suggest this, as it uses the fewest regexes to accomplish your goal. It is still functional code, but no worse than your old Perl.

import re
var = "barbazfoo"

# group(1) is the keyword that matched; group(2) is the data after it
m = re.search(r'(foo|bar|baz)(.+)', var)
if m is None:
    pass  # none of the keywords matched
elif m.group(1) == 'foo':
    print(m.group(2))
    # do something with m.group(2)
elif m.group(1) == "bar":
    print(m.group(2))
    # do something with m.group(2)
elif m.group(1) == "baz":
    print(m.group(2))
    # do something with m.group(2)

r"""
This is an extension of the re module. It stores the last successful
match object and lets you access its methods and attributes via
this module.

This module exports the following additional functions:
    expand  Return the string obtained by doing backslash substitution on a
            template string.
    group   Returns one or more subgroups of the match.
    groups  Return a tuple containing all the subgroups of the match.
    start   Return the indices of the start of the substring matched by
            group.
    end     Return the indices of the end of the substring matched by group.
    span    Returns a 2-tuple of (start(), end()) of the substring matched
            by group.

This module defines the following additional public attributes:
    pos         The value of pos which was passed to the search() or match()
                method.
    endpos      The value of endpos which was passed to the search() or
                match() method.
    lastindex   The integer index of the last matched capturing group.
    lastgroup   The name of the last matched capturing group.
    re          The regular expression object which was passed to search() or
                match().
    string      The string passed to match() or search().
"""

import re as re_

from re import *
from functools import wraps

__all__ = re_.__all__ + [ "expand", "group", "groups", "groupdict", "start",
        "end", "span", "last_match", "pos", "endpos", "lastindex", "lastgroup",
        "re", "string" ]

last_match = pos = endpos = lastindex = lastgroup = re = string = None

def _set_match(match=None):
    global last_match, pos, endpos, lastindex, lastgroup, re, string
    if match is not None:
        last_match = match
        pos = match.pos
        endpos = match.endpos
        lastindex = match.lastindex
        lastgroup = match.lastgroup
        re = match.re
        string = match.string
    return match

@wraps(re_.match)
def match(pattern, string, flags=0):
    return _set_match(re_.match(pattern, string, flags))

@wraps(re_.search)
def search(pattern, string, flags=0):
    return _set_match(re_.search(pattern, string, flags))

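@wraps(re_.findall)
def findall(pattern, string, flags=0):
    # findall() returns strings or tuples rather than match objects, so the
    # last match is recorded by walking finditer() over the same input.
    for found in re_.finditer(pattern, string, flags):
        _set_match(found)
    return re_.findall(pattern, string, flags)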

@wraps(re_.finditer)
def finditer(pattern, string, flags=0):
    for match in re_.finditer(pattern, string, flags):
        yield _set_match(match)

def expand(template):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.expand(template)

def group(*indices):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.group(*indices)

def groups(default=None):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.groups(default)

def groupdict(default=None):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.groupdict(default)

def start(group=0):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.start(group)

def end(group=0):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.end(group)

def span(group=0):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.span(group)

del wraps  # Not needed past module compilation

For example (the calls below assume the module above has been imported as gre):

if gre.match("foo(.+)", var):
  # do something with gre.group(1)
elif gre.match("bar(.+)", var):
  # do something with gre.group(1)
elif gre.match("baz(.+)", var):
  # do something with gre.group(1)

Thanks to this other SO question:

import re

class DataHolder:
    def __init__(self, value=None, attr_name='value'):
        self._attr_name = attr_name
        self.set(value)
    def __call__(self, value):
        return self.set(value)
    def set(self, value):
        setattr(self, self._attr_name, value)
        return value
    def get(self):
        return getattr(self, self._attr_name)

string = 'test bar 123'
save_match = DataHolder(attr_name='match')
if save_match(re.search(r'foo (\d+)', string)):
    print("Foo")
    print(save_match.match.group(1))
elif save_match(re.search(r'bar (\d+)', string)):
    print("Bar")
    print(save_match.match.group(1))
elif save_match(re.search(r'baz (\d+)', string)):
    print("Baz")
    print(save_match.match.group(1))
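On Python 3.8 and later, assignment expressions (the := operator from PEP 572) let a plain name play the role of the holder object. This feature postdates the answers collected here, so treat the following as an alternative sketch rather than something the answer's author proposed; `classify` is a made-up wrapper that only exists so the snippet can be run and checked:

```python
import re


def classify(text):
    # Each branch binds m and tests it in the same expression, which is
    # what the Perl if/elsif chain does implicitly.
    if (m := re.search(r'foo (\d+)', text)):
        return "Foo", m.group(1)
    elif (m := re.search(r'bar (\d+)', text)):
        return "Bar", m.group(1)
    elif (m := re.search(r'baz (\d+)', text)):
        return "Baz", m.group(1)
    return None
```

As with the holder class, later patterns are only searched when the earlier ones fail, so no extra work is done once a branch matches.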

Here's the way I solved this issue:

matched = False

# "regex1" etc. are placeholders; var is the string being tested
m = re.match("regex1", var)
if not matched and m:
    # do something
    matched = True

m = re.match("regex2", var)
if not matched and m:
    # do something else
    matched = True

m = re.match("regex3", var)
if not matched and m:
    # do yet something else
    matched = True

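Not nearly as clean as the original pattern. However, it is simple and straightforward, and it requires neither extra modules nor any changes to the original regexes.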

How about using a dictionary?

match_objects = {}

if match_objects.setdefault( 'mo_foo', re_foo.search( text ) ):
  # do something with match_objects[ 'mo_foo' ]

elif match_objects.setdefault( 'mo_bar', re_bar.search( text ) ):
  # do something with match_objects[ 'mo_bar' ]

elif match_objects.setdefault( 'mo_baz', re_baz.search( text ) ):
  # do something with match_objects[ 'mo_baz' ]

...

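You must, however, ensure that there are no duplicate match_objects dictionary keys (mo_foo, mo_bar, ...), preferably by giving each regular expression its own name and naming the match_objects keys accordingly. Otherwise match_objects.setdefault() returns the value already stored under that key instead of the new match object produced by re_xxx.search(text).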
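The pitfall is easy to reproduce. In this small demonstration the key 'mo' is deliberately reused; the sample text and the patterns are arbitrary choices for illustration:

```python
import re

text = "bar 42"
match_objects = {}

# First use of the key: foo does not match, so None is stored under 'mo'.
first = match_objects.setdefault('mo', re.search(r'foo (\d+)', text))

# Reusing the key: the bar search succeeds, but setdefault() returns the
# value already stored (None) and the new match object is discarded.
second = match_objects.setdefault('mo', re.search(r'bar (\d+)', text))

# A distinct key behaves as intended.
third = match_objects.setdefault('mo_bar', re.search(r'bar (\d+)', text))
```

Note that a failed search counts as "already stored" too, since setdefault() only checks whether the key exists, not whether its value is truthy.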