Python: replacing all regex matches in a single line

Note: this page is a side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/4338032/

Time: 2020-08-18 15:18:16  Source: igfitidea

replacing all regex matches in single line

Tags: python, regex

Asked by damir

I have a dynamic regexp, and I don't know in advance how many groups it has. I would like to replace all matches with XML tags.

Example:

re.sub("(this).*(string)","this is my string",'<markup>\anygroup</markup>')
>> "<markup>this</markup> is my <markup>string</markup>"

Is that even possible in a single line?

Accepted answer by Marius Gedminas

For a constant regexp like the one in your example, do:

import re

# \1, \2, \3 refer back to the three captured groups
re.sub("(this)(.*)(string)",
       r'<markup>\1</markup>\2<markup>\3</markup>',
       text)

Note that you need to enclose .* in parentheses as well if you don't want to lose it.
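To make the point about .* concrete, here is a minimal sketch comparing the two patterns on the question's sample string. The variable names lost and kept are only for illustration.

```python
import re

text = "this is my string"

# Without a group around .*, the text between the two words is matched
# but never captured, so the replacement has no way to put it back.
lost = re.sub(r"(this).*(string)",
              r"<markup>\1</markup><markup>\2</markup>",
              text)

# With .* captured as group 2, it can be re-emitted unchanged.
kept = re.sub(r"(this)(.*)(string)",
              r"<markup>\1</markup>\2<markup>\3</markup>",
              text)

print(lost)
print(kept)
```

The first call drops " is my " entirely, because re.sub replaces the whole match and only the captured parts are available to the replacement.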

Now if you don't know what the regexp looks like, it's more difficult, but should be doable.

pattern = "(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 0
                         else s for n, s in enumerate(m.groups())),
       text)

If the first thing matched by your pattern doesn't necessarily have to be marked up, use this instead, with the first group optionally matching some prefix text that should be left alone:

pattern = "()(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 1
                         else s for n, s in enumerate(m.groups())),
       text)
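One thing to watch if the prefix group is truly optional: a group that does not participate in the match is reported as None by m.groups(), which would break ''.join. A small sketch, assuming a hypothetical pattern whose first group is an optional "see " prefix, and using m.groups('') to substitute an empty string:

```python
import re

# Hypothetical pattern: group 1 is an optional prefix that must stay unmarked.
pattern = r"(see )?(this)(.*)(string)"

def mark_odd_groups(m):
    # groups('') returns '' instead of None for groups that did not
    # participate in the match, so ''.join never sees None.
    return ''.join('<markup>%s</markup>' % s if n % 2 == 1 else s
                   for n, s in enumerate(m.groups('')))

with_prefix = re.sub(pattern, mark_odd_groups, "see this is my string")
without_prefix = re.sub(pattern, mark_odd_groups, "this is my string")

print(with_prefix)
print(without_prefix)
```

The function name mark_odd_groups is made up for this example; the lambda form above works the same way once m.groups() is replaced by m.groups('').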

You get the idea.

If your regexps are complicated and you're not sure you can make everything part of a group, where only every second group needs to be marked up, you might do something smarter with a more complicated function:

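pattern = "(this).*(string)"
def replacement(m):
    s = m.group()
    # m.span(i) is relative to the whole input string, while s is only
    # the matched text, so shift every span by the start of the match
    offset = m.start()
    n_groups = len(m.groups())
    # assume groups do not overlap and are listed left-to-right;
    # go right-to-left so earlier insertions don't shift later spans
    for i in range(n_groups, 0, -1):
        lo, hi = m.span(i)
        if lo == -1:  # group did not participate in the match
            continue
        lo, hi = lo - offset, hi - offset
        s = s[:lo] + '<markup>' + s[lo:hi] + '</markup>' + s[hi:]
    return s
re.sub(pattern, replacement, text)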

If you need to handle overlapping groups, you're on your own, but it should be doable.
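For what it's worth, here is one possible sketch for the nested case (a group inside another group). It is only an illustration under a few assumptions: the helper name mark_nested is made up, nested groups produce nested tags, and groups that are empty or did not participate are skipped. It records where each tag has to go and emits them in one left-to-right pass.

```python
import re

def mark_nested(pattern, text, tag="markup"):
    """Wrap every non-empty group of every match in <tag>...</tag>."""
    open_tag, close_tag = "<%s>" % tag, "</%s>" % tag

    def replacement(m):
        base = m.start()  # spans are relative to the whole input string
        events = []
        for i in range(1, m.re.groups + 1):
            lo, hi = m.span(i)
            if lo == -1 or lo == hi:  # unmatched or empty group: skip
                continue
            # At the same position, closing tags (kind 0) sort before
            # opening tags (kind 1). Inner groups close first, outer
            # groups open first.
            events.append((hi - base, 0, -lo, -i, close_tag))
            events.append((lo - base, 1, -hi, i, open_tag))
        s = m.group()
        out, prev = [], 0
        for pos, _kind, _key1, _key2, tag_text in sorted(events):
            out.append(s[prev:pos])
            out.append(tag_text)
            prev = pos
        out.append(s[prev:])
        return "".join(out)

    return re.sub(pattern, replacement, text)

print(mark_nested(r"(this (is)) my (string)", "xx this is my string"))
```

Whether nested tags are the right output for overlapping groups depends on what the markup is for, so treat the ordering rules here as one choice among several.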

Answer by Ignacio Vazquez-Abrams

re.sub() will replace everything it can. If you pass it a function for repl, then you can do even more.
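As a small illustration of passing a function as repl, the sketch below upper-cases each matched word. The function name shout is invented for the example; the function receives the match object for each occurrence and its return value is used as the replacement.

```python
import re

def shout(m):
    # m is the match object for one occurrence; whatever is returned
    # here replaces that occurrence in the result.
    return m.group(0).upper()

result = re.sub(r"\b(this|string)\b", shout, "this is my string")

# The optional count argument limits how many matches are replaced.
first_only = re.sub(r"\b(this|string)\b", shout, "this is my string", count=1)

print(result)
print(first_only)
```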

Answer by Tim Pietzcker

Yes, this can be done in a single line.

>>> re.sub(r"\b(this|string)\b", r"<markup>\1</markup>", "this is my string")
'<markup>this</markup> is my <markup>string</markup>'

\b ensures that only complete words are matched.

So if you have a list of words that you need to mark up, you could do the following:

>>> mywords = ["this", "string", "words"]
>>> myre = r"\b(" + "|".join(mywords) + r")\b"
>>> re.sub(myre, r"<markup>\1</markup>", "this is my string with many words!")
'<markup>this</markup> is my <markup>string</markup> with many <markup>words</markup>!'
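If the words in the list can contain regex metacharacters, joining them as-is changes what the pattern matches. A hedged sketch, using a hypothetical word list that includes "node.js", shows the difference re.escape makes:

```python
import re

# Hypothetical word list; "node.js" contains the metacharacter "."
mywords = ["node.js", "string"]

unescaped = r"\b(" + "|".join(mywords) + r")\b"
escaped = r"\b(" + "|".join(re.escape(w) for w in mywords) + r")\b"

sample = "nodexjs is not node.js"

# Unescaped, "." matches any character, so "nodexjs" is marked up too.
loose = re.sub(unescaped, r"<markup>\1</markup>", sample)

# Escaped, only the literal text "node.js" is marked up.
strict = re.sub(escaped, r"<markup>\1</markup>", sample)

print(loose)
print(strict)
```

Note that \b only helps when the words begin and end with word characters; a word such as "c++" would need a different boundary check.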