How do I perform multiple substitutions with regular expressions in Python?

Notice: this page is a translated copy of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/15175142/

Time: 2020-08-18 13:32:29  Source: igfitidea

How can I do multiple substitutions using regex in python?

Tags: python, regex, string

Asked by Euridice01

I can use the code below to create a new file in which a is replaced with aa, using regular expressions.

import re

with open("notes.txt") as text:
    new_text = re.sub("a", "aa", text.read())
    with open("notes2.txt", "w") as result:
        result.write(new_text)

I was wondering: do I have to use this line, new_text = re.sub("a", "aa", text.read()), multiple times, substituting a different letter each time, in order to change more than one letter in my text?

That is, a --> aa, b --> bb and c --> cc.

So do I have to write that line for every letter I want to change, or is there an easier way? Perhaps I could create a "dictionary" of translations. Should I put those letters into an array? I'm not sure how to call on them if I do.

Accepted answer by Emmett Butler

The answer proposed by @nhahtdh is valid, but I would argue it is less Pythonic than the canonical example, which uses code less opaque than his regex manipulations and takes advantage of Python's built-in data structures and anonymous function feature.

A dictionary of translations makes sense in this context. In fact, that's how the Python Cookbook does it, as shown in this example (copied from ActiveState http://code.activestate.com/recipes/81330-single-pass-multiple-replace/)

import re

def multiple_replace(replacements, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

if __name__ == "__main__":

    text = "Larry Wall is the creator of Perl"

    # Named "replacements" rather than "dict" to avoid shadowing the built-in
    replacements = {
        "Larry Wall": "Guido van Rossum",
        "creator": "Benevolent Dictator for Life",
        "Perl": "Python",
    }

    # print() is a function in Python 3
    print(multiple_replace(replacements, text))

So in your case, you could make a dict trans = {"a": "aa", "b": "bb"} and then pass it into multiple_replace along with the text you want translated. Basically, all that function does is build one big regex that matches any of the (escaped) dictionary keys; whenever one is found, the lambda function passed to regex.sub looks up the replacement in the translation dictionary.
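
For the letters in the question, a minimal sketch might look like the following. It re-declares a version of multiple_replace so the snippet runs on its own; the name trans comes from the paragraph above, while the sample string is made up for illustration.

```python
import re

def multiple_replace(replacements, text):
    # Build one alternation from the escaped keys, e.g. "(a|b|c)"
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements)))
    # Look up each matched key in the dictionary
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

trans = {"a": "aa", "b": "bb", "c": "cc"}
sample = "abc cab"  # made-up sample text
doubled = multiple_replace(trans, sample)
print(doubled)  # aabbcc ccaabb
```

Because the text is scanned in a single pass, the inserted "aa" is not scanned again, so replacements do not feed into each other.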

You could use this function while reading from your file, for example:

with open("notes.txt") as text:
    new_text = multiple_replace(replacements, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)

I've actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.

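As @nhahtdh pointed out, one downside to this approach is that it is not prefix-free: dictionary keys that are prefixes of other dictionary keys will cause the method to break, because the alternation tries the keys from left to right and a shorter key listed first matches before the longer key is ever tried.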
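
One possible workaround, which is not part of the original answer, is to sort the keys longest-first before joining them. The sketch below assumes Python 3.7+ (where dicts keep insertion order); the function name multiple_replace_longest_first and the months data are hypothetical.

```python
import re

def multiple_replace_longest_first(replacements, text):
    # Sort keys so that longer keys come before their own prefixes
    keys = sorted(replacements, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

# Hypothetical data: "Jan" is a prefix of "January"
months = {"Jan": "01", "January": "1"}

# Unsorted alternation "Jan|January": the shorter key wins
naive = re.compile("|".join(map(re.escape, months)))
naive_result = naive.sub(lambda mo: months[mo.group(0)], "January")
fixed_result = multiple_replace_longest_first(months, "January")

print(naive_result)  # 01uary
print(fixed_result)  # 1
```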

Answer by nhahtdh

You can use a capturing group and a backreference:

re.sub(r"([characters])", r"\1\1", text.read())

Put the characters that you want to double up between [ and ]. For lower case a, b, c:

re.sub(r"([abc])", r"\1\1", text.read())

In the replacement string, you can refer to whatever was matched by a capturing group () with the \n notation, where n is some positive integer (0 excluded). \1 refers to the first capturing group. There is another notation, \g<n>, where n can be any non-negative integer (0 allowed); \g<0> refers to the whole text matched by the expression.
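
As a small illustration of the two notations, the sketch below doubles the letters a, b and c once with \1 and once with \g<0>. The sample string is made up, and the second call drops the parentheses because \g<0> does not need a capturing group.

```python
import re

text = "abcd"  # made-up sample text

# \1 refers to the first (and only) capturing group
by_number = re.sub(r"([abc])", r"\1\1", text)

# \g<0> refers to the whole match, so no group is required
by_whole_match = re.sub(r"[abc]", r"\g<0>\g<0>", text)

print(by_number)       # aabbccd
print(by_whole_match)  # aabbccd
```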

If you want to double up all characters except newlines:

re.sub(r"(.)", r"\1\1", text.read())


If you want to double up all characters (newlines included):

re.sub(r"(.)", r"\1\1", text.read(), flags=re.S)
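
To see the effect of re.S (also spelled re.DOTALL), here is a short sketch on a made-up string that ends in a newline. Without the flag, the dot skips the newline; with it, the newline is doubled too.

```python
import re

text = "ab\n"  # made-up sample ending in a newline

without_newline = re.sub(r"(.)", r"\1\1", text)
with_newline = re.sub(r"(.)", r"\1\1", text, flags=re.S)

print(repr(without_newline))  # 'aabb\n'
print(repr(with_newline))     # 'aabb\n\n'
```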

Answer by Leo

Using tips from "how to make a 'stringy' class", we can make an object identical to a string but for an extra sub method:

import re
class Substitutable(str):
  def __new__(cls, *args, **kwargs):
    newobj = str.__new__(cls, *args, **kwargs)
    newobj.sub = lambda fro,to: Substitutable(re.sub(fro, to, newobj))
    return newobj

This lets you use the builder pattern, which looks nicer but works only for a predetermined number of substitutions. If you apply the substitutions in a loop, there is no point in creating an extra class. E.g.

>>> h = Substitutable('horse')
>>> h
'horse'
>>> h.sub('h', 'f')
'forse'
>>> h.sub('h', 'f').sub('f','h')
'horse'
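
For the loop case mentioned above, a plain function is probably enough. The following is a minimal sketch, with sub_all as a hypothetical name; it applies the pairs one after another, so later substitutions see the output of earlier ones, just like the chained .sub calls.

```python
import re

def sub_all(pairs, text):
    # Apply each (pattern, replacement) pair in order;
    # later pairs operate on the result of earlier ones
    for pattern, replacement in pairs:
        text = re.sub(pattern, replacement, text)
    return text

result = sub_all([("h", "f"), ("f", "h")], "horse")
print(result)  # horse
```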

Answer by Jordan McBain

I found I had to modify Emmett J. Butler's code by changing the lambda function to use myDict.get(mo.group(1),mo.group(1)). The original code wasn't working for me; using myDict.get() also provides the benefit of a default value if a key is not found.

import re

OIDNameContraction = {
    'Fucntion': 'Func',
    'operated': 'Operated',
    'Asist': 'Assist',
    'Detection': 'Det',
    'Control': 'Ctrl',
    'Function': 'Func'
}

replacementDictRegex = re.compile("(%s)" % "|".join(map(re.escape, OIDNameContraction.keys())))

# oidDescriptionStr is assumed to already hold the text to process
oidDescriptionStr = replacementDictRegex.sub(lambda mo: OIDNameContraction.get(mo.group(1), mo.group(1)), oidDescriptionStr)
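
One situation where the .get() default matters is when the regex can match text that is not literally a dictionary key. The sketch below assumes a case-insensitive match (the re.IGNORECASE flag is my assumption, not part of the answer) and uses made-up names and input.

```python
import re

contractions = {"Function": "Func", "Control": "Ctrl"}  # made-up subset
regex = re.compile(
    "(%s)" % "|".join(map(re.escape, contractions)),
    flags=re.IGNORECASE,  # assumption, added only to illustrate the default
)

description = "Control function"  # made-up input
# "function" matches case-insensitively but is not a key,
# so .get() falls back to the matched text instead of raising KeyError
shortened = regex.sub(
    lambda mo: contractions.get(mo.group(1), mo.group(1)), description
)
print(shortened)  # Ctrl function
```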

Answer by George Pipis

You can use the pandas library and its replace function. Here is an example with five replacements:

import pandas as pd

df = pd.DataFrame({'text': ['Billy is going to visit Rome in November', 'I was born in 10/10/2010', 'I will be there at 20:00']})

# Raw strings keep the backslashes in the regex patterns intact
to_replace = ['Billy', 'Rome', 'January|February|March|April|May|June|July|August|September|October|November|December', r'\d{2}:\d{2}', r'\d{2}/\d{2}/\d{4}']
replace_with = ['name', 'city', 'month', 'time', 'date']

print(df.text.replace(to_replace, replace_with, regex=True))

And the modified text is:

0    name is going to visit city in month
1                      I was born in date
2                 I will be there at time

You can find the example here

Answer by Hamid Zaree

If you are dealing with files, I have some simple Python code for this problem. More info here.

import re

def multiple_replace(dictionary, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dictionary.keys())))

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: dictionary[mo.group(0)], text)


if __name__ == "__main__":

    dictionary = {
        "Wiley Online Library": "Wiley",
        "Chemical Society Reviews": "Chem. Soc. Rev.",
    }

    with open('LightBib.bib', 'r') as Bib_read:
        with open('Abbreviated.bib', 'w') as Bib_write:
            # Replace line by line and write each result to the output file
            for row in Bib_read:
                Bib_write.write(multiple_replace(dictionary, row))

Answer by Eric

None of the other solutions work if your patterns are themselves regexes.

For that, you need:

import re

def multi_sub(pairs, s):
    def repl_func(m):
        # only one group will be present, use the corresponding match
        return next(
            repl
            for (patt, repl), group in zip(pairs, m.groups())
            if group is not None
        )
    pattern = '|'.join("({})".format(patt) for patt, _ in pairs)
    return re.sub(pattern, repl_func, s)

Which can be used as:

>>> multi_sub([
...     ('a+b', 'Ab'),
...     ('b', 'B'),
...     ('a+', 'A.'),
... ], "aabbaa")  # matches as (aab)(b)(aa)
'AbBA.'

Note that this solution does not allow you to put capturing groups in your regexes, or use them in replacements.
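
If capturing groups inside the patterns are needed, one possible variant (a sketch of mine, not part of the original answer) wraps each pattern in a uniquely named group and uses Match.lastgroup to find out which alternative matched. The names multi_sub_named and r0, r1, ... are hypothetical. Replacements still cannot use backreferences, and numbered backreferences inside the patterns would be renumbered by the wrapper groups.

```python
import re

def multi_sub_named(pairs, s):
    # Wrap each pattern in a uniquely named group:
    # (?P<r0>...)|(?P<r1>...)|...
    pattern = "|".join(
        "(?P<r{}>{})".format(i, patt) for i, (patt, _) in enumerate(pairs)
    )

    def repl_func(m):
        # The wrapper group closes last, so m.lastgroup is its name, e.g. "r1"
        index = int(m.lastgroup[1:])
        return pairs[index][1]

    return re.sub(pattern, repl_func, s)

print(multi_sub_named([('a+b', 'Ab'), ('b', 'B'), ('a+', 'A.')], "aabbaa"))  # AbBA.
print(multi_sub_named([('(x)y', '1'), ('z', '2')], "xyz"))  # 12
```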
