Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/13748674/

Date: 2020-08-18 09:32:04  Source: igfitidea

Python re.sub(): how to substitute all 'u' or 'U's with 'you'

Tags: python, regex

Asked by user823743

I am doing some text normalization using Python and regular expressions. I would like to substitute all 'u' or 'U's with 'you'. Here is what I have done so far:

    import re
    text = 'how are u? umberella u! u. U. U@ U# u '
    print(re.sub(' [u|U][s,.,?,!,W,#,@ (^a-zA-Z)]', ' you ', text))

The output I get is:

how are you  you berella you  you  you  you  you  you

As you can see the problem is that 'umberella' is changed to 'berella'. Also I want to keep the character that appears after a 'u'. For example I want 'u!' to be changed to 'you!'. Can anyone tell me what I am doing wrong and what is the best way to write the regular expression?

Accepted answer by Martin Ender

Firstly, why doesn't your solution work? You mix up a lot of concepts, mostly character classes with other ones. In the first character class you use |, which stems from alternation. In character classes you don't need the pipe. Just list all characters (and character ranges) you want:

[Uu]

Or simply write u if you use the case-insensitive modifier. If you write a pipe there, the character class will actually match pipes in your subject string.
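
To make the point about the pipe concrete, here is a minimal sketch. The sample string 'u | U' and the variable names are made up for illustration; only `re.findall` and `re.I` come from the standard library.

```python
import re

sample = 'u | U'  # hypothetical input that contains a literal pipe

# Inside a character class the pipe is a literal character, not alternation.
with_pipe = re.findall(r'[u|U]', sample)
print(with_pipe)      # → ['u', '|', 'U']

# Listing the characters is enough.
without_pipe = re.findall(r'[uU]', sample)
print(without_pipe)   # → ['u', 'U']

# With the case-insensitive flag, a plain u matches both cases.
ignore_case = re.findall(r'u', sample, flags=re.I)
print(ignore_case)    # → ['u', 'U']
```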

Now, in the second character class, you use commas to separate your characters for some odd reason. That does nothing but add commas to the set of matchable characters. s and W are probably supposed to be the built-in character classes. Then escape them! Otherwise they will just match a literal s and a literal W. But then \W already includes everything else you listed there, so a \W alone (without square brackets) would have been enough. And the last part, (^a-zA-Z), doesn't work either, because it simply adds ^, (, ) and all letters to the character class. The negation syntax only works for entire character classes, like [^a-zA-Z].
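
The following sketch checks those claims against the second character class from the question. The short test strings and variable names are assumptions chosen for illustration.

```python
import re

# The second character class from the question, taken on its own.
question_class = r'[s,.,?,!,W,#,@ (^a-zA-Z)]'

# Commas, parentheses and the caret are matched literally, and so is any letter.
literal_hits = re.findall(question_class, ',(^)m')
print(literal_hits)   # → [',', '(', '^', ')', 'm']

# An escaped \W already covers punctuation and whitespace.
nonword_hits = re.findall(r'\W', 'u? u!')
print(nonword_hits)   # → ['?', ' ', '!']

# Negation only applies when ^ is the first character of the class.
negated_hits = re.findall(r'[^a-zA-Z]', 'u? m')
print(negated_hits)   # → ['?', ' ']
```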

What you actually want is to assert that there is no letter before or after your u. You can use lookarounds for that. The advantage is that they won't be included in the match and thus won't be removed:

r'(?<![a-zA-Z])[uU](?![a-zA-Z])'

Note that I used a raw string. This is generally good practice for regular expressions, to avoid problems with escape sequences.

These are negative lookarounds that make sure that there is no letter character before or after your u. This is an important difference to asserting that there is a non-letter character around (which is similar to what you did), because the latter approach won't work at the beginning or end of the string.
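
As a rough illustration of that difference, the sketch below runs both approaches on a made-up string that has a u at the very start and a U at the very end. The `consuming` pattern is my own stand-in for "asserting that there is a non-letter character around", not code from the question.

```python
import re

# Hypothetical sample with a u at the very start and a U at the very end.
text = 'u ok? umberella u! U'

# Lookarounds check the neighbours without consuming them.
lookaround = r'(?<![a-zA-Z])[uU](?![a-zA-Z])'
with_lookaround = re.sub(lookaround, 'you', text)
print(with_lookaround)   # → you ok? umberella you! you

# Requiring an actual non-letter on both sides consumes those characters
# and cannot match at the start or end of the string.
consuming = r'[^a-zA-Z][uU][^a-zA-Z]'
with_consuming = re.sub(consuming, ' you ', text)
print(with_consuming)    # the first u and the last U stay, and the ! is lost
```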

Of course, you can remove the spaces around you from the replacement string.

If you don't want to replace a u that is next to a digit, you can easily include digits in the character classes:

r'(?<![a-zA-Z0-9])[uU](?![a-zA-Z0-9])'
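
A small sketch of what adding the digits changes, using a made-up input string:

```python
import re

text = 'u2 and u'  # hypothetical input with a u next to a digit

letters_only = r'(?<![a-zA-Z])[uU](?![a-zA-Z])'
letters_and_digits = r'(?<![a-zA-Z0-9])[uU](?![a-zA-Z0-9])'

# Without digits in the classes, the u in front of the 2 is replaced too.
replaced_next_to_digit = re.sub(letters_only, 'you', text)
print(replaced_next_to_digit)   # → you2 and you

# With digits included, only the standalone u is replaced.
kept_next_to_digit = re.sub(letters_and_digits, 'you', text)
print(kept_next_to_digit)       # → u2 and you
```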

And if for some reason an adjacent underscore should also disqualify your u from replacement, you could include that as well. But then the character class coincides with the built-in \w:

r'(?<!\w)[uU](?!\w)'

Which is, in this case, equivalent to EarlGray's r'\b[uU]\b'.
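
The equivalence can be spot-checked with a short sketch. The first sample is the text from the question; the others are made-up edge cases, so this is a sanity check rather than a proof.

```python
import re

samples = [
    'how are u? umberella u! u. U. U@ U# u ',  # text from the question
    'u_turn',                                  # underscore right after the u
    'u2',                                      # digit right after the u
    'U',                                       # the whole string is the match
]

# Replace with both patterns and keep the results side by side.
with_lookarounds = [re.sub(r'(?<!\w)[uU](?!\w)', 'you', s) for s in samples]
with_boundaries = [re.sub(r'\b[uU]\b', 'you', s) for s in samples]

for result in with_lookarounds:
    print(result)
print(with_lookarounds == with_boundaries)   # → True
```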

As mentioned above you can shorten all of these, by using the case-insensitive modifier. Taking the first expression as an example:

re.sub(r'(?<![a-z])u(?![a-z])', 'you', text, flags=re.I)

or

re.sub(r'(?<![a-z])u(?![a-z])', 'you', text, flags=re.IGNORECASE)

depending on your preference.

I suggest that you do some reading through the tutorial I linked several times in this answer. The explanations are very comprehensive and should give you a good headstart on regular expressions, which you will probably encounter again sooner or later.

Answer by Dmytro Sirenko

Use the special character \b, which matches the empty string at the beginning or at the end of a word:

print(re.sub(r'\b[uU]\b', 'you', text))

Spaces are not a reliable solution because there are also plenty of other punctuation marks, so an abstract character \b was invented to indicate a word's beginning or end.
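
To see where \b actually matches, here is a minimal sketch on a made-up string. It lists the boundary positions and shows that the start and end of the string count as boundaries too.

```python
import re

sample = 'u ok'  # hypothetical input

# \b matches an empty string, so each match only has a position.
boundary_positions = [m.start() for m in re.finditer(r'\b', sample)]
print(boundary_positions)   # → [0, 1, 2, 4]

# A standalone u is replaced wherever it appears, including at either end.
replaced = re.sub(r'\b[uU]\b', 'you', 'u ok, U')
print(replaced)             # → you ok, you
```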

Answer by Edward

Another possible solution I came up with was:

re.sub(r'([uU]+(.)?\s)',' you ', text)

Answer by ricdeez

This worked for me:

    import re
    text = 'how are u? umberella u! u. U. U@ U# u '
    rex = re.compile(r'\bu\b', re.IGNORECASE)
    print(rex.sub('you', text))

It pre-compiles the regular expression and makes use of re.IGNORECASE so that we don't have to worry about case in our regular expression! BTW, I love the funky spelling of umbrella! :-)

Answer by Jagadanna

It can also be achieved with the code below:

import re

text = 'how are u? umberella u! u. U. U@ U# u '
print (re.sub (r'[uU] ( [^a-z] )', r' you ', text))

or

print (re.sub (r'[uU] ( [\s!,.?@#] )', r' you ', text))