Python: remove words of length less than 4 from a string

Notice: this page is a translated copy of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/24332025/

Time: 2020-08-19 04:26:34  Source: igfitidea

Remove words of length less than 4 from string

python regex

Asked by blackmamba

I am trying to remove words of length less than 4 from a string.

I use this regex:

 re.sub(' \w{1,3} ', ' ', c)

This removes some words, but it fails when two or three words shorter than 4 characters appear next to each other. For example:

 I am in a bank.

It gives me:

 I in bank. 

How can I resolve this?

Accepted answer by Martijn Pieters

Don't include the spaces; use \b word-boundary anchors instead:

re.sub(r'\b\w{1,3}\b', '', c)

This removes words of up to 3 characters entirely:

>>> import re
>>> re.sub(r'\b\w{1,3}\b', '', 'The quick brown fox jumps over the lazy dog')
' quick brown  jumps over  lazy '
>>> re.sub(r'\b\w{1,3}\b', '', 'I am in a bank.')
'    bank.'
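
The results above still contain the spaces that surrounded the removed words. If that matters, one possible follow-up (not part of the original answer) is to collapse the leftover whitespace afterwards. This is a minimal sketch; the function name remove_short_words and the parameter min_len are made up for illustration, and min_len is assumed to be at least 2:

```python
import re

def remove_short_words(text, min_len=4):
    """Remove words shorter than min_len, then tidy the leftover spaces.

    remove_short_words and min_len are hypothetical names for this sketch;
    min_len is assumed to be at least 2.
    """
    # With min_len=4 this builds the same pattern as the answer: \b\w{1,3}\b
    pattern = r'\b\w{1,%d}\b' % (min_len - 1)
    stripped = re.sub(pattern, '', text)
    # split() with no arguments discards runs of whitespace;
    # join() puts the remaining pieces back with single spaces.
    return ' '.join(stripped.split())

print(remove_short_words('I am in a bank.'))  # bank.
print(remove_short_words('The quick brown fox jumps over the lazy dog'))  # quick brown jumps over lazy
```

Note that this also strips leading and trailing whitespace, which may or may not be what you want.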

Answer by Vidhya G

If you want an alternative to regex:

new_string = ' '.join([w for w in old_string.split() if len(w)>3])
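
For reference, here is the same idea wrapped in a runnable sketch. The function name drop_short_words and the parameter min_len are hypothetical. One difference from the regex approach is worth knowing: split() only splits on whitespace, so attached punctuation counts toward a word's length.

```python
def drop_short_words(old_string, min_len=4):
    """Keep only whitespace-separated tokens with at least min_len characters.

    drop_short_words and min_len are hypothetical names for this sketch.
    """
    return ' '.join(w for w in old_string.split() if len(w) >= min_len)

print(drop_short_words('I am in a bank.'))   # bank.
# 'far.' is kept because the trailing period makes it 4 characters long.
print(drop_short_words('The fox ran far.'))  # far.
```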

Answer by Sizik

Martijn already answered, but I just wanted to explain why your regex doesn't work. The regex string ' \w{1,3} ' matches a space, followed by 1-3 word characters, followed by another space. The I doesn't get matched because it doesn't have a space in front of it. The am gets replaced, and then the regex engine resumes at the next unmatched character: the i in in. It doesn't see the space before in, because that space was already consumed as the end of the ' am ' match (the space that appears there in the output is the one put in by the substitution). So the next match it finds is ' a ', which produces your output string.
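
The matching behaviour described above can be checked with re.finditer, which reports where each match sits in the input. The second part is only an illustration and does not come from the answers: a lookahead tests for the trailing space without consuming it, so adjacent short words each get matched. The I still survives because it has no space in front of it, which is why \b is the better fix.

```python
import re

c = 'I am in a bank.'

# Each match of the original pattern, with its (start, end) span in c.
matches = [(m.group(), m.span()) for m in re.finditer(r' \w{1,3} ', c)]
print(matches)  # [(' am ', (1, 5)), (' a ', (7, 10))]

# Illustration only: (?= ) checks that a space follows but leaves it unconsumed,
# so ' in' can match immediately after ' am'.
with_lookahead = re.sub(r' \w{1,3}(?= )', '', c)
print(with_lookahead)  # I bank.
```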