Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors: Stack Overflow
Original source: http://stackoverflow.com/questions/15435726/
Remove all occurrences of words in a string from a python list
Asked by Ogre
I'm trying to match and remove all words in a list from a string using a compiled regex but I'm struggling to avoid occurrences within words.
Current:
import re
REMOVE_LIST = ["a", "an", "as", "at", ...]
remove = '|'.join(REMOVE_LIST)
regex = re.compile(r'('+remove+')', flags=re.IGNORECASE)
out = regex.sub("", text)
In: "The quick brown fox jumped over an ant"
Out: "quick brown fox jumped over t"
Expected: "quick brown fox jumped over"
I've tried changing the string to compile to the following but to no avail:
regex = re.compile(r'\b('+remove+')\b', flags=re.IGNORECASE)
Any suggestions or am I missing something garishly obvious?
Accepted answer by NPE
One problem is that only the first \b is inside a raw string. The second gets interpreted as the backspace character (ASCII 8) rather than as a word boundary.
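The difference is easy to check in an interpreter. This is a minimal sketch relying only on built-in string behaviour; the variable names plain and raw are illustrative:

```python
# In a normal string literal, '\b' is a single backspace character (ASCII 8).
plain = '\b'
# In a raw string literal, r'\b' is two characters: a backslash and 'b',
# which the re module then reads as a word boundary.
raw = r'\b'

print(len(plain), ord(plain))  # 1 8
print(len(raw), raw == '\\b')  # 2 True
```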
To fix, change
regex = re.compile(r'\b('+remove+')\b', flags=re.IGNORECASE)
to
regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE)
                                 ^ THIS
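Putting the fix together, a runnable sketch might look like the following. Several parts are assumptions rather than part of the answer: the list is a shortened stand-in for the question's elided REMOVE_LIST (with "the" assumed to be among the missing entries), re.escape is added as a precaution, and the final whitespace cleanup is an extra step. With this stand-in list, "ant" is kept because it is a separate word that is not in the list.

```python
import re

# Shortened stand-in for the question's list, whose full contents are elided;
# "the" is assumed to be one of the missing entries.
REMOVE_LIST = ["a", "an", "as", "at", "the"]

# re.escape is an addition here: it guards against entries containing regex
# metacharacters and changes nothing for plain words.
remove = '|'.join(re.escape(word) for word in REMOVE_LIST)

# Both \b markers sit in raw string literals, so each is a word boundary.
regex = re.compile(r'\b(' + remove + r')\b', flags=re.IGNORECASE)

text = "The quick brown fox jumped over an ant"
out = regex.sub("", text)

# Removing words leaves their surrounding spaces behind; collapsing runs of
# whitespace is an extra step not covered in the answer.
cleaned = ' '.join(out.split())
print(cleaned)  # quick brown fox jumped over ant
```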
Answer by jurgenreza
Here is a suggestion that avoids regex, which you may want to consider:
>>> sentence = 'word1 word2 word3 word1 word2 word4'
>>> remove_list = ['word1', 'word2']
>>> word_list = sentence.split()
>>> ' '.join([i for i in word_list if i not in remove_list])
'word3 word4'
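The same idea can be wrapped in a function. This is a sketch with two assumptions layered on top of the answer: the comparison is made case-insensitive to mirror the re.IGNORECASE flag in the question, and the remove list is converted to a set for faster lookups. The name remove_words is hypothetical. Note that split() separates on whitespace only, so a word with attached punctuation (for example 'word1,') would not match.

```python
def remove_words(sentence, remove_list):
    """Drop every whitespace-separated word that appears in remove_list."""
    # Lower-case the list once; set membership tests are O(1) on average.
    remove_set = {word.lower() for word in remove_list}
    kept = [word for word in sentence.split() if word.lower() not in remove_set]
    return ' '.join(kept)


print(remove_words('word1 word2 word3 word1 word2 word4', ['word1', 'word2']))
# word3 word4
print(remove_words('The quick brown fox jumped over an ant',
                   ['a', 'an', 'as', 'at', 'the']))
# quick brown fox jumped over ant
```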

