Python: removing a list of words from a string
Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/25346058/
Removing list of words from a string
Asked by Rohit Shinde
I have a list of stopwords. And I have a search string. I want to remove the words from the string.
As an example:
stopwords=['what','who','is','a','at','is','he']
query='What is hello'
Now the code should strip 'What' and 'is'. However, in my case it also strips 'a' and 'at'. I have given my code below. What could I be doing wrong?
for word in stopwords:
    if word in query:
        print(word)
        query = query.replace(word, "")
If the input query is "What is Hello", I get the output as: wht s llo
Why does this happen?
Accepted answer by Robby Cornelissen
This is one way to do it:
query = 'What is hello'
stopwords = ['what','who','is','a','at','is','he']
querywords = query.split()
resultwords = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)
print(result)
I noticed that you also want to remove a word if its lower-case variant is in the list, so I've added a call to lower() in the condition check.
Answer by pseudonym
Building on what karthikr said, try:
' '.join(filter(lambda x: x.lower() not in stopwords, query.split()))
Explanation:
query.split() # splits the variable query on whitespace, e.g. "What is hello" -> ["What", "is", "hello"]
filter(func, iterable) # takes a function and an iterable (list, string, etc.) and
 # keeps only the items for which the function returns True;
 # the function receives one item at a time
lambda x: x.lower() not in stopwords # anonymous function that takes a word,
 # converts it to lower case, and returns True if
 # the word is not in the iterable stopwords
' '.join(iterable) # joins all items of the iterable (items must be strings)
 # using the string in front of the dot, here ' ', as the separator,
 # e.g. ["What", "is", "hello"] -> "What is hello"
Answer by B.Adler
Looking at the other answers to your question I noticed that they told you how to do what you are trying to do, but they did not answer the question you posed at the end.
If the input query is "What is Hello", I get the output as: wht s llo
Why does this happen?
This happens because .replace() removes every occurrence of the exact substring you give it, wherever it appears, including inside other words.
For example:
"My, my! Hello my friendly mystery".replace("my", "")
gives:
>>> "My, ! Hello  friendly stery"
.replace() is essentially splitting the string by the substring given as the first parameter and joining it back together with the second parameter.
"hello".replace("he", "je")
is logically similar to:
"je".join("hello".split("he"))
If you still wanted to use .replace() to remove whole words, you might think adding a space before and after would be enough, but this misses words at the beginning and end of the string as well as occurrences next to punctuation.
"My, my! hello my friendly mystery".replace(" my ", " ")
>>> "My, my! hello friendly mystery"
"My, my! hello my friendly mystery".replace(" my", "")
>>> "My,! hello friendlystery"
"My, my! hello my friendly mystery".replace("my ", "")
>>> "My, my! hello friendly mystery"
Additionally, adding spaces before and after will not catch consecutive duplicates: the first match consumes the space between the two words, so the second occurrence no longer has a leading space to match, and .replace() continues past it:
"hello my my friend".replace(" my ", " ")
>>> "hello my friend"
For these reasons, the accepted answer by Robby Cornelissen is the recommended way to do what you want.
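If a substring-style replacement is still preferred, one possible alternative (not taken from any of the answers here) is a regular expression with \b word boundaries, which only matches whole words. The sketch below is a minimal illustration; the helper name remove_whole_words is hypothetical, and it assumes that collapsing the leftover whitespace is acceptable.

```python
import re

def remove_whole_words(text, words):
    # Hypothetical helper: build one alternation pattern from the words.
    # re.escape guards against regex metacharacters inside a word.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    # \b only matches at word boundaries, so "my" inside "mystery" is left alone.
    stripped = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the runs of whitespace left behind by the removed words.
    return " ".join(stripped.split())

print(remove_whole_words("My, my! hello my friendly mystery", ["my"]))
# , ! hello friendly mystery
print(remove_whole_words("hello my my friend", ["my"]))
# hello friend
```

Punctuation is left in place, so this fits best when only the words themselves need to go.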
Answer by Jean-François Fabre
The accepted answer works when the input is a list of words separated by spaces, but that's not the case in real life, where punctuation can also separate the words. In that case re.split is required.
Also, testing against stopwords as a set makes lookup faster (even if there's a tradeoff between string hashing and lookup when there's a small number of words).
My proposal:
import re
query = 'What is hello? Says Who?'
stopwords = {'what','who','is','a','at','is','he'}
# re.split yields an empty string when the query ends with punctuation, so skip empty items
resultwords = [word for word in re.split(r"\W+", query) if word and word.lower() not in stopwords]
print(resultwords)
Output (as list of words):
['hello', 'Says']
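One detail worth knowing: re.split(r"\W+", ...) returns an empty string as its last item when the text ends with a non-word character, so that item has to be filtered out or avoided. As an alternative sketch (not part of the answer above), re.findall(r"\w+", ...) collects the words directly, and the result can be joined back into a string as in the accepted answer.

```python
import re

query = 'What is hello? Says Who?'
stopwords = {'what', 'who', 'is', 'a', 'at', 'he'}

# Splitting on non-word runs leaves a trailing empty string here,
# because the query ends with '?'.
print(re.split(r"\W+", query))
# ['What', 'is', 'hello', 'Says', 'Who', '']

# findall returns only the word runs, so there is nothing empty to filter.
words = re.findall(r"\w+", query)
result = ' '.join(w for w in words if w.lower() not in stopwords)
print(result)
# hello Says
```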

