Python: Add/remove custom stop words with spacy

Warning: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/41170726/
Add/remove custom stop words with spacy
Asked by E.K.
What is the best way to add/remove stop words with spacy? I am using the token.is_stop attribute and would like to make some custom changes to the set. I was looking at the documentation but could not find anything regarding stop words. Thanks!
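For context, a minimal sketch of the kind of token.is_stop usage described here, assuming the "en" model shortcut used in the answers below (the sample sentence is a placeholder):

import spacy

nlp = spacy.load("en")
doc = nlp("This is a simple example sentence.")

# token.is_stop is True for tokens that are in spaCy's built-in stop list
for token in doc:
    print(token.text, token.is_stop)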
Accepted answer by dantiston
You can edit them before processing your text like this (see this post):
>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True
Note: This seems to work in spaCy up to v1.8. For newer versions, see the other answers.
Answered by Romain
Using spaCy 2.0.11, you can update its stop words set using one of the following:
To add a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")
To add several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}
To remove a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")
To remove several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}
Note: To see the current set of stopwords, use:
print(nlp.Defaults.stop_words)
Update: It was noted in the comments that this fix only affects the current session. To persist the change to the model, you can use the methods nlp.to_disk("/path") and nlp.from_disk("/path") (further described at https://spacy.io/usage/saving-loading).
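A minimal sketch of that save-and-reload flow (the directory path and the added word are placeholders, and it is assumed here that the stop-word customization is carried along by serialization):

import spacy

nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")

# Save the customized pipeline to a placeholder directory ...
nlp.to_disk("/path/to/customized_model")

# ... and later restore it instead of re-applying the change on every run
nlp2 = spacy.load("en")
nlp2.from_disk("/path/to/customized_model")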
Answered by petezurich
For version 2.0 I used this:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en")
print(STOP_WORDS)  # <- set of spaCy's default stop words
STOP_WORDS.add("your_additional_stop_word_here")

for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True
This loads all stop words into a set.
You can add your own stop words to STOP_WORDS, or use your own list in the first place (see the sketch below).
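For instance, a minimal sketch of the "your own list" variant, where my_stop_words is a hypothetical custom list:

import spacy

nlp = spacy.load("en")

# Hypothetical custom stop list used instead of spaCy's STOP_WORDS set
my_stop_words = ["foo", "bar", "baz"]

for word in my_stop_words:
    nlp.vocab[word].is_stop = True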
Answered by harryhorn
For 2.0 use the following:
import spacy

nlp = spacy.load("en")
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True
Answered by SolitaryReaper
This collects the stop words too :)
import spacy.lang.en.stop_words

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
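As a brief, self-contained sketch of one way such a collected set might be used (the sample sentence is a placeholder):

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en")
doc = nlp("this is a small example sentence")

# Keep only tokens that are not in the collected stop-word set
content_words = [t.text for t in doc if t.text.lower() not in STOP_WORDS]
print(content_words)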
Answered by Sezin
In the latest version, the following removes the word from the list:
import spacy.lang.en.stop_words

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.remove('not')
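A quick way to check the effect on an actual Doc (a sketch; note, as an assumption based on the earlier answers, that on some versions the cached lexeme flag may also need to be reset before token.is_stop reflects the change):

import spacy

nlp = spacy.load("en")

# Assumption: editing STOP_WORDS alone may not be enough on some versions,
# so the lexeme flag is reset here as well, mirroring the earlier answers
nlp.vocab['not'].is_stop = False

doc = nlp("this is not a stop word")
print([(t.text, t.is_stop) for t in doc])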