Python: Add/remove custom stop words with spacy

Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41170726/


Add/remove custom stop words with spacy

Tags: python, nlp, stop-words, spacy

Asked by E.K.

What is the best way to add/remove stop words with spacy? I am using the token.is_stop attribute and would like to make some custom changes to the set. I looked at the documentation but could not find anything about stop words. Thanks!


Accepted answer by dantiston

You can edit them before processing your text like this (see this post):


>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True

Note: This seems to work in spacy <= v1.8. For newer versions, see the other answers.


Answered by Romain

Using Spacy 2.0.11, you can update its stopwords set using one of the following:


To add a single stopword:


import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")

To add several stopwords at once:


import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}

To remove a single stopword:


import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")

To remove several stopwords at once:


import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}

Note: To see the current set of stopwords, use:


print(nlp.Defaults.stop_words)

Update: It was noted in the comments that this fix only affects the current execution. To update the model, you can use the methods nlp.to_disk("/path") and nlp.from_disk("/path") (described further at https://spacy.io/usage/saving-loading).

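A minimal sketch of that suggestion, assuming spaCy 2.x with the "en" model installed; "/path/to/my_nlp" is a placeholder directory, and how completely the stop-word change is persisted can depend on the exact spaCy version:

import spacy

nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")
nlp.vocab["my_new_stopword"].is_stop = True   # also flag the lexeme itself

nlp.to_disk("/path/to/my_nlp")     # save the modified pipeline

nlp2 = spacy.load("en")
nlp2.from_disk("/path/to/my_nlp")  # reload the saved pipeline later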

Answered by petezurich

For version 2.0 I used this:


import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en")

print(STOP_WORDS) # <- set of spaCy's default stop words

STOP_WORDS.add("your_additional_stop_word_here")

# Flag every word in the (now extended) set as a stop word in the loaded vocab
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True

This loads all stop words into a set.


You can add your own stop words to STOP_WORDS, or use your own list in the first place (see the sketch below).

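A minimal sketch of the "your own list" variant, assuming spaCy 2.x with the "en" model; the word list itself is made up for illustration:

import spacy

nlp = spacy.load("en")

my_stop_words = {"btw", "imho", "nvm"}  # hypothetical custom list

# Flag each custom word as a stop word in the loaded vocab
for word in my_stop_words:
    nlp.vocab[word].is_stop = True

doc = nlp("btw this is just a test")
print([(token.text, token.is_stop) for token in doc])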

Answered by harryhorn

For 2.0 use the following:


import spacy

nlp = spacy.load("en")

# Flag every default stop word as a stop word in the loaded vocab
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

Answered by SolitaryReaper

This collects the stop words too :)


import spacy.lang.en.stop_words

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

Answered by Sezin

In the latest version, the following would remove the word from the set:


import spacy.lang.en.stop_words

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.remove('not')
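
Note that removing a word from STOP_WORDS may not, by itself, change token.is_stop in an already-loaded pipeline; as in the earlier answers, the lexeme flag may also need to be cleared. A minimal sketch, assuming spaCy 2.x with the "en" model:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en")

STOP_WORDS.discard("not")         # remove from the shared set (no error if absent)
nlp.vocab["not"].is_stop = False  # clear the flag on the lexeme as well

print(nlp("not")[0].is_stop)      # expected: False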