Faster way to remove stop words in Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/19560498/
Asked by mchangun
I am trying to remove stopwords from a string of text:
from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])
I am processing 6 million such strings, so speed is important. Profiling my code shows that the lines above are the slowest part. Is there a better way to do this? I'm thinking of using something like regex's re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster, methods.
Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.
Thank you.
Accepted answer by Andy Rimmer
Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.
from nltk.corpus import stopwords
cachedStopWords = stopwords.words("english")
def testFuncOld():
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])
def testFuncNew():
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in cachedStopWords])
if __name__ == "__main__":
for i in xrange(10000):
testFuncOld()
testFuncNew()
I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.
nCalls  Cumulative Time
10000   7.723   words.py:7(testFuncOld)
10000   0.140   words.py:11(testFuncNew)
So, caching the stopwords instance gives a ~70x speedup.
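A minimal runnable sketch of the cached approach, with one extra tweak: keeping the cached stop words in a frozenset makes each membership test O(1). The small hardcoded word list is a stand-in for stopwords.words('english'), so the snippet runs without downloading NLTK data:

```python
# Stand-in for stopwords.words('english') so no NLTK download is needed.
# A frozenset is built once at import time and gives O(1) membership tests.
CACHED_STOP_WORDS = frozenset({'the', 'a', 'an', 'is', 'in', 'on'})

def remove_stop_words(text):
    # The set is looked up from module scope, never rebuilt per call.
    return ' '.join(word for word in text.split()
                    if word not in CACHED_STOP_WORDS)

print(remove_stop_words('hello bye the the hi'))  # prints: hello bye hi
```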
Answered by Krzysztof Szularz
First, you're creating the stop word list for every string. Create it once. A set would indeed be great here:
forbidden_words = set(stopwords.words('english'))
Later, get rid of the [] inside join. Use a generator instead:
' '.join([x for x in ['a', 'b', 'c']])
replace it with
' '.join(x for x in ['a', 'b', 'c'])
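If in doubt about which form wins, it is easy to measure both with the stdlib timeit module; this sketch assumes nothing beyond the standard library:

```python
import timeit

# Both expressions build the same string; timeit reports their relative cost.
list_form = "' '.join([x for x in ('a', 'b', 'c') * 100])"
gen_form = "' '.join(x for x in ('a', 'b', 'c') * 100)"

print('list comprehension:', timeit.timeit(list_form, number=10000))
print('generator expression:', timeit.timeit(gen_form, number=10000))
```

Worth noting: in CPython, str.join makes two passes over its input, so it can actually run faster on a list comprehension than on a generator expression; measuring on your own data settles it.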
The next thing to deal with would be to make .split() yield values instead of returning an array. I believe regex would be a good replacement here; see this thread for why s.split() is actually fast.
Lastly, do such a job in parallel (removing stop words from 6 million strings). That is a whole different topic.
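A minimal sketch of the parallel idea using the stdlib multiprocessing.Pool; the stop word set and the eight sample strings are stand-ins for the real list and the 6 million strings:

```python
from multiprocessing import Pool

STOP_WORDS = frozenset({'the', 'a', 'an', 'is'})  # stand-in stop word set

def remove_stop_words(text):
    # Each worker process filters its own batch of strings.
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

if __name__ == '__main__':
    strings = ['hello bye the the hi'] * 8  # stand-in for the 6M strings
    with Pool(processes=4) as pool:
        # chunksize batches many small strings per task to cut IPC overhead.
        cleaned = pool.map(remove_stop_words, strings, chunksize=2)
    print(cleaned[0])  # prints: hello bye hi
```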
Answered by Alfe
Use a regexp to remove all words which do not match:
import re
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)
This will probably be way faster than looping yourself, especially for large input strings.
If the last word in the text gets deleted by this, you may have trailing whitespace. I propose to handle this separately.
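A sketch of the regex approach with two small additions: re.escape guards against stop words containing regex metacharacters, and a final strip() handles the trailing-whitespace case mentioned above. The hardcoded word list is a stand-in for stopwords.words('english'):

```python
import re

stop_words = ['the', 'a', 'an', 'is']  # stand-in for stopwords.words('english')
# \b anchors on word boundaries; \s* eats the space after a removed word.
pattern = re.compile(r'\b(' + '|'.join(re.escape(w) for w in stop_words) + r')\b\s*')

def strip_stop_words(text):
    # One pass over the string; strip() removes the trailing whitespace
    # left behind when the last word of the text is a stop word.
    return pattern.sub('', text).strip()

print(strip_stop_words('hello bye the the hi'))  # prints: hello bye hi
```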
Answered by Gulshan Jangid
Sorry for the late reply. This would prove useful for new users.
- Create a dictionary of stopwords using the collections library.
- Use that dictionary for very fast lookups (time = O(1)) rather than searching a list (time = O(len(stopwords))).

from collections import Counter
stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)
text = ' '.join([word for word in text.split() if word not in stopwords_dict])
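A runnable version of the above, with a hardcoded word list standing in for stopwords.words('english'). Note that a plain set gives the same O(1) lookup as the Counter, since both hash their keys:

```python
from collections import Counter

stop_words = ['the', 'a', 'an', 'is']   # stand-in for stopwords.words('english')
stopwords_dict = Counter(stop_words)    # hashed keys -> O(1) membership tests
stopwords_set = set(stop_words)         # a plain set gives the same O(1) lookup

text = 'hello bye the the hi'
filtered = ' '.join(w for w in text.split() if w not in stopwords_dict)
print(filtered)  # prints: hello bye hi

# Both containers filter identically; only the lookup structure differs.
assert filtered == ' '.join(w for w in text.split() if w not in stopwords_set)
```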