Faster way to remove stop words in Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/19560498/
Asked by mchangun
I am trying to remove stopwords from a string of text:
from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])
I am processing 6 million such strings, so speed is important. Profiling my code shows that the lines above are the slowest part. Is there a better way to do this? I'm thinking of using something like regex's re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster, methods.
Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.
Thank you.
Accepted answer by Andy Rimmer
Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.
from nltk.corpus import stopwords
cachedStopWords = stopwords.words("english")
def testFuncOld():
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])
def testFuncNew():
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in cachedStopWords])
if __name__ == "__main__":
for i in xrange(10000):
testFuncOld()
testFuncNew()
I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.
nCalls  Cumulative Time
10000   7.723   words.py:7(testFuncOld)
10000   0.140   words.py:11(testFuncNew)
So, caching the stopwords instance gives a ~70x speedup.
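A minimal runnable sketch of the cached approach, with one extra tweak: keeping the cached stop words in a frozenset makes each membership test O(1). The small hardcoded word list is a stand-in for stopwords.words('english'), so the snippet runs without downloading NLTK data:

```python
# Stand-in for stopwords.words('english') so no NLTK download is needed.
# A frozenset is built once at import time and gives O(1) membership tests.
CACHED_STOP_WORDS = frozenset({'the', 'a', 'an', 'is', 'in', 'on'})

def remove_stop_words(text):
    # The set is looked up from module scope, never rebuilt per call.
    return ' '.join(word for word in text.split()
                    if word not in CACHED_STOP_WORDS)

print(remove_stop_words('hello bye the the hi'))  # prints: hello bye hi
```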
Answered by Krzysztof Szularz
First, you're creating the stop word list for every string. Create it once. A set would indeed be great here:
forbidden_words = set(stopwords.words('english'))
Later, get rid of the [] inside join. Use a generator instead:
' '.join([x for x in ['a', 'b', 'c']])
replace it with
' '.join(x for x in ['a', 'b', 'c'])
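If in doubt about which form wins, it is easy to measure both with the stdlib timeit module; this sketch assumes nothing beyond the standard library:

```python
import timeit

# Both expressions build the same string; timeit reports their relative cost.
list_form = "' '.join([x for x in ('a', 'b', 'c') * 100])"
gen_form = "' '.join(x for x in ('a', 'b', 'c') * 100)"

print('list comprehension:', timeit.timeit(list_form, number=10000))
print('generator expression:', timeit.timeit(gen_form, number=10000))
```

Worth noting: in CPython, str.join makes two passes over its input, so it can actually run faster on a list comprehension than on a generator expression; measuring on your own data settles it.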
The next thing to deal with would be to make .split() yield values instead of returning an array. I believe regex would be a good replacement here; see this thread for why s.split() is actually fast.
Lastly, do such a job in parallel (removing stop words from 6 million strings). That is a whole different topic.
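A minimal sketch of the parallel idea using the stdlib multiprocessing.Pool; the stop word set and the eight sample strings are stand-ins for the real list and the 6 million strings:

```python
from multiprocessing import Pool

STOP_WORDS = frozenset({'the', 'a', 'an', 'is'})  # stand-in stop word set

def remove_stop_words(text):
    # Each worker process filters its own batch of strings.
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

if __name__ == '__main__':
    strings = ['hello bye the the hi'] * 8  # stand-in for the 6M strings
    with Pool(processes=4) as pool:
        # chunksize batches many small strings per task to cut IPC overhead.
        cleaned = pool.map(remove_stop_words, strings, chunksize=2)
    print(cleaned[0])  # prints: hello bye hi
```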
Answered by Alfe
Use a regexp to remove all words which do not match:
import re
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)
This will probably be way faster than looping yourself, especially for large input strings.
If the last word in the text gets deleted by this, you may have trailing whitespace. I propose to handle this separately.
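A sketch of the regex approach with two small additions: re.escape guards against stop words containing regex metacharacters, and a final strip() handles the trailing-whitespace case mentioned above. The hardcoded word list is a stand-in for stopwords.words('english'):

```python
import re

stop_words = ['the', 'a', 'an', 'is']  # stand-in for stopwords.words('english')
# \b anchors on word boundaries; \s* eats the space after a removed word.
pattern = re.compile(r'\b(' + '|'.join(re.escape(w) for w in stop_words) + r')\b\s*')

def strip_stop_words(text):
    # One pass over the string; strip() removes the trailing whitespace
    # left behind when the last word of the text is a stop word.
    return pattern.sub('', text).strip()

print(strip_stop_words('hello bye the the hi'))  # prints: hello bye hi
```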
Answered by Gulshan Jangid
Sorry for the late reply. This would prove useful for new users.
- Create a dictionary of stopwords using the collections library.
- Use that dictionary for very fast lookups (time = O(1)) rather than searching a list (time = O(len(stopwords))).

from collections import Counter
stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)
text = ' '.join([word for word in text.split() if word not in stopwords_dict])
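A runnable version of the above, with a hardcoded word list standing in for stopwords.words('english'). Note that a plain set gives the same O(1) lookup as the Counter, since both hash their keys:

```python
from collections import Counter

stop_words = ['the', 'a', 'an', 'is']   # stand-in for stopwords.words('english')
stopwords_dict = Counter(stop_words)    # hashed keys -> O(1) membership tests
stopwords_set = set(stop_words)         # a plain set gives the same O(1) lookup

text = 'hello bye the the hi'
filtered = ' '.join(w for w in text.split() if w not in stopwords_dict)
print(filtered)  # prints: hello bye hi

# Both containers filter identically; only the lookup structure differs.
assert filtered == ' '.join(w for w in text.split() if w not in stopwords_set)
```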