在 Python 3 中加速数百万个正则表达式替换
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/42742810/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Speed up millions of regex replacements in Python 3
提问by pdanese
I'm using Python 3.5.2
我正在使用 Python 3.5.2
I have two lists
我有两个清单
- a list of about 750,000 "sentences" (long strings)
- a list of about 20,000 "words" that I would like to delete from my 750,000 sentences
- 大约 750,000 个“句子”(长字符串)的列表
- 我想从我的 750,000 个句子中删除的大约 20,000 个“单词”的列表
So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.
因此,我必须遍历 750,000 个句子并执行大约 20,000 次替换,但前提是我的单词实际上是“单词”并且不是更大字符串的一部分。
I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter:
我的做法是预编译这些单词,让它们两侧都带上 \b 元字符:
compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
Then I loop through my "sentences"
然后我循环我的“句子”
import re

for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put sentence into a growing list
This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.
这个嵌套循环每秒处理大约50 个句子,这很好,但处理我所有的句子仍然需要几个小时。
Is there a way to use the str.replace method (which I believe is faster), but still require that replacements only happen at word boundaries? Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when the length of my word is greater than the length of my sentence, but it's not much of an improvement.
有没有办法使用 str.replace 方法(我认为它更快),同时仍然要求替换只发生在单词边界上?或者,有没有办法加速 re.sub 方法?当我的单词长度大于句子长度时,我会跳过 re.sub,这让速度略有提升,但改进并不大。
Thank you for any suggestions.
谢谢你的任何建议。
采纳答案by Liteye
One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b"
.
您可以尝试的一件事是编译一个单一的模式,如"\b(word1|word2|word3)\b"
.
Because re relies on C code to do the actual matching, the savings can be dramatic.
因为 re 依赖 C 代码来进行实际匹配,所以节省的时间会非常可观。
As @pvg pointed out in the comments, it also benefits from single pass matching.
正如@pvg 在评论中指出的那样,它也受益于单程匹配。
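For illustration, here is a minimal sketch of that single-pattern approach (the word and sentence lists are tiny stand-ins for the question's my20000words and sentences; re.escape is an added precaution in case any word contains regex metacharacters):
作为示意,下面是该单一模式方法的一个最小示例(其中的单词和句子列表只是问题中 my20000words 和 sentences 的小型替身;re.escape 是额外加上的,以防某些单词包含正则表达式元字符):

import re

# Illustrative stand-ins for the question's word and sentence lists.
my_words = ["cat", "dog", "bird"]
sentences = ["The cat chased the dog.", "A catalog is not a cat."]

# One alternation instead of thousands of separate patterns.
union_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, my_words)) + r")\b")

cleaned = [union_pattern.sub("", s) for s in sentences]
print(cleaned)  # ['The  chased the .', 'A catalog is not a .']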
If your words are not regex, Eric's answer is faster.
如果您的话不是正则表达式,Eric 的回答会更快。
回答by Eric Duminil
TLDR
TLDR
Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.
如果您想要最快的解决方案,请使用此方法(基于集合查找)。对于与 OP 类似的数据集,它比被采纳的答案快大约 2000 倍。
If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.
如果您坚持使用正则表达式进行查找,请使用这个基于 trie 的版本,它仍然比正则表达式联合快 1000 倍。
Theory
理论
If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.
如果您的句子不是庞大的字符串,那么每秒处理超过 50 个可能是可行的。
If you save all the banned words into a set, it will be very fast to check if another word is included in that set.
如果您将所有禁用词保存到一个集合中,则检查该集合中是否包含另一个词会非常快。
Pack the logic into a function, give this function as argument to re.sub
and you're done!
将逻辑打包到一个函数中,将此函数作为参数提供给re.sub
,你就完成了!
Code
代码
import re

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)

def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)
Converted sentences are:
转换的句子是:
' . !
.
GiraffeElephantBoat
sfgsdg sdwerha aswertwe
Note that:
注意:
- the search is case-insensitive (thanks to lower())
- replacing a word with "" might leave two spaces (as in your code)
- with Python 3, \w+ also matches accented characters (e.g. "ångström")
- any non-word character (tab, space, newline, marks, ...) will stay untouched
- 搜索不区分大小写(得益于 lower())
- 用 "" 替换单词可能会留下两个空格(和您的代码中一样)
- 在 Python 3 中,\w+ 也会匹配带重音的字符(例如 "ångström")
- 任何非单词字符(制表符、空格、换行符、标点等)都将保持不变
Performance
表现
There are a million sentences, banned_words
has almost 100000 words and the script runs in less than 7s.
有一百万个句子,banned_words
有近 100000 个单词,脚本运行时间不到 7 秒。
In comparison, Liteye's answer needed 160s for 10 thousand sentences.
相比之下,Liteye 的回答处理 1 万个句子需要 160 秒。
With n being the total amount of words and m the amount of banned words, OP's and Liteye's code are O(n*m).
设 n 为单词总数,m 为禁用词数量,则 OP 和 Liteye 的代码复杂度是 O(n*m)。
In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).
相比之下,我的代码应该以 O(n+m) 运行。考虑到句子远多于禁用词,算法实际上是 O(n)。
Regex union test
正则表达式联合测试
What's the complexity of a regex search with a '\b(word1|word2|...|wordN)\b'
pattern? Is it O(N)
or O(1)
?
使用'\b(word1|word2|...|wordN)\b'
模式进行正则表达式搜索的复杂性如何?是O(N)
还是O(1)
?
It's pretty hard to grasp the way the regex engine works, so let's write a simple test.
很难掌握正则表达式引擎的工作方式,所以让我们编写一个简单的测试。
This code extracts 10**i random English words into a list. It creates the corresponding regex union, and tests it with different words:
这段代码将 10**i 个随机英语单词提取到一个列表中,然后创建相应的正则表达式联合,并用不同的单词对其进行测试:
- one is clearly not a word (it begins with #)
- one is the first word in the list
- one is the last word in the list
- one looks like a word but isn't
- 一个显然不是单词(以 # 开头)
- 一个是列表中的第一个单词
- 一个是列表中的最后一个单词
- 一个看起来像单词但其实不是
import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_N?T?WORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))
It outputs:
它输出:
First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']
Union of 10 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 0.7ms
Almost a word : 0.7ms
Union of 100 words
Surely not a word : 0.7ms
First word : 1.1ms
Last word : 1.2ms
Almost a word : 1.2ms
Union of 1000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 9.6ms
Almost a word : 10.1ms
Union of 10000 words
Surely not a word : 1.4ms
First word : 1.8ms
Last word : 96.3ms
Almost a word : 116.6ms
Union of 100000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 1227.1ms
Almost a word : 1404.1ms
So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b'
pattern has:
所以看起来搜索一个带有'\b(word1|word2|...|wordN)\b'
模式的单词有:
- O(1) best case
- O(n/2) average case, which is still O(n)
- O(n) worst case
- O(1) 最好情况
- O(n/2) 平均情况,仍然是 O(n)
- O(n) 最坏情况
These results are consistent with a simple loop search.
这些结果与简单的循环搜索一致。
A much faster alternative to a regex union is to create the regex pattern from a trie.
回答by Eric Duminil
TLDR
TLDR
Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.
如果您想要最快的基于正则表达式的解决方案,请使用此方法。对于类似于 OP 的数据集,它比接受的答案快大约 1000 倍。
If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.
如果您不关心正则表达式,请使用这个基于集合的版本,它比正则表达式联合快 2000 倍。
Optimized Regex with Trie
使用 Trie 优化正则表达式
A simple Regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.
一个简单的 Regex union方法在包含许多禁用词时会变得很慢,因为 regex 引擎在优化模式方面做得不是很好。
It's possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and match.
可以使用所有禁用的单词创建一个Trie并编写相应的正则表达式。生成的特里或正则表达式并不是人类可读的,但它们确实允许非常快速的查找和匹配。
Example
例子
['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']
The list is converted to a trie:
该列表被转换为一个特里:
{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}
And then to this regex pattern:
然后到这个正则表达式模式:
r"\bfoo(?:ba[hr]|xar|zap?)\b"
The huge advantage is that to test if zoo matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying the 5 words. It's preprocessing overkill for 5 words, but it shows promising results for many thousand words.
巨大的优势在于:要测试 zoo 是否匹配,正则表达式引擎只需要比较第一个字符(不匹配),而不必逐一尝试这 5 个单词。对于 5 个单词来说,这种预处理有点小题大做,但对于数千个单词,它会显示出非常可观的效果。
Note that (?:)
non-capturing groups are used because:
请注意,使用(?:)
非捕获组是因为:
- foobar|baz would match foobar or baz, but not foobaz
- foo(bar|baz) would save unneeded information to a capturing group.
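One illustrative way to see why the grouping matters once the \b anchors are added:
下面的小例子可以说明加上 \b 锚点后为什么需要分组:

import re

# Without a group, \b binds only to the first/last alternative,
# so 'baz' still matches inside 'foobaz'.
print(re.findall(r"\bfoobar|baz\b", "foobaz"))      # ['baz']

# With a (non-capturing) group, the boundaries apply to every alternative.
print(re.findall(r"\b(?:foobar|baz)\b", "foobaz"))  # []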
Code
代码
Here's a slightly modified gist, which we can use as a trie.py
library:
这是一个稍微修改的gist,我们可以将其用作trie.py
库:
import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
Test
测试
Here's a small test (the same as this one):
这是一个小测试(与此相同):
# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_N?T?WORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))
It outputs:
它输出:
TrieRegex of 10 words
Surely not a word : 0.3ms
First word : 0.4ms
Last word : 0.5ms
Almost a word : 0.5ms
TrieRegex of 100 words
Surely not a word : 0.3ms
First word : 0.5ms
Last word : 0.9ms
Almost a word : 0.6ms
TrieRegex of 1000 words
Surely not a word : 0.3ms
First word : 0.7ms
Last word : 0.9ms
Almost a word : 1.1ms
TrieRegex of 10000 words
Surely not a word : 0.1ms
First word : 1.0ms
Last word : 1.2ms
Almost a word : 1.2ms
TrieRegex of 100000 words
Surely not a word : 0.3ms
First word : 1.2ms
Last word : 0.9ms
Almost a word : 1.6ms
For info, the regex begins like this:
有关信息,正则表达式如下所示:
(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|ttheitroad(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti...
It's really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!
真的不可读,但是对于 100000 个禁用词的列表,这个 Trie regex 比简单的 regex union 快 1000 倍!
Here's a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi:
这是用 trie-python-graphviz 和 graphviz twopi 导出的完整 trie 图:
回答by Denziloe
One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.
您可能想要尝试的一件事是预处理句子以对单词边界进行编码。基本上通过在单词边界上拆分将每个句子变成单词列表。
This should be faster, because to process a sentence, you just have to step through each of the words and check if it's a match.
这应该更快,因为要处理一个句子,您只需要遍历每个单词并检查它是否匹配。
Currently the regex search is having to go through the entire string again each time, looking for word boundaries, and then "discarding" the result of this work before the next pass.
目前,正则表达式搜索每次都必须再次遍历整个字符串,寻找单词边界,然后在下一次传递之前“丢弃”这项工作的结果。
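This answer gives no code; a minimal sketch of the idea, assuming a set of banned words and keeping the separators so each sentence can be rebuilt:
这个回答没有给出代码;下面是该思路的一个最小示例,假设禁用词放在一个集合中,并保留分隔符以便重建句子:

import re

banned = {"cat", "dog"}  # illustrative stand-in for the 20,000 banned words
sentences = ["The cat chased the dog.", "A catalog is not a cat."]

# Pre-tokenize each sentence once: the capturing group keeps the non-word
# separators so the sentence can be reassembled afterwards.
tokenizer = re.compile(r"(\W+)")

cleaned = []
for sentence in sentences:
    tokens = tokenizer.split(sentence)
    cleaned.append("".join(t for t in tokens if t.lower() not in banned))

print(cleaned)  # ['The  chased the .', 'A catalog is not a .']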
回答by peufeu
Well, here's a quick and easy solution, with test set.
好吧,这是一个带有测试集的快速简便的解决方案。
Winning strategy:
获胜策略:
re.sub("\w+",repl,sentence) searches for words.
re.sub("\w+",repl,sentence) 搜索单词。
"repl" can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.
“repl”可以是可调用的。我使用了一个执行 dict 查找的函数,该 dict 包含要搜索和替换的单词。
This is the simplest and fastest solution (see function replace4 in example code below).
这是最简单和最快的解决方案(请参阅下面示例代码中的函数 replace4)。
Second best
次好的
The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.
这个想法是使用 re.split 将句子拆分为单词,同时保留分隔符以稍后重建句子。然后,通过简单的 dict 查找完成替换。
(see function replace3 in example code below).
(请参阅下面示例代码中的函数 replace3)。
Timings for example functions:
示例功能的时间:
replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)
...and code:
...和代码:
#! /bin/env python3
# -*- coding: utf-8

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( r"\b"+search+r"\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )
        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)
    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)

# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile(r"\b"+search+r"\b"), repl) for search, repl in patterns ]

def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )
Edit: You can also ignore case when checking, if you pass in a lowercase list of sentences and edit repl:
编辑:如果传入的是小写的句子列表并相应修改 repl,还可以在检查时忽略大小写:
def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w.lower(),w)
回答by karakfa
Perhaps Python is not the right tool here. Here is one with the Unix toolchain
也许 Python 不是这里的正确工具。这是 Unix 工具链中的一个
sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'
assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double spaced, split each sentence to one word per line, mass delete the blacklist words from the file, and merge back the lines.
假设您的黑名单文件经过预处理,并添加了单词边界。步骤为:将文件转换为双倍行距,将每个句子拆分为每行一个单词,从文件中批量删除黑名单单词,并合并回行。
This should run at least an order of magnitude faster.
这应该至少快一个数量级。
For preprocessing the blacklist file from words (one word per line)
用于从单词预处理黑名单文件(每行一个单词)
sed 's/.*/\b&\b/' words > blacklist
回答by Lie Ryan
How about this:
这个怎么样:
#!/usr/bin/env python3
from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))

def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))

corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)
These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with word alternations (Liteye's solution), because these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup, while using regex alternations would cause the regex engine to check for word matches at every character rather than just at word boundaries. My solution takes extra care to preserve the whitespace that was used in the original text (i.e. it doesn't compress whitespace and preserves tabs, newlines, and other whitespace characters), but if you decide that you don't care about it, it should be fairly straightforward to remove them from the output.
这些解决方案在单词边界处拆分,并在集合中查找每个单词。它们应该比基于单词交替的 re.sub(Liteye 的解决方案)更快:由于摊还 O(1) 的集合查找,这些解决方案是 O(n)(n 为输入大小),而使用正则表达式交替会让正则表达式引擎在每个字符处检查单词匹配,而不仅仅是在单词边界处。我的解决方案特别注意保留原始文本中使用的空白(即不压缩空白,并保留制表符、换行符和其他空白字符),但如果您决定不在乎这一点,从输出中去掉它们应该相当简单。
I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from the Gutenberg Project, and banned_words.txt is 20000 words randomly picked from Ubuntu's wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I've defined sentences as anything separated by ". ".
我在 corpus.txt 上进行了测试,它是从 Gutenberg 项目下载的多本电子书的串联,banned_words.txt 是从 Ubuntu 的词表 (/usr/share/dict/american-english) 中随机选取的 20000 个单词。处理 862462 个句子大约需要 30 秒(在 PyPy 上约为一半时间)。我将句子定义为以 ". "(句点加空格)分隔的任何内容。
$ # replace_sentences_1()
$ python3 filter_words.py
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py
number of sentences: 862462
time: 15.9370770454
$ # replace_sentences_2()
$ python3 filter_words.py
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py
number of sentences: 862462
time: 13.1190629005
PyPy in particular benefits more from the second approach, while CPython fared better with the first approach. The above code should work on both Python 2 and 3.
PyPy 从第二种方法中获益更多,而 CPython 在第一种方法中表现更好。上面的代码应该适用于 Python 2 和 3。
回答by I159
Practical approach
实用方法
A solution described below uses a lot of memory to store all the text in the same string and to reduce the complexity level. If RAM is an issue, think twice before using it.
下面描述的解决方案使用大量内存将所有文本存储在同一字符串中并降低复杂性级别。如果 RAM 是一个问题,请在使用前三思。
With join/split tricks you can avoid loops altogether, which should speed up the algorithm.
使用 join/split 技巧,您可以完全避免循环,从而加速算法。
merged_sentences = ' * '.join(sentences)
Compile a regex for all the words you need to delete from the sentences, using the "or" (|) regex statement:
用"或"(|)正则表达式语句,为所有需要从句子中删除的单词编译一个正则表达式:
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag
Then substitute with the compiled regex and split the result back into sentences:
然后用编译好的正则表达式进行替换,再把结果拆分回句子:
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
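Putting the three snippets together into a tiny runnable example (the word and sentence lists below are illustrative stand-ins):
把这三段代码拼成一个可运行的小例子(其中的单词和句子列表仅作示意):

import re

sentences = ["The cat chased the dog.", "A catalog is not a cat."]
words = ["cat", "dog"]

merged_sentences = ' * '.join(sentences)
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I)  # re.I: case insensitive
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

print(clean_sentences)  # ['The  chased the .', 'A catalog is not a .']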
Performance
表现
"".join
complexity is O(n). This is pretty intuitive but anyway there is a shortened quotation from a source:
"".join
复杂度是 O(n)。这是非常直观的,但无论如何有一个来源的简短引用:
for (i = 0; i < seqlen; i++) {
    [...]
    sz += PyUnicode_GET_LENGTH(item);
Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity, vs 2*O(N²) with the initial approach.
因此,使用 join/split 后复杂度为 O(words) + 2*O(sentences),仍然是线性复杂度;而初始方法则是 2*O(N²)。
By the way, don't use multithreading. The GIL will block each operation: your task is strictly CPU-bound, so the GIL has no chance to be released, while each thread would still send ticks concurrently, adding extra overhead and potentially slowing the operation down without bound.
顺便说一句,不要使用多线程。GIL 会阻塞每个操作:您的任务是严格的 CPU 密集型任务,GIL 没有机会被释放,而每个线程还会并发地发送时钟滴答,带来额外开销,甚至让操作慢得看不到尽头。
回答by Edi Bice
Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here's one) to locate all your "bad" words. Traverse the file, replacing each bad word, updating the offsets of found words that follow etc.
将您的所有句子连接到一个文档中。使用 Aho-Corasick 算法的任何实现(这里是一个)来定位所有“坏”词。遍历文件,替换每个坏词,更新后面找到的词的偏移量等。