What is the best way to remove repeated lines matching a regex from a string using Python?
Time: 2020-03-06 15:05:14  Source: igfitidea
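This is a fairly simple attempt. I haven't been using Python for long. It seems to work, but I'm sure I have a lot to learn. Someone let me know if I'm way off here. It needs to look for a pattern, write out the first line that matches, add a summary message for the remaining consecutive lines that match the pattern, and return the modified string.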
Just to be clear, with the regex .*Dog.* it should take

    Cat
    Dog
    My Dog
    Her Dog
    Mouse

and return

    Cat
    Dog
    ::::: Pattern .*Dog.* repeats 2 more times.
    Mouse

Here is my attempt:

    #!/usr/bin/env python

    import re


    def remove_repeats(l_string, l_regex):
        """Take a string, remove similar lines and replace with a summary message.

        l_regex accepts strings and tuples.
        """
        # Convert string to tuple.
        if isinstance(l_regex, str):
            l_regex = (l_regex,)

        for t in l_regex:
            r = ''  # String rebuilt for this pattern.
            p = ''  # Pending summary message.
            m = 0   # Number of consecutive matches seen so far.
            for l in l_string.splitlines(True):
                if l.startswith('::::: Pattern'):
                    r = r + l
                else:
                    if re.search(t, l):  # If line matches regex.
                        m += 1
                        if m == 1:  # If this is first match in a set of lines add line to file.
                            r = r + l
                        elif m > 1:  # Else update the message string.
                            p = "::::: Pattern '" + t + "' repeats " + str(m - 1) + ' more times.\n'
                    else:
                        if p:  # Write the message string if it has value.
                            r = r + p
                            p = ''
                        m = 0
                        r = r + l

            if p:  # Write the message if loop ended in a pattern.
                r = r + p
                p = ''

            l_string = r  # Reset string to modified string.

        return l_string
Solution
The rematcher function seems to do what you want:
    import re
    import sys


    def rematcher(re_str, iterable):
        matcher = re.compile(re_str)
        in_match = 0
        for item in iterable:
            if matcher.match(item):
                if in_match == 0:
                    yield item
                in_match += 1
            else:
                if in_match > 1:
                    yield "%s repeats %d more times\n" % (re_str, in_match - 1)
                in_match = 0
                yield item
        if in_match > 1:
            yield "%s repeats %d more times\n" % (re_str, in_match - 1)


    for line in rematcher(".*Dog.*", sys.stdin):
        sys.stdout.write(line)
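In your case, the final string would be built from the lines of the input. Keeping each line's own newline avoids a blank line after the summary message, which already ends in "\n":

    final_string = ''.join(rematcher(".*Dog.*", your_initial_string.splitlines(True)))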
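The summary strings yielded by rematcher end in "\n", so how the input string is split into lines affects the result. A small sketch of the difference between split("\n") and splitlines(True), using a made-up sample string (the name text is only for illustration):

```python
text = "Cat\nDog\n"  # made-up sample input, not from the question

# split("\n") drops the newlines and leaves a trailing empty string.
print(text.split("\n"))       # ['Cat', 'Dog', '']

# splitlines(True) keeps each line's own line ending.
print(text.splitlines(True))  # ['Cat\n', 'Dog\n']

# Lines that kept their endings can be joined back without a separator.
assert "".join(text.splitlines(True)) == text
```

Lines that still carry their newline can be mixed with the summary messages and joined with an empty separator.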
Updated your code to be a bit more effective:

    #!/usr/bin/env python

    import re


    def remove_repeats(l_string, l_regex):
        """Take a string, remove similar lines and replace with a summary message.

        l_regex accepts strings/patterns or tuples of strings/patterns.
        """
        # Convert a single string/pattern to a tuple. The explicit str check is
        # needed on Python 3, where strings also have __iter__.
        if isinstance(l_regex, str) or not hasattr(l_regex, '__iter__'):
            l_regex = (l_regex,)

        ret = []
        last_regex = None
        count = 0

        for line in l_string.splitlines(True):
            if last_regex:
                # Previous line matched one of the regexes
                if re.match(last_regex, line):
                    # This one does too
                    count += 1
                    continue  # skip to next line
                elif count > 1:
                    ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count - 1))
                count = 0
                last_regex = None

            ret.append(line)

            # Look for other patterns that could match
            for regex in l_regex:
                if re.match(regex, line):
                    # Found one
                    last_regex = regex
                    count = 1
                    break  # exit inner loop

        # Write the summary if the input ended inside a run of matches.
        if last_regex and count > 1:
            ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count - 1))

        return ''.join(ret)
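First, your regular expression will match more slowly than it would without the greedy wildcards. When used with re.search,

    .*Dog.*

is equivalent to

    Dog

but the latter matches faster because no backtracking is involved. (With re.match the two differ: the bare pattern only matches lines that begin with "Dog".) The longer the strings, the more likely it is that "Dog" appears several times, and the more backtracking work the regex engine has to do. As written, ".*D" virtually guarantees backtracking.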
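A rough way to check both claims is to compare the two patterns on a few lines and time them with timeit. This is only a sketch: the long test line and the iteration count are arbitrary assumptions, and the timings will vary with the machine and Python version.

```python
import re
import timeit

# Hypothetical long input line with "Dog" in the middle.
line = "x" * 2000 + " Dog " + "y" * 2000

greedy = re.compile(r".*Dog.*")
plain = re.compile(r"Dog")

# With search(), both patterns agree on whether a line contains "Dog".
for text in ("Cat", "Dog", "My Dog", "Doggy", line):
    assert bool(greedy.search(text)) == bool(plain.search(text))

# With match(), the bare pattern is only tried at the start of the string.
print(bool(plain.match("My Dog")))   # False
print(bool(greedy.match("My Dog")))  # True

# Rough timing comparison; no particular numbers are implied.
print("greedy:", timeit.timeit(lambda: greedy.search(line), number=2000))
print("plain: ", timeit.timeit(lambda: plain.search(line), number=2000))
```

The equivalence therefore depends on calling search rather than match.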
That said, how about:

    #! /usr/bin/env python

    import re         # regular expressions
    import fileinput  # read from STDIN or file

    my_regex = '.*Dog.*'
    my_matches = 0

    for line in fileinput.input():
        line = line.strip()

        if re.search(my_regex, line):
            if my_matches == 0:
                print(line)
            my_matches = my_matches + 1
        else:
            if my_matches != 0:
                print('::::: Pattern %s repeats %i more times.' % (my_regex, my_matches - 1))
            print(line)
            my_matches = 0
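It's not clear what should happen with non-adjacent matches.
It's also not clear what should happen when a single matching line is surrounded by non-matching lines. Append "Doggy" and "Hula" to the input file and you get the summary message saying the pattern repeats "0" more times.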
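One possible reading of those two cases is that each run of adjacent matches is summarized on its own, and a lone match gets no summary at all. A minimal sketch of that interpretation, using itertools.groupby instead of the counter above; the function name collapse_runs and the sample text are illustrative assumptions, not part of the original answers:

```python
import re
from itertools import groupby


def collapse_runs(text, pattern):
    """Keep the first line of each run of adjacent matches, summarize the rest."""
    matcher = re.compile(pattern)
    out = []
    # groupby splits the lines into runs of adjacent matching or
    # non-matching lines, so non-adjacent matches form separate runs.
    lines = text.splitlines(True)
    for is_match, run in groupby(lines, key=lambda ln: bool(matcher.search(ln))):
        run = list(run)
        if not is_match:
            out.extend(run)
            continue
        out.append(run[0])
        # A single match surrounded by other lines gets no summary.
        if len(run) > 1:
            out.append("::::: Pattern %s repeats %d more times.\n"
                       % (pattern, len(run) - 1))
    return "".join(out)


sample = "Cat\nDog\nMy Dog\nHer Dog\nMouse\nDoggy\nHula\n"
print(collapse_runs(sample, ".*Dog.*"), end="")
```

With this sample, the three Dog lines collapse to "Dog" plus a "repeats 2 more times" summary, while "Doggy" is printed alone with no "0 more times" message. Whether that is the desired behavior depends on what the question intended.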