Python: re.compile and re.sub
Original question: http://stackoverflow.com/questions/18457101/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by dustinboettcher
Question part 1
I have this file, f1:
<something @37>
<name>George Washington</name>
<a23c>Joe Taylor</a23c>
</something @37>
and I want to use re.compile to transform it so that it looks like this (with spaces):
George Washington Joe Taylor
I tried this code, but it deletes pretty much everything:
import re
file = open('f1.txt')
fixed = open('fnew.txt','w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ',text)
    fixed.write(fixed_doc)
My guess is that the re.compile line is the problem, but I'm not quite sure what to do with it. I'm not supposed to use third-party extensions. Any ideas?
Question part 2
I had a different question about comparing two files, and I got this code from Alfe:
from collections import Counter
def test():
    with open('f1.txt') as f:
        contentsI = f.read()
    with open('f2.txt') as f:
        contentsO = f.read()
    tokensI = Counter(value for value in contentsI.split()
                      if value not in [])
    tokensO = Counter(value for value in contentsO.split()
                      if value not in [])
    return not (tokensI - tokensO) and not (set(tokensO) - set(tokensI))
Is it possible to use re.compile and re.sub in the 'if value not in []' part?
Accepted answer by eyquem
I will explain what happens with your code:
import re
file = open('f1.txt')
fixed = open('fnew.txt','w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ',text)
    fixed.write(fixed_doc)
The instruction text = file.read() creates an object of type string, named text.
Note that in the original post I use bold characters to denote an OBJECT and code formatting to denote the name (the IDENTIFIER) of that object.
As a consequence of the instruction for unwanted in text:, the identifier unwanted is successively assigned to each character of the text object.
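To see concretely what that loop does, here is a minimal sketch in Python 3. The short sample string and the names pattern, chars and results are made up for illustration; they are not from the question. Iterating over a string yields one character at a time, and because the loop body never uses unwanted, every pass produces the same result.

```python
import re

text = "<a>x</a>\n<b>y</b>"   # hypothetical stand-in for the file contents
pattern = re.compile('<.*>')

# Iterating over a string yields its characters one by one.
chars = [unwanted for unwanted in text]

# The loop variable is never used, so each pass repeats the same substitution.
results = [pattern.sub(' ', text) for unwanted in text]

print(chars[:3])
print(len(results) == len(text), set(results))
```

In other words, the loop in the question writes the same substituted text once per character of the input.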
Besides, re.compile('<.*>') creates an object of type RegexObject (named re.Pattern in recent Python 3 versions), which I personally call a compiled regex, or simply a regex, <.*> being only the regex pattern.
You assign this compiled regex object to the identifier match: this is very bad practice, because match is already the name of a method of regex objects in general, and of the one you created in particular, so you could write match.match without error. match is also the name of a function of the re module.
Using this name for your particular need is very confusing. You must avoid it.
There's the same flaw in using file as the name for the file handle of file f1. file is already a built-in name in Python 2 (the file type), so you must avoid it.
Well. Now that this badly named match object is defined, the instruction fixed_doc = match.sub(r' ',text) replaces all the occurrences found by the regex match in text with the replacement r' '.
Note that it's completely superfluous to write r' ' instead of just ' ', because there's absolutely nothing in ' ' that needs to be escaped. It's a fad of some anxious people to write raw strings every time they have to write a string in a regex problem.
Because of its pattern <.*>, in which the dot means "greedily eat every character situated between a < and a >, except newline characters", the occurrences caught in the text by match are each whole line, up to the last > in it.
As the name unwanted doesn't appear in this instruction, the same operation is done for each character of the text, one after the other. That is to say: nothing interesting.
To analyze the execution of a program, you should put some print instructions in your code to understand what happens. For example, if you do print repr(fixed_doc), you'll see this printed repeatedly: ' \n \n \n '. As I said: nothing interesting.
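The greedy behaviour described above can be checked directly. This is a minimal sketch in Python 3 that uses the sample from the question as an inline string instead of reading f1.txt; the names greedy and matches are illustrative.

```python
import re

text = """<something @37>
<name>George Washington</name>
<a23c>Joe Taylor</a23c>
</something @37>"""

greedy = re.compile('<.*>')

# Without re.DOTALL the dot stops at newlines, so each match is confined to
# one line, but it runs from the first '<' to the LAST '>' on that line.
matches = greedy.findall(text)
for m in matches:
    print(repr(m))

# Every line is swallowed whole, names included.
print(repr(greedy.sub(' ', text)))
```

Each match is a complete line of the sample, which is why the names disappear along with the tags.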
There's one more defect in your code: you open files, but you don't close them. Files must be closed, otherwise weird phenomena can occur, which I personally observed in some of my own code before I realized this need. Some people claim it isn't mandatory, but that's false.
By the way, the best way to open and close files is to use the with statement. It does all the work without you having to worry about it.
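As an illustration of the with statement, here is a minimal sketch in Python 3. It works in a temporary directory so that it can be run anywhere; the paths and the names src and dst are assumptions made for the example, not part of the original code.

```python
import os
import tempfile

# Hypothetical file names inside a temporary directory, for illustration only.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'f1.txt')
dst_path = os.path.join(workdir, 'fnew.txt')

with open(src_path, 'w') as src:
    src.write('<name>George Washington</name>\n')

# Both files are closed automatically when the with block ends,
# even if an exception is raised inside it.
with open(src_path) as src, open(dst_path, 'w') as dst:
    dst.write(src.read())

print(src.closed, dst.closed)
```

No explicit close() call is needed; leaving the block is what closes the files.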
So, now I can propose some code for your first problem:
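import re
def ripl(mat=None,li = []):
    if mat is None:
        # called without argument: empty the shared list of recorded spans
        li[:] = []
        return
    if mat.group(1):
        # start tag: record the span of its corresponding end tag
        li.append(mat.span(2))
        return ''
    elif mat.span() in li:
        # end tag already recorded for a previous start tag
        return ''
    else:
        return mat.group()
# the backreference must be written \\1 in a non-raw string,
# otherwise '\1' is read as the character chr(1)
r = re.compile('</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\1>))',
               re.DOTALL)
text = '''<something @37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>'''
print('1------------------------------------1')
print(text)
print('2------------------------------------2')
ripl()
print(r.sub(ripl,text))
print('3------------------------------------3')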
Result:
1------------------------------------1
<something @37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>
2------------------------------------2
George <wxc>Washington
Joe </zazaza>Taylor
3------------------------------------3
The principle is as follows:
When the regex detects a tag,
- if it's an end tag, it matches
- if it's a start tag, it matches only if there is a corresponding end tag somewhere further in the text
For each match, the method sub() of the regex r calls the function ripl() to perform the replacement.
If the match is a start tag (which is necessarily followed somewhere in the text by its corresponding end tag, by construction of the regex), then ripl() returns ''.
If the match is an end tag, ripl() returns '' only if this end tag has previously been detected as the corresponding end tag of an earlier start tag. This is made possible by recording in a list li the span of the corresponding end tag each time a start tag is detected and matched.
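The mechanism this relies on is that sub() accepts a function as its replacement: the function receives each match object and returns the string to put in its place. Here is a minimal sketch in Python 3, deliberately unrelated to the tag problem; the function name shout and the sample string are made up for the example.

```python
import re

def shout(mat):
    # Called once per match; the return value replaces the matched text.
    word = mat.group()
    return word.upper() if len(word) > 3 else word

replaced = re.sub('[a-z]+', shout, 'joe and george')
print(replaced)
```

Returning mat.group() unchanged, as ripl() does in its last branch, leaves that match untouched in the output.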
The recording list li is defined as a default argument so that the same list is used at each call of the function ripl() (please read up on how default arguments work to understand this, because it's subtle).
As a consequence of defining li as a parameter with a default argument, the list object li would retain all the spans recorded from earlier texts if several texts were analyzed successively. To prevent the list li from retaining spans of past text matches, it is necessary to empty the list. I wrote the function so that the first parameter has the default argument None: that allows calling ripl() without arguments before any use of it in a regex's sub() method.
So one must remember to call ripl() without arguments before each use of it.
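The default-argument behaviour that ripl() depends on can be observed in isolation. This is a minimal sketch in Python 3 with a hypothetical function record(), which mimics only the list handling of ripl(), not its regex logic.

```python
def record(item=None, seen=[]):
    # The default list is created once, when the function is defined,
    # and the same object is reused by every call that omits `seen`.
    if item is None:
        seen[:] = []      # empty the shared list in place
        return seen
    seen.append(item)
    return seen

print(record('a'))
print(record('b'))   # still contains 'a' from the previous call
print(record())      # reset, as ripl() does when called without argument
print(record('c'))
```

The in-place reset (seen[:] = []) matters: rebinding seen to a new list inside the function would not empty the shared default.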
If you want to remove the newlines from the text in order to obtain the precise result you showed in your question, the code must be modified to:
import re
def ripl(mat=None,li = []):
    if mat is None:
        # called without argument: empty the shared list of recorded spans
        li[:] = []
        return
    if mat.group(1):
        # newline with surrounding spaces: remove it
        return ''
    elif mat.group(2):
        # start tag: record the span of its corresponding end tag
        li.append(mat.span(3))
        return ''
    elif mat.span() in li:
        return ''
    else:
        return mat.group()
# the backreference must be written \\2 in a non-raw string,
# otherwise '\2' is read as the character chr(2)
r = re.compile('( *\n *)'
               '|'
               '</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\2>)) *',
               re.DOTALL)
text = '''<something @37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>'''
print('1------------------------------------1')
print(text)
print('2------------------------------------2')
ripl()
print(r.sub(ripl,text))
print('3------------------------------------3')
Result:
1------------------------------------1
<something @37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>
2------------------------------------2
George <wxc>WashingtonJoe </zazaza>Taylor
3------------------------------------3
Answer by Moe Jan
You can use Beautiful Soup to do this easily:
from bs4 import BeautifulSoup
file = open('f1.txt')
fixed = open('fnew.txt','w')
#now for some soup
soup = BeautifulSoup(file)
fixed.write(str(soup.get_text()).replace('\n',' '))
The output of the above will be:
George Washington Joe Taylor
(At least this works with the sample you gave me)
Sorry, I don't understand part 2. Good luck!
Answer by dustinboettcher
Figured the first part out: it was the missing '?'
match = re.compile('<.*?>')
does the trick.
Anyway, I'm still not sure about the second question. :/
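For reference, here is a minimal sketch of that fix in Python 3, run on the sample from the question as an inline string. The names tag, stripped and cleaned are illustrative, and the final whitespace-collapsing step is an addition assumed here in order to get the single-line output asked for; it is not part of the answer above.

```python
import re

text = """<something @37>
<name>George Washington</name>
<a23c>Joe Taylor</a23c>
</something @37>"""

tag = re.compile('<.*?>')   # lazy: stops at the first '>' after each '<'

stripped = tag.sub(' ', text)
# Collapse the leftover runs of spaces and newlines into single spaces.
cleaned = ' '.join(stripped.split())
print(cleaned)
```

Note that this treats anything between < and > as a tag, so unlike the accepted answer it would also strip unmatched tags such as a stray <wxc>.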
Answer by Prahalad Deshpande
For part 1, try the code snippet below. However, consider using a library like beautifulsoup, as suggested by Moe Jan.
import re
import os
def main():
    f = open('sample_file.txt')
    fixed = open('fnew.txt','w')
    #pattern = re.compile(r'(?P<start_tag>\<.+?\>)(?P<content>.*?)(?P<end_tag>\</.+?\>)')
    pattern = re.compile(r'(?P<start><.+?>)(?P<content>.*?)(</.+?>)')
    output_text = []
    for text in f:
        match = pattern.match(text)
        if match is not None:
            output_text.append(match.group('content'))
    fixed_content = ' '.join(output_text)
    fixed.write(fixed_content)
    f.close()
    fixed.close()
if __name__ == '__main__':
    main()
For part 2:
I am not completely clear about what you are asking; however, my guess is that you want to do something like if re.sub(value) not in []. Note, however, that you need to call re.compile only once, prior to initializing the Counter instance. It would be better if you clarified the second part of your question.
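One possible reading of that suggestion, sketched in Python 3 under assumptions: the tags are stripped before tokenizing, and the pattern is compiled once, outside the generator. The helper name tokens, the constant TAG and the inline sample strings are hypothetical; the comparison line is the one from the question.

```python
import re
from collections import Counter

TAG = re.compile('<.*?>')   # compiled once, reused for both inputs

def tokens(contents):
    # Replace tags with spaces, then count the remaining words.
    return Counter(TAG.sub(' ', contents).split())

# Hypothetical contents standing in for f1.txt and f2.txt.
contentsI = '<name>George Washington</name>\n<a23c>Joe Taylor</a23c>'
contentsO = 'Joe Taylor George Washington'

tokensI, tokensO = tokens(contentsI), tokens(contentsO)
same = not (tokensI - tokensO) and not (set(tokensO) - set(tokensI))
print(same)
```

Whether this is what the question intended is unclear; it only shows where a compiled regex could fit into the Counter-based comparison.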
Actually, I would recommend using Python's built-in difflib module to find the differences between two files. This is better than writing your own diff algorithm, since the diff logic is well tested and widely used, and is less vulnerable to logical or programming errors caused by spurious newline, tab, and space characters.
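A minimal sketch of that approach in Python 3, using difflib.unified_diff from the standard library. The two lists of lines are made-up stand-ins for the contents of f1.txt and f2.txt.

```python
import difflib

# Hypothetical file contents, already split into lines.
lines1 = ['George Washington\n', 'Joe Taylor\n']
lines2 = ['George Washington\n', 'Joe Tailor\n']

diff = list(difflib.unified_diff(lines1, lines2,
                                 fromfile='f1.txt', tofile='f2.txt'))
print(''.join(diff))
```

Keep in mind that difflib compares lines exactly, so differences in whitespace still show up unless the lines are normalized (for example stripped) before comparing.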