Comparing two XML files in Python
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/24492895/
Asked by sankar
I am new to programming in Python, and I have some trouble understanding the concept. I wish to compare two XML files. These XML files are quite large. Here is an example of the type of files I wish to compare.
xmlfile1:
<xml>
    <property1>
        <property2>
            <property3>
            </property3>
        </property2>
    </property1>
</xml>
xmlfile2:
<xml>
    <property1>
        <property2>
            <property3>
                <property4>
                </property4>
            </property3>
        </property2>
    </property1>
</xml>
The names property1, property2 and so on are placeholders; they differ from the names actually in the files. There are a lot of properties within the XML files, and I wish to compare the two files.
I am using the lxml parser to try to compare the two files and print out the differences between them.
I do not know how to parse them and compare them automatically.
I tried reading about the lxml parser, but I couldn't understand how to apply it to my problem.
Can someone please tell me how I should proceed with this problem?
Code snippets would be very useful.
One more question: am I following the right concept, or am I missing something else? Please correct me and point out any new concepts that you know about.
Answered by Nick Bastin
This is actually a reasonably challenging problem, because what counts as a "difference" is often in the eye of the beholder: there will be semantically "equivalent" information that you probably don't want marked as differences.
You could try using xmldiff, which is based on work in the paper Change Detection in Hierarchically Structured Information.
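To illustrate what semantically "equivalent" can mean here, the following is a minimal sketch using only the standard library (it assumes Python 3.8 or later, where xml.etree.ElementTree.canonicalize is available). The two sample documents and the normalize helper are hypothetical, made up for this illustration; they are not part of xmldiff or of the original answer.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that differ only in attribute order,
# quoting style and insignificant whitespace.
doc_a = "<item b='2' a='1'>\n  <name>x</name>\n</item>"
doc_b = '<item a="1" b="2"><name>x</name></item>'


def normalize(xml_text):
    # canonicalize() emits C14N 2.0 output, which orders attributes
    # consistently; strip_text=True trims whitespace around text content.
    return ET.canonicalize(xml_text, strip_text=True)


# The raw strings are not equal, but the normalized forms are.
print(doc_a == doc_b)
print(normalize(doc_a) == normalize(doc_b))
```

A plain text diff would flag these two documents as different, which is usually not what you want; normalizing first removes that kind of noise before any real comparison starts.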
Answered by danimirror
My approach to the problem was to transform each XML document into an xml.etree.ElementTree and iterate through each of the layers. I also included the ability to ignore a list of attributes during the comparison.
The first block of code holds the class used:
import logging
import xml.etree.ElementTree as ET


class XmlTree:

    def __init__(self):
        # Log a debug message for every mismatch to xml-comparison.log.
        self.logger = logging.getLogger('xml-comparison')
        self.logger.setLevel(logging.DEBUG)
        if not self.logger.handlers:
            hdlr = logging.FileHandler('xml-comparison.log')
            hdlr.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
            self.logger.addHandler(hdlr)

    @staticmethod
    def convert_string_to_tree(xmlString):
        return ET.fromstring(xmlString)

    def xml_compare(self, x1, x2, excludes=()):
        """
        Compares two xml etrees
        :param x1: the first tree
        :param x2: the second tree
        :param excludes: list of attribute (or child tag) names to exclude from comparison
        :return:
            True if both files match
        """
        if x1.tag != x2.tag:
            self.logger.debug('Tags do not match: %s and %s' % (x1.tag, x2.tag))
            return False
        # Every non-excluded attribute of x1 must have the same value in x2.
        for name, value in x1.attrib.items():
            if name not in excludes:
                if x2.attrib.get(name) != value:
                    self.logger.debug('Attributes do not match: %s=%r, %s=%r'
                                      % (name, value, name, x2.attrib.get(name)))
                    return False
        # x2 must not carry non-excluded attributes that x1 lacks.
        for name in x2.attrib.keys():
            if name not in excludes:
                if name not in x1.attrib:
                    self.logger.debug('x2 has an attribute x1 is missing: %s'
                                      % name)
                    return False
        if not self.text_compare(x1.text, x2.text):
            self.logger.debug('text: %r != %r' % (x1.text, x2.text))
            return False
        if not self.text_compare(x1.tail, x2.tail):
            self.logger.debug('tail: %r != %r' % (x1.tail, x2.tail))
            return False
        # Element.getchildren() was removed in Python 3.9; list(elem) is equivalent.
        cl1 = list(x1)
        cl2 = list(x2)
        if len(cl1) != len(cl2):
            self.logger.debug('children length differs, %i != %i'
                              % (len(cl1), len(cl2)))
            return False
        # Children are compared pairwise, in document order.
        i = 0
        for c1, c2 in zip(cl1, cl2):
            i += 1
            if c1.tag not in excludes:
                if not self.xml_compare(c1, c2, excludes):
                    self.logger.debug('children %i do not match: %s'
                                      % (i, c1.tag))
                    return False
        return True

    def text_compare(self, t1, t2):
        """
        Compare two text strings
        :param t1: text one
        :param t2: text two
        :return:
            True if a match
        """
        if not t1 and not t2:
            return True
        # '*' on either side acts as a wildcard that matches any text.
        if t1 == '*' or t2 == '*':
            return True
        return (t1 or '').strip() == (t2 or '').strip()
The second block of code holds a couple of XML examples and their comparison:
xml1 = "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"
xml2 = "<note><to>Tove</to><from>Daniel</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"

tree1 = XmlTree.convert_string_to_tree(xml1)
tree2 = XmlTree.convert_string_to_tree(xml2)

comparator = XmlTree()

# "from" is passed in the excludes list, so the differing <from> elements are skipped.
if comparator.xml_compare(tree1, tree2, ["from"]):
    print("XMLs match")
else:
    print("XMLs don't match")
Most of the credit for this code must be given to syawar
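Since the question asks for the differences to be printed, here is a hedged sketch of a variant that walks the trees the same way as the class above but collects every mismatch with its path instead of returning at the first one. The function name collect_differences and the message format are assumptions made for this illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET


def collect_differences(x1, x2, excludes=(), path=""):
    """Return a list of human-readable differences between two elements."""
    path = f"{path}/{x1.tag}"
    if x1.tag != x2.tag:
        return [f"{path}: tag differs from {x2.tag!r}"]
    diffs = []
    # Compare attributes, skipping any name listed in excludes.
    names = (set(x1.attrib) | set(x2.attrib)) - set(excludes)
    for name in sorted(names):
        v1, v2 = x1.attrib.get(name), x2.attrib.get(name)
        if v1 != v2:
            diffs.append(f"{path}: attribute {name!r}: {v1!r} != {v2!r}")
    # Compare text, ignoring surrounding whitespace.
    t1, t2 = (x1.text or "").strip(), (x2.text or "").strip()
    if t1 != t2:
        diffs.append(f"{path}: text {t1!r} != {t2!r}")
    children1, children2 = list(x1), list(x2)
    if len(children1) != len(children2):
        diffs.append(f"{path}: {len(children1)} children vs {len(children2)}")
    # Children are paired by position, as in the class above.
    for c1, c2 in zip(children1, children2):
        if c1.tag not in excludes:
            diffs.extend(collect_differences(c1, c2, excludes, path))
    return diffs


a = ET.fromstring("<note><to>Tove</to><from>Jani</from></note>")
b = ET.fromstring("<note><to>Tove</to><from>Daniel</from></note>")
for line in collect_differences(a, b):
    print(line)
```

Because children are paired by position, an inserted or removed element shifts every later sibling out of alignment; for documents where that matters, a dedicated tool such as xmldiff is likely a better fit.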
Answered by user3483524
Another script using xml.etree. It's awful but it works :)
#!/usr/bin/env python
import sys
import xml.etree.ElementTree as ET

from termcolor import colored  # third-party package

# Usage: script.py first.xml second.xml
tree1 = ET.parse(sys.argv[1])
root1 = tree1.getroot()
tree2 = ET.parse(sys.argv[2])
root2 = tree2.getroot()


class Element:
    def __init__(self, e):
        self.name = e.tag
        self.subs = {}
        self.atts = {}
        # Children are keyed by tag, so repeated sibling tags overwrite each other.
        for child in e:
            self.subs[child.tag] = Element(child)
        for att in e.attrib.keys():
            self.atts[att] = e.attrib[att]
        print("name: %s, len(subs) = %d, len(atts) = %d" % (self.name, len(self.subs), len(self.atts)))

    def compare(self, el):
        # Only attributes and children present in the first tree are reported.
        if self.name != el.name:
            raise RuntimeError("Two names are not the same")
        print("----------------------------------------------------------------")
        print(self.name)
        print("----------------------------------------------------------------")
        for att in self.atts.keys():
            v1 = self.atts[att]
            if att not in el.atts.keys():
                v2 = '[NA]'
                color = 'yellow'
            else:
                v2 = el.atts[att]
                if v2 == v1:
                    color = 'green'
                else:
                    color = 'red'
            print(colored("first:\t%s = %s" % (att, v1), color))
            print(colored("second:\t%s = %s" % (att, v2), color))
        for subName in self.subs.keys():
            if subName not in el.subs.keys():
                # termcolor has no 'purple'; 'magenta' is the closest supported color.
                print(colored("first:\thas got %s" % (subName), 'magenta'))
                print(colored("second:\thasn't got %s" % (subName), 'magenta'))
            else:
                self.subs[subName].compare(el.subs[subName])


e1 = Element(root1)
e2 = Element(root2)
e1.compare(e2)
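The script above keys children by tag name, so repeated sibling tags collapse into one entry. As a rough alternative, the sketch below counts every element path in each document with collections.Counter and reports paths that occur more often in one file than in the other. It uses only the standard library; the helper name element_paths is hypothetical, and the two inline strings stand in for the files from the question (for real files, ET.parse(filename).getroot() would supply the root element).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(elem, prefix=""):
    """Count every element path in the tree, e.g. '/xml/property1'."""
    path = f"{prefix}/{elem.tag}"
    counts = Counter([path])
    for child in elem:
        counts += element_paths(child, path)
    return counts


# Inline stand-ins for xmlfile1 and xmlfile2 from the question.
xml1 = "<xml><property1><property2><property3/></property2></property1></xml>"
xml2 = ("<xml><property1><property2><property3><property4/>"
        "</property3></property2></property1></xml>")

paths1 = element_paths(ET.fromstring(xml1))
paths2 = element_paths(ET.fromstring(xml2))

# Counter subtraction keeps only the paths with a positive surplus.
only_in_first = paths1 - paths2
only_in_second = paths2 - paths1

for path, n in sorted(only_in_first.items()):
    print(f"only in first file: {path} (x{n})")
for path, n in sorted(only_in_second.items()):
    print(f"only in second file: {path} (x{n})")
```

This only compares structure (which elements exist where); attribute values and text would need a separate check, such as the recursive comparison shown in the earlier answer.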