Comparing two XML files in Python

Question

Asked by sankar

I am new to programming in Python, and I am having some trouble understanding the concepts. I want to compare two XML files, and they are quite large. Here is an example of the kind of files I want to compare.

xmlfile1:

<xml>
    <property1>
          <property2>    
               <property3>

               </property3>
          </property2>    
    </property1>    
</xml>

xmlfile2:

<xml>
    <property1>
        <property2>    
            <property3> 
                <property4>

                </property4>    
            </property3>
        </property2>    
    </property1>

</xml>

The names property1 and property2 are placeholders; the real files use different names and contain many more properties. I want to compare the two XML files.

I am using the lxml parser to try to compare the two files and print out the differences between them.

I do not know how to parse the files and compare them automatically.

I tried reading the lxml documentation, but I could not work out how to apply it to my problem.

Can someone please tell me how I should proceed?

Code snippets would be very useful.

One more question: am I on the right track, or am I missing something? Please point me to any relevant concepts you know about.

Answer 1

Answered by Nick Bastin

This is actually a reasonably challenging problem, because what counts as a "difference" is often in the eye of the beholder: there will be semantically "equivalent" information that you probably don't want marked as a difference.

You could try using xmldiff, which is based on the work in the paper Change Detection in Hierarchically Structured Information.
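To make the point about "equivalent" information concrete, here is a minimal sketch that does not use xmldiff at all. It swaps in a simpler technique: normalize both documents with the standard library (canonical attribute order, whitespace stripped) and then run a plain line diff. The helper names `normalized_lines` and `diff_xml` are made up for this illustration, and `ET.indent` assumes Python 3.9 or later.

```python
import difflib
import xml.etree.ElementTree as ET


def normalized_lines(xml_text):
    """Return a canonical, pretty-printed form of xml_text as a list of lines."""
    # C14N sorts attributes; strip_text drops whitespace around text content.
    canonical = ET.canonicalize(xml_text, strip_text=True)
    root = ET.fromstring(canonical)
    ET.indent(root)  # puts each child element on its own line (Python 3.9+)
    return ET.tostring(root, encoding="unicode").splitlines()


def diff_xml(first, second):
    """Return unified-diff lines between two XML strings after normalization."""
    return list(difflib.unified_diff(
        normalized_lines(first), normalized_lines(second),
        fromfile="first", tofile="second", lineterm=""))


if __name__ == "__main__":
    # Attribute order and indentation differ, but only <from> really changed.
    a = '<note b="2" a="1"><to>Tove</to><from>Jani</from></note>'
    b = '<note a="1" b="2">\n  <to>Tove</to>\n  <from>Daniel</from>\n</note>'
    print("\n".join(diff_xml(a, b)))
```

A line diff like this ignores attribute order and indentation, but it still treats reordered or moved elements as changes, which is the kind of case a tree-aware tool such as xmldiff is meant to handle.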

Answer 2

Answered by danimirror

My approach was to parse each XML document with xml.etree.ElementTree and walk through the two trees level by level. I also added the ability to ignore a list of attribute names (or child tags) during the comparison.

The first block of code holds the class used:

import xml.etree.ElementTree as ET
import logging

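class XmlTree:

    def __init__(self):
        # Send every detected difference to a log file at DEBUG level.
        self.logger = logging.getLogger('xml-comparison')
        self.logger.setLevel(logging.DEBUG)
        self.hdlr = logging.FileHandler('xml-comparison.log')
        self.formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
        self.hdlr.setFormatter(self.formatter)
        self.logger.addHandler(self.hdlr)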

    @staticmethod
    def convert_string_to_tree( xmlString):

        return ET.fromstring(xmlString)

    def xml_compare(self, x1, x2, excludes=()):
        """
        Compares two xml etrees
        :param x1: the first tree
        :param x2: the second tree
        :param excludes: list of string of attributes to exclude from comparison
        :return:
            True if both files match
        """

        if x1.tag != x2.tag:
            self.logger.debug('Tags do not match: %s and %s' % (x1.tag, x2.tag))
            return False
        for name, value in x1.attrib.items():
            if not name in excludes:
                if x2.attrib.get(name) != value:
                    self.logger.debug('Attributes do not match: %s=%r, %s=%r'
                                 % (name, value, name, x2.attrib.get(name)))
                    return False
        for name in x2.attrib.keys():
            if not name in excludes:
                if name not in x1.attrib:
                    self.logger.debug('x2 has an attribute x1 is missing: %s'
                                 % name)
                    return False
        if not self.text_compare(x1.text, x2.text):
            self.logger.debug('text: %r != %r' % (x1.text, x2.text))
            return False
        if not self.text_compare(x1.tail, x2.tail):
            self.logger.debug('tail: %r != %r' % (x1.tail, x2.tail))
            return False
        # Element.getchildren() was removed in Python 3.9; list() works everywhere.
        cl1 = list(x1)
        cl2 = list(x2)
        if len(cl1) != len(cl2):
            self.logger.debug('children length differs, %i != %i'
                         % (len(cl1), len(cl2)))
            return False
        i = 0
        for c1, c2 in zip(cl1, cl2):
            i += 1
            if not c1.tag in excludes:
                if not self.xml_compare(c1, c2, excludes):
                    self.logger.debug('children %i do not match: %s'
                                 % (i, c1.tag))
                    return False
        return True

    def text_compare(self, t1, t2):
        """
        Compare two text strings
        :param t1: text one
        :param t2: text two
        :return:
            True if a match
        """
        if not t1 and not t2:
            return True
        if t1 == '*' or t2 == '*':
            return True
        return (t1 or '').strip() == (t2 or '').strip()

The second block of code holds a couple of XML examples and their comparison:

xml1 = "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"

xml2 = "<note><to>Tove</to><from>Daniel</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"

tree1 = XmlTree.convert_string_to_tree(xml1)
tree2 = XmlTree.convert_string_to_tree(xml2)

comparator = XmlTree()

if comparator.xml_compare(tree1, tree2, ["from"]):
    print("XMLs match")
else:
    print("XMLs don't match")

Most of the credit for this code goes to syawar.
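Since the question asks to print the differences, a variation on the comparison above may be more useful than a single True/False result. The following is only a sketch under stated assumptions: the generator name `iter_differences` and the message wording are hypothetical, children are paired by position, and tail text is ignored.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest


def iter_differences(e1, e2, path=""):
    """Yield human-readable differences between two elements, depth-first."""
    path = f"{path}/{e1.tag}"
    if e1.tag != e2.tag:
        yield f"{path}: tag differs ({e1.tag!r} vs {e2.tag!r})"
        return
    if e1.attrib != e2.attrib:
        yield f"{path}: attributes differ ({e1.attrib!r} vs {e2.attrib!r})"
    if (e1.text or "").strip() != (e2.text or "").strip():
        yield f"{path}: text differs ({e1.text!r} vs {e2.text!r})"
    # Pair children by position; zip_longest pads the shorter side with None.
    for c1, c2 in zip_longest(e1, e2):
        if c1 is None:
            yield f"{path}: second file has extra child <{c2.tag}>"
        elif c2 is None:
            yield f"{path}: first file has extra child <{c1.tag}>"
        else:
            yield from iter_differences(c1, c2, path)


if __name__ == "__main__":
    # The two example documents from the question, written on one line each.
    xml1 = ("<xml><property1><property2><property3></property3>"
            "</property2></property1></xml>")
    xml2 = ("<xml><property1><property2><property3><property4></property4>"
            "</property3></property2></property1></xml>")
    for line in iter_differences(ET.fromstring(xml1), ET.fromstring(xml2)):
        print(line)
```

Because children are matched by position, an element inserted in the middle of a long list makes every later sibling look different; that trade-off keeps the sketch short.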

Answer 3

Answered by user3483524

Another script using xml.etree. It's awful, but it works :)

#!/usr/bin/env python

import sys
import xml.etree.ElementTree as ET

from termcolor import colored

tree1 = ET.parse(sys.argv[1])
root1 = tree1.getroot()

tree2 = ET.parse(sys.argv[2])
root2 = tree2.getroot()

class Element:
    def __init__(self, e):
        self.name = e.tag
        self.subs = {}
        self.atts = {}
        # Children are keyed by tag, so repeated sibling tags keep only the last one.
        for child in e:
            self.subs[child.tag] = Element(child)

        for att in e.attrib.keys():
            self.atts[att] = e.attrib[att]

        print("name: %s, len(subs) = %d, len(atts) = %d" % (self.name, len(self.subs), len(self.atts)))

    def compare(self, el):
        if self.name != el.name:
            raise RuntimeError("Two names are not the same")
        print("----------------------------------------------------------------")
        print(self.name)
        print("----------------------------------------------------------------")
        for att in self.atts.keys():
            v1 = self.atts[att]
            if att not in el.atts.keys():
                v2 = '[NA]'
                color = 'yellow'
            else:
                v2 = el.atts[att]
                if v2 == v1:
                    color = 'green'
                else:
                    color = 'red'
            print(colored("first:\t%s = %s" % (att, v1), color))
            print(colored("second:\t%s = %s" % (att, v2), color))

        for subName in self.subs.keys():
            if subName not in el.subs.keys():
                # termcolor has no 'purple'; 'magenta' is the closest supported color.
                print(colored("first:\thas got %s" % subName, 'magenta'))
                print(colored("second:\thasn't got %s" % subName, 'magenta'))
            else:
                self.subs[subName].compare(el.subs[subName])



e1 = Element(root1)
e2 = Element(root2)

e1.compare(e2)
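
One caveat with the script above: it stores children in a dict keyed by tag, so when an element has several children with the same tag, only the last one is kept and compared. A possible workaround, shown here only as a sketch with a hypothetical helper name `children_by_tag`, is to group the children into lists per tag:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def children_by_tag(element):
    """Group the direct children of element into lists keyed by tag."""
    groups = defaultdict(list)
    for child in element:
        groups[child.tag].append(child)
    return dict(groups)


if __name__ == "__main__":
    root = ET.fromstring("<a><b id='1'/><b id='2'/><c/></a>")
    for tag, items in children_by_tag(root).items():
        print(tag, len(items))
```

The lists for each tag could then be compared pairwise by position, so that repeated elements are not silently skipped.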
