C#: Efficient algorithm for comparing XML nodes
Original question: http://stackoverflow.com/questions/343667/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Dirk Vollmar
I want to determine whether two different child nodes within an XML document are equal or not. Two nodes should be considered equal if they have the same set of attributes and child nodes and all child nodes are equal, too (i.e. the whole subtree should be equal).
The input document might be very large (up to 60 MB, more than 100,000 nodes to compare) and performance is an issue.
What would be an efficient way to check for the equality of two nodes?
Example:
<w:p>
  <w:pPr>
    <w:spacing w:after="120"/>
  </w:pPr>
  <w:r>
    <w:t>Hello</w:t>
  </w:r>
</w:p>
<w:p>
  <w:pPr>
    <w:spacing w:after="240"/>
  </w:pPr>
  <w:r>
    <w:t>World</w:t>
  </w:r>
</w:p>
This XML snippet describes paragraphs in an OpenXML document. The algorithm would be used to determine whether a document contains a paragraph (w:p node) with the same properties (w:pPr node) as another paragraph earlier in the document.
One idea I have would be to store the nodes' outer XML in a hash set. (Normally I would first have to get a canonical string representation in which attributes and child nodes are always sorted in the same way, but I can expect my nodes to be in such a form already.)
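The outer-XML idea can be sketched roughly as follows. This is only an illustration, assuming the nodes are already in a canonical form as described above; the class and method names (OuterXmlDuplicates, FindDuplicates) are made up for this sketch. A Dictionary is used instead of a hash set so that it stays within what .NET 2.0 offers.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Xml;

static class OuterXmlDuplicates
{
    // Returns every node whose OuterXml matches that of an earlier node
    // in the sequence. Accepts a non-generic IEnumerable so that an
    // XmlNodeList can be passed in directly.
    public static List<XmlNode> FindDuplicates(IEnumerable nodes)
    {
        Dictionary<string, XmlNode> seen = new Dictionary<string, XmlNode>();
        List<XmlNode> duplicates = new List<XmlNode>();
        foreach (XmlNode node in nodes)
        {
            // Assumption: the serialized form is already canonical, so
            // equal subtrees always produce the same string.
            string key = node.OuterXml;
            if (seen.ContainsKey(key))
                duplicates.Add(node);
            else
                seen.Add(key, node);
        }
        return duplicates;
    }
}
```

Note that this keeps one string per distinct node in memory, which may matter for a 60 MB document.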
Another idea would be to create an XmlNode object for each node and write a comparer which compares all attributes and child nodes.
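A comparer along these lines might look like the sketch below. It is not the implementation the question later refers to; the name XmlNodeDeepComparer and the choice to ignore attribute order while keeping child order significant are assumptions made for illustration. It is intended for element subtrees only.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical comparer: attribute order is ignored, child order matters,
// and namespace prefixes are ignored in favour of namespace URIs.
sealed class XmlNodeDeepComparer : IEqualityComparer<XmlNode>
{
    public bool Equals(XmlNode x, XmlNode y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        if (x.NodeType != y.NodeType || x.LocalName != y.LocalName
            || x.NamespaceURI != y.NamespaceURI)
            return false;

        // Non-element nodes (text, CDATA, comments, ...) are compared by value.
        if (x.NodeType != XmlNodeType.Element) return x.Value == y.Value;

        // Same number of attributes, and each one has a match by name and value.
        if (x.Attributes.Count != y.Attributes.Count) return false;
        foreach (XmlAttribute a in x.Attributes)
        {
            XmlAttribute b = y.Attributes[a.LocalName, a.NamespaceURI];
            if (b == null || a.Value != b.Value) return false;
        }

        // Walk both child lists in parallel and compare pairwise.
        XmlNode cx = x.FirstChild, cy = y.FirstChild;
        while (cx != null && cy != null)
        {
            if (!Equals(cx, cy)) return false;
            cx = cx.NextSibling;
            cy = cy.NextSibling;
        }
        return cx == null && cy == null;
    }

    public int GetHashCode(XmlNode node)
    {
        if (node == null) return 0;
        unchecked
        {
            int hash = 17 * 31 + node.LocalName.GetHashCode();
            hash = hash * 31 + node.NamespaceURI.GetHashCode();
            if (node.NodeType != XmlNodeType.Element)
                return hash * 31 + (node.Value ?? "").GetHashCode();

            // XOR makes the attribute contribution independent of order.
            int attrHash = 0;
            foreach (XmlAttribute a in node.Attributes)
                attrHash ^= (a.LocalName + "|" + a.NamespaceURI + "|" + a.Value).GetHashCode();
            hash = hash * 31 + attrHash;

            // Children are combined in document order.
            for (XmlNode c = node.FirstChild; c != null; c = c.NextSibling)
                hash = hash * 31 + GetHashCode(c);
            return hash;
        }
    }
}
```

Such a comparer could be passed to a Dictionary<XmlNode, XmlNode> constructor, so that lookups use the hash first and only fall back to the full comparison when hashes match.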
My environment is C# (.NET 2.0); any feedback and further ideas are very welcome. Maybe somebody already has a good solution?
EDIT: Microsoft's XmlDiff API can actually do that but I was wondering whether there would be a more lightweight approach. XmlDiff seems to always produce a diffgram and to always produce a canonical node representation first, both things which I don't need.
EDIT2: I finally implemented my own XmlNodeEqualityComparer based on the suggestion made here. Thanks a lot!!!!
Thanks, divo
Accepted answer by Dave R.
I'd recommend against rolling your own hash creation function and instead relying on the built-in XNodeEqualityComparer's GetHashCode method. This is guaranteed to take account of attributes and descendant nodes when creating the result and could save you some time too.
Your code would look like the following:
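using System.Collections.Generic;
using System.Xml.Linq;

XNodeEqualityComparer comparer = new XNodeEqualityComparer();
XDocument doc = XDocument.Load("XmlFile1.xml");
Dictionary<int, XNode> nodeDictionary = new Dictionary<int, XNode>();

foreach (XNode node in doc.Elements("doc").Elements("node"))
{
    int hash = comparer.GetHashCode(node);
    if (nodeDictionary.ContainsKey(hash))
    {
        // A node with the same hash has been seen before. Equal hashes do
        // not strictly guarantee equal nodes, so confirm with
        // comparer.Equals(nodeDictionary[hash], node) if that matters.
        // A duplicate has been found. Execute your logic here
        // ...
    }
    else
    {
        nodeDictionary.Add(hash, node);
    }
}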
My XmlFile1.xml is:
<?xml version="1.0" encoding="utf-8" ?>
<doc>
  <node att="A">Blah</node>
  <node att="A">Blah</node>
  <node att="B">
    <inner>Innertext</inner>
  </node>
  <node>Blah</node>
  <node att="B">
    <inner>Different</inner>
  </node>
</doc>
nodeDictionary will end up containing a unique collection of nodes and their hashes. Duplicates are detected by using the Dictionary's ContainsKey method, passing in the hash of the node, which we generate using the XNodeEqualityComparer's GetHashCode method.
I think this should be fast enough for your needs.
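A possible variation on the code in this answer, sketched here as an assumption rather than as part of the original answer: passing the comparer to a HashSet<XNode> lets the collection use GetHashCode for bucketing and Equals for confirmation, so two different nodes that happen to share a hash are not reported as duplicates. The names DuplicateNodes and Find are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

static class DuplicateNodes
{
    // Returns the elements that are deep-equal to an earlier element
    // in the sequence.
    public static List<XElement> Find(IEnumerable<XElement> elements)
    {
        // The set calls comparer.GetHashCode to pick a bucket and
        // comparer.Equals to confirm a match within that bucket.
        HashSet<XNode> seen = new HashSet<XNode>(new XNodeEqualityComparer());
        List<XElement> duplicates = new List<XElement>();
        foreach (XElement element in elements)
        {
            // Add returns false when an equal node is already in the set.
            if (!seen.Add(element))
                duplicates.Add(element);
        }
        return duplicates;
    }
}
```

For the sample file in this answer it could be called as DuplicateNodes.Find(doc.Root.Elements("node")).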
Answer by PW.
Answer by Tomalak
What about this approach:
For all <w:pPr> nodes in the document (I suppose there is not more than one per <w:p>), concatenate all relevant data (element names, attributes, values) into a string:
// string format is really irrelevant, so this is just a bogus example
'!w:keep-with-next@value="true"!w:spacing@w:before="10"@w:after="120"'
Do so in alphabetical order, to account for varying document order.
Build a collection using these strings as the key and the reference to the respective <w:p> node as the value.
In the process of doing this, when you find that a given key already exists in the collection, you have found a paragraph with the same properties. Use a list of nodes as the collection value if you want to keep collecting.
I can't say how well this would perform, but I guess it is not too hard to implement and find out.
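The approach described in this answer could be sketched as below, using LINQ to XML. The key format, the names PropertyKeys, BuildKey and GroupByProperties, and the decision to skip namespace declarations are all assumptions made for illustration; values containing the separator characters are not escaped, so this is not a robust canonical form.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;

static class PropertyKeys
{
    // Builds a key from element names, attributes and text. Attributes and
    // child elements are sorted so that document order does not affect it.
    public static string BuildKey(XElement element)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append('!').Append(element.Name);

        IEnumerable<XAttribute> sortedAttributes = element.Attributes()
            .Where(attr => !attr.IsNamespaceDeclaration)
            .OrderBy(attr => attr.Name.ToString(), StringComparer.Ordinal);
        foreach (XAttribute a in sortedAttributes)
            sb.Append('@').Append(a.Name).Append("=\"").Append(a.Value).Append('"');

        // Text is only recorded for leaf elements.
        if (!element.HasElements)
            sb.Append('#').Append(element.Value);

        IEnumerable<string> sortedChildKeys = element.Elements()
            .Select(child => BuildKey(child))
            .OrderBy(key => key, StringComparer.Ordinal);
        foreach (string childKey in sortedChildKeys)
            sb.Append(childKey);

        return sb.Append(';').ToString();
    }

    // Groups paragraphs by the key of their properties element, keeping a
    // list per key so that all paragraphs with equal properties are collected.
    public static Dictionary<string, List<XElement>> GroupByProperties(
        IEnumerable<XElement> paragraphs, XName propertiesName)
    {
        Dictionary<string, List<XElement>> groups = new Dictionary<string, List<XElement>>();
        foreach (XElement paragraph in paragraphs)
        {
            XElement props = paragraph.Element(propertiesName);
            string groupKey = props == null ? "" : BuildKey(props);
            List<XElement> list;
            if (!groups.TryGetValue(groupKey, out list))
            {
                list = new List<XElement>();
                groups.Add(groupKey, list);
            }
            list.Add(paragraph);
        }
        return groups;
    }
}
```

Any group whose list holds more than one paragraph then contains paragraphs with the same properties.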
Answer by Dimitre Novatchev
It is very challenging even to define correctly the problem of "When are two XML documents equal?"
There are many reasons for this:
- An XML document is a tree that may have different textual representations.
- Whitespace-only nodes may or may not be considered in a comparison
- Comment nodes may or may not be considered in a comparison
- PI nodes may or may not be considered in a comparison
- Lexical differences: or
- Different prefixes may be associated with the same namespace in the two documents
- A namespace node may be shown as defined on a node of doc1 and as not defined but inherited from the parent of the corresponding node in doc2
- Quotes may be used around an attribute in doc1 but apostrophes may be used in doc2
- Entities may be used in doc1 but they may be pre-expanded in doc2
- The two documents may have different but semantically equivalent DTDs
- Etc.
Therefore it seems naive and unrealistic to try to produce a correct implementation of a function for equality comparison of two XML documents.
My recommendation is to use the deep-equal() function with a compliant XPath 2.0 engine.
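The built-in XPath support in System.Xml implements XPath 1.0, so deep-equal() would need a third-party engine. As a rough illustration of the underlying point only (comparing parsed trees rather than text), the sketch below uses LINQ to XML's XNode.DeepEquals. This is a different function from deep-equal() with its own rules, and the names LexicalDifferences and SameTree are hypothetical.

```csharp
using System;
using System.Xml.Linq;

static class LexicalDifferences
{
    // Parses both strings and compares the resulting trees, not the text.
    // XElement.Parse without LoadOptions.PreserveWhitespace drops
    // insignificant whitespace between elements.
    public static bool SameTree(string xml1, string xml2)
    {
        return XNode.DeepEquals(XElement.Parse(xml1), XElement.Parse(xml2));
    }
}
```

With this, two fragments that differ only in the quote character used around an attribute value, or in indentation between elements, compare as the same tree, while a differing attribute value does not.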
Answer by ICR
Here is a hash function I have knocked up that attempts to solve a part of your problem. Note that I have very little experience writing hash functions, and have included it mainly to get feedback from people as to its effectiveness in solving this particular problem. I would not recommend its use in production.
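static int HashXElement(XElement elem)
{
    int hash = 23;

    // XOR keeps the attribute contribution independent of attribute order.
    foreach (XAttribute attrib in elem.Attributes())
    {
        int attribHash = 23;
        attribHash = attribHash * 37 + attrib.Name.GetHashCode();
        attribHash = attribHash * 37 + attrib.Value.GetHashCode();
        hash = hash ^ attribHash;
    }

    // Recurse into direct children only; the recursion itself reaches
    // deeper descendants, and the running multiply keeps child order significant.
    foreach (XElement subElem in elem.Elements())
    {
        hash = hash * 37 + HashXElement(subElem);
    }

    hash = hash * 37 + elem.Value.GetHashCode();
    return hash;
}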
The idea was to make the ordering of subnodes significant, but the ordering of attributes not significant.