Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/676274/

Date: 2020-08-04 13:01:09  Source: igfitidea

What is the best way to parse (big) XML in C# Code?

Tags: c#, xml, xml-serialization, xmlreader

Asked by corlettk

I'm writing a GIS client tool in C# to retrieve "features" in a GML-based XML schema (sample below) from a server. Extracts are limited to 100,000 features.

I guesstimate that the largest extract.xml might get up around 150 megabytes, so obviously DOM parsers are out. I've been trying to decide between XmlSerializer and XSD.EXE generated bindings --OR-- XmlReader and a hand-crafted object graph.

Or maybe there's a better way which I haven't considered yet? Like XLINQ, or ????

Please can anybody guide me? Especially with regards to the memory efficiency of any given approach. If not I'll have to "prototype" both solutions and profile them side-by-side.

I'm a bit of a raw prawn in .NET. Any guidance would be greatly appreciated.

Thanking you. Keith.



Sample XML: up to 100,000 of them, with up to 234,600 coords per feature.

<feature featId="27168306" fType="vegetation" fTypeId="1129" fClass="vegetation" gType="Polygon" ID="0" cLockNr="51598" metadataId="51599" mdFileId="NRM/TIS/VEGETATION/9543_22_v3" dataScale="25000">
  <MultiGeometry>
    <geometryMember>
      <Polygon>
        <outerBoundaryIs>
          <LinearRing>
            <coordinates>153.505004,-27.42196 153.505044,-27.422015 153.503992 .... 172 coordinates omitted to save space ... 153.505004,-27.42196</coordinates>
          </LinearRing>
        </outerBoundaryIs>
      </Polygon>
    </geometryMember>
  </MultiGeometry>
</feature>

Accepted answer by Mitch Wheat

Use XmlReader to parse large XML documents. XmlReader provides fast, forward-only, non-cached access to XML data. (Forward-only means you can read the XML file from beginning to end but cannot move backwards in the file.) XmlReader uses small amounts of memory, and is equivalent to using a simple SAX reader.

    using (XmlReader myReader = XmlReader.Create(@"c:\data\coords.xml"))
    {
        while (myReader.Read())
        {
           // Process each node (myReader.Value) here
           // ...
        }
    }

You can use XmlReader to process files that are up to 2 gigabytes (GB) in size.
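
As a slightly fuller sketch of the same loop, the following counts the feature elements from the question's sample and prints two of their attributes, without holding more than the current node in memory. The class name FeatureScan is made up for illustration, and so is the <extract> wrapper element in the usage note, since the real root element name is not shown in the question.

```csharp
using System;
using System.IO;
using System.Xml;

public static class FeatureScan
{
    // Streams the document once and returns the number of <feature> elements seen.
    public static int CountFeatures(TextReader input)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                // Only start tags named "feature" are of interest here.
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "feature")
                {
                    count++;
                    Console.WriteLine("{0} ({1})",
                        reader.GetAttribute("featId"),
                        reader.GetAttribute("fType"));
                }
            }
        }
        return count;
    }
}
```

For a file on disk it could be called as FeatureScan.CountFeatures(File.OpenText(path)); for a quick check, a StringReader over something like <extract><feature featId="1" fType="vegetation"/></extract> works too.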

Ref: How to read XML from a file by using Visual C#

Answer by Andy White

A SAX parser might be what you're looking for. SAX does not require you to read the entire document into memory - it parses through it incrementally and allows you to process the elements as you go. I don't know if there is a SAX parser provided in .NET, but there are a few open-source options that you could look at:

Here's a related post:

Answer by corlettk

Just to summarise, and make the answer a bit more obvious for anyone who finds this thread on Google.

Prior to .NET 2 the XmlTextReader was the most memory efficient XML parser available in the standard API (thanx Mitch;-)

.NET 2 introduced the XmlReader class, which is better again. It's a forward-only element iterator (a bit like a StAX parser). (thanx Cerebrus;-)

And remember kiddies, if any XML instance has the potential to be bigger than about 500k, DON'T USE DOM!
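
If a bare streaming reader feels too low-level, one possible middle ground (the question asks about XLINQ) is to let XmlReader walk the document and materialise only one feature at a time as an XElement via XNode.ReadFrom. This is a sketch rather than anything taken from the answers here; FeatureStream is an assumed name, and the element name "feature" comes from the question's sample.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class FeatureStream
{
    // Lazily yields each <feature> element as a small, self-contained XElement.
    public static IEnumerable<XElement> Features(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "feature")
                {
                    // ReadFrom consumes the whole element and leaves the reader
                    // on the node after it, so no extra Read() is needed here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

Only the feature currently being enumerated lives in memory as a tree, so the usual LINQ to XML queries can be applied per feature without loading the whole extract.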

Cheers all. Keith.

Answer by corlettk

As at 14 May 2009: I've switched to using a hybrid approach... see code below.

This version has most of the advantages of both:
  * the XmlReader/XmlTextReader (memory efficiency --> speed); and
  * the XmlSerializer (code-gen --> development expediency and flexibility).

It uses the XmlTextReader to iterate through the document, and creates "doclets" which it deserializes using the XmlSerializer and "XML binding" classes generated with XSD.EXE.

I guess this recipe is universally applicable, and it's fast... I'm parsing a 201 MB XML Document containing 56,000 GML Features in about 7 seconds... the old VB6 implementation of this application took minutes (or even hours) to parse large extracts... so I'm lookin' good to go.

Once again, a BIG Thank You to the forumites for donating your valuable time. I really appreciate it.

Cheers all. Keith.

using System;
using System.Reflection;
using System.Xml;
using System.Xml.Serialization;
using System.IO;
using System.Collections.Generic;

using nrw_rime_extract.utils;
using nrw_rime_extract.xml.generated_bindings;

namespace nrw_rime_extract.xml
{
    internal interface ExtractXmlReader
    {
        rimeType read(string xmlFilename);
    }

    /// <summary>
    /// RimeExtractXml provides bindings to the RIME Extract XML as defined by
    /// $/Release 2.7/Documentation/Technical/SCHEMA and DTDs/nrw-rime-extract.xsd
    /// </summary>
    internal class ExtractXmlReader_XmlSerializerImpl : ExtractXmlReader
    {
        private Log log = Log.getInstance();

        public rimeType read(string xmlFilename)
        {
            log.write(
                string.Format(
                    "DEBUG: ExtractXmlReader_XmlSerializerImpl.read({0})",
                    xmlFilename));
            // open read-only; the two-argument constructor asks for read/write access
            using (Stream stream = new FileStream(xmlFilename, FileMode.Open, FileAccess.Read))
            {
                return read(stream);
            }
        }

        internal rimeType read(Stream xmlInputStream)
        {
            // create an instance of the XmlSerializer class, 
            // specifying the type of object to be deserialized.
            XmlSerializer serializer = new XmlSerializer(typeof(rimeType));
            serializer.UnknownNode += new XmlNodeEventHandler(handleUnknownNode);
            serializer.UnknownAttribute += 
                new XmlAttributeEventHandler(handleUnknownAttribute);
            // use the Deserialize method to restore the object's state
            // with data from the XML document.
            return (rimeType)serializer.Deserialize(xmlInputStream);
        }

        protected void handleUnknownNode(object sender, XmlNodeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Node at line {0} position {1} : {2}\t{3}",
                    e.LineNumber, e.LinePosition, e.Name, e.Text));
        }

        protected void handleUnknownAttribute(object sender, XmlAttributeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Attribute at line {0} position {1} : {2}='{3}'",
                    e.LineNumber, e.LinePosition, e.Attr.Name, e.Attr.Value));
        }

    }

    /// <summary>
    /// ExtractXmlReader provides bindings to the extract.xml 
    /// returned by the RIME server; as defined by:
    ///   $/Release X/Documentation/Technical/SCHEMA and 
    /// DTDs/nrw-rime-extract.xsd
    /// </summary>
    internal class ExtractXmlReader_XmlTextReaderXmlSerializerHybridImpl :
        ExtractXmlReader
    {
        private Log log = Log.getInstance();

        public rimeType read(string xmlFilename)
        {
            log.write(
                string.Format(
                    "DEBUG: ExtractXmlReader_XmlTextReaderXmlSerializerHybridImpl." +
                    "read({0})",
                    xmlFilename));

            using (XmlReader reader = XmlReader.Create(xmlFilename))
            {
                return read(reader);
            }

        }

        public rimeType read(XmlReader reader)
        {
            rimeType result = new rimeType();
            // a deserializer for featureClass, feature, etc, "doclets"
            Dictionary<Type, XmlSerializer> serializers = 
                new Dictionary<Type, XmlSerializer>();
            serializers.Add(typeof(featureClassType), 
                newSerializer(typeof(featureClassType)));
            serializers.Add(typeof(featureType), 
                newSerializer(typeof(featureType)));

            List<featureClassType> featureClasses = new List<featureClassType>();
            List<featureType> features = new List<featureType>();
            while (!reader.EOF)
            {
                if (reader.MoveToContent() != XmlNodeType.Element)
                {
                    reader.Read(); // skip non-element-nodes and unknown-elements.
                    continue;
                }

                // deserialize each featureClass element as a stand-alone "doclet".
                if (reader.Name.Equals("featureClass"))
                {
                    using (
                        StringReader elementReader =
                            new StringReader(reader.ReadOuterXml()))
                    {
                        XmlSerializer deserializer =
                            serializers[typeof (featureClassType)];
                        featureClasses.Add(
                            (featureClassType)
                            deserializer.Deserialize(elementReader));
                    }
                    continue;
                    // ReadOuterXml advances the reader, so don't read again.
                }

                if (reader.Name.Equals("feature"))
                {
                    using (
                        StringReader elementReader =
                            new StringReader(reader.ReadOuterXml()))
                    {
                        XmlSerializer deserializer =
                            serializers[typeof (featureType)];
                        features.Add(
                            (featureType)
                            deserializer.Deserialize(elementReader));
                    }
                    continue;
                    // ReadOuterXml advances the reader, so don't read again.
                }

                log.write(
                    "WARNING: unknown element '" + reader.Name +
                    "' was skipped during parsing.");
                reader.Read(); // skip non-element-nodes and unknown-elements.
            }
            result.featureClasses = featureClasses.ToArray();
            result.features = features.ToArray();
            return result;
        }

        private XmlSerializer newSerializer(Type elementType)
        {
            XmlSerializer serializer = new XmlSerializer(elementType);
            serializer.UnknownNode += new XmlNodeEventHandler(handleUnknownNode);
            serializer.UnknownAttribute += 
                new XmlAttributeEventHandler(handleUnknownAttribute);
            return serializer;
        }

        protected void handleUnknownNode(object sender, XmlNodeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Node at line {0} position {1} : {2}\t{3}",
                    e.LineNumber, e.LinePosition, e.Name, e.Text));
        }

        protected void handleUnknownAttribute(object sender, XmlAttributeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Attribute at line {0} position {1} : {2}='{3}'",
                    e.LineNumber, e.LinePosition, e.Attr.Name, e.Attr.Value));
        }
    }
}
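
One possible variation on the listing above, sketched under assumptions: instead of copying each element into a string with ReadOuterXml, hand the XmlSerializer a sub-reader from XmlReader.ReadSubtree, so each "doclet" is deserialized straight off the stream. The Feature class below is a cut-down, hypothetical stand-in for the XSD.EXE-generated featureType, and HybridReader is an invented name. Whether this is actually faster on a real extract is untested here.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical, cut-down binding class; the real one would come from XSD.EXE.
[XmlRoot("feature")]
public class Feature
{
    [XmlAttribute("featId")] public string FeatId { get; set; }
    [XmlAttribute("fType")] public string FType { get; set; }
}

public static class HybridReader
{
    public static List<Feature> ReadFeatures(TextReader input)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Feature));
        List<Feature> features = new List<Feature>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            // ReadToFollowing advances to the next <feature> start tag, if any.
            while (reader.ReadToFollowing("feature"))
            {
                // The sub-reader sees only this one element and its children.
                using (XmlReader doclet = reader.ReadSubtree())
                {
                    features.Add((Feature)serializer.Deserialize(doclet));
                }
                // Disposing the sub-reader leaves the outer reader at the end
                // of this element, ready for the next ReadToFollowing call.
            }
        }
        return features;
    }
}
```

Child elements that the cut-down class does not declare (MultiGeometry and so on) are simply ignored by the serializer in this sketch; the generated bindings would pick them up.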

Answer by Michael Logutov

Just wanted to add this simple extension method as an example of using XmlReader (as Mitch answered):

public static bool SkipToElement (this XmlReader xmlReader, string elementName)
{
    if (!xmlReader.Read ())
        return false;

    while (!xmlReader.EOF)
    {
        if (xmlReader.NodeType == XmlNodeType.Element && xmlReader.Name == elementName)
            return true;

        xmlReader.Skip ();
    }

    return false;
}

And usage:

using (var xml_reader = XmlReader.Create (this.source.Url))
{
    if (!xml_reader.SkipToElement ("Root"))
        throw new InvalidOperationException ("XML element \"Root\" was not found.");

    if (!xml_reader.SkipToElement ("Users"))
        throw new InvalidOperationException ("XML element \"Root/Users\" was not found.");

    ...
}
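
To make the behaviour concrete, here is a self-contained sketch that wraps the extension method in a static class and runs it on an in-memory document. XmlReaderExtensions, SkipDemo and the sample XML are assumed names, not part of the answer. Note that each call first reads one node forward and then hops over siblings with Skip, so it finds elements at that one level only, not ones nested deeper.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlReaderExtensions
{
    // Same logic as the method in the answer.
    public static bool SkipToElement(this XmlReader xmlReader, string elementName)
    {
        if (!xmlReader.Read())
            return false;

        while (!xmlReader.EOF)
        {
            if (xmlReader.NodeType == XmlNodeType.Element && xmlReader.Name == elementName)
                return true;

            xmlReader.Skip(); // jump over the current node and all of its children
        }

        return false;
    }
}

public static class SkipDemo
{
    // Returns the name attribute of the first User under Root/Users, or null.
    public static string FirstUserName(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            if (!reader.SkipToElement("Root") || !reader.SkipToElement("Users"))
                return null;

            return reader.ReadToDescendant("User") ? reader.GetAttribute("name") : null;
        }
    }
}
```

With a document where Users appears both inside another child of Root and directly under Root, the nested one is skipped along with its parent and the direct child is the one that gets matched.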