Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/407350/

Time: 2020-08-04 02:11:42  Source: igfitidea

How best to use XPath with very large XML files in .NET?

Tags: c#, .net, xml, xpath, large-files

Asked by glenatron

I need to do some processing on fairly large XML files (large here being potentially upwards of a gigabyte) in C#, including performing some complex XPath queries. The problem I have is that the standard way I would normally do this, through the System.Xml libraries, loads the whole file into memory before it does anything with it, which can cause memory problems with files of this size.

I don't need to update the files at all, just read them and query the data contained in them. Some of the XPath queries are quite involved and go across several levels of parent-child relationships. I'm not sure whether this will affect the ability to use a stream reader rather than loading the data into memory as a block.

One way I can see of making it work is to perform the simple analysis using a stream-based approach and perhaps wrap the XPath statements into XSLT transformations that I could run across the files afterward, although it seems a little convoluted.

Alternatively, I know that there are some elements that the XPath queries will not run across, so I guess I could break the document up into a series of smaller fragments based on its original tree structure, which could perhaps be small enough to process in memory without causing too much havoc.

I've tried to explain my objective here so if I'm barking up totally the wrong tree in terms of general approach I'm sure you folks can set me right...

Answered by Dirk Vollmar

Have you tried XPathDocument? This class is optimized for handling XPath queries efficiently.

If you cannot handle your input documents efficiently using XPathDocument you might consider preprocessing and/or splitting up your input documents using an XmlReader.
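
As a minimal sketch of what querying through XPathDocument looks like (the class name `XPathDocumentSketch`, the method `CountMatches`, and the sample element names are made up for illustration, not taken from the answer):

```csharp
using System.IO;
using System.Xml.XPath;

static class XPathDocumentSketch
{
    // Counts the nodes matched by an XPath expression using the
    // read-only XPathDocument store instead of XmlDocument.
    public static int CountMatches(TextReader xml, string xpath)
    {
        // Note: this still parses the whole document into memory.
        var document = new XPathDocument(xml);
        XPathNavigator navigator = document.CreateNavigator();
        XPathNodeIterator matches = navigator.Select(xpath);
        return matches.Count;
    }
}
```

For example, `XPathDocumentSketch.CountMatches(new StringReader("<orders><order/><order/></orders>"), "/orders/order")` counts the two `order` elements. Whether this fits in memory for a gigabyte-sized input would have to be measured.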

Answered by AnthonyWJones

You've outlined your choices already.

Either you need to abandon XPath and use XmlTextReader, or you need to break the document up into manageable chunks on which you can use XPath.

If you choose the latter, use XPathDocument; its read-only restriction allows better use of memory.
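
One possible way to combine the two ideas is sketched below, assuming the document consists of many repeated elements that can each be queried on their own. The names `ChunkedXPathSketch`, `QueryChunks`, and `chunkName` are hypothetical. An XmlReader walks the file, and each chunk element is loaded into its own small XPathDocument via `ReadSubtree`:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.XPath;

static class ChunkedXPathSketch
{
    // Streams through the input and evaluates the XPath expression
    // against each chunk element separately, so only one chunk is
    // held in memory at a time.
    public static List<string> QueryChunks(TextReader xml, string chunkName, string xpath)
    {
        var results = new List<string>();
        using (XmlReader reader = XmlReader.Create(xml))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.LocalName != chunkName)
                    continue;

                // The subtree reader covers only the current element and its descendants.
                using (XmlReader subtree = reader.ReadSubtree())
                {
                    var chunk = new XPathDocument(subtree);
                    foreach (XPathNavigator match in chunk.CreateNavigator().Select(xpath))
                        results.Add(match.Value);
                }
                // Closing the subtree reader leaves the outer reader at the end of the chunk.
            }
        }
        return results;
    }
}
```

The expression is evaluated with the chunk element as the document root, so a query such as `/order/item[@qty > 2]/@sku` only sees one `order` at a time. Queries that need to look across chunk boundaries would not work with this approach.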

Answered by Darin Dimitrov

To perform XPath queries with the standard .NET classes, the whole document tree needs to be loaded into memory, which might not be a good idea if it can take up to a gigabyte. IMHO XmlReader is a nice class for handling such tasks.
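
To give a rough idea of the trade-off, here is a small forward-only sketch that replaces a simple query (roughly `count(//item[@status='open'])`) with hand-written XmlReader code. The element name `item`, the attribute `status`, and the class and method names are assumptions for illustration only:

```csharp
using System.IO;
using System.Xml;

static class ForwardOnlySketch
{
    // Counts <item> elements whose status attribute equals the given value,
    // reading forward only and never building a document tree.
    public static int CountByStatus(TextReader xml, string status)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(xml))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.LocalName == "item"
                    && reader.GetAttribute("status") == status)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Memory use stays flat regardless of file size, but every condition an XPath expression would express has to be coded by hand, which gets harder as queries span several levels of the tree.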

Answered by Dimitre Novatchev

It seems that you already tried using XPathDocument and could not accommodate the parsed XML document in memory.

If this is the case, before starting to split the file (which is ultimately the right decision!) you may try using the Saxon XSLT/XQuery processor. It has a very efficient in-memory representation of a loaded XML document (the "tinytree" model). In addition, Saxon SA (the schema-aware version, which isn't free) has some streaming extensions. Read more about this here.

Answered by Donny V.

How about just reading the whole thing into a database and then working with the temp database? That might be better because your queries can then be done more efficiently using TSQL.

Answered by Fortyrunner

Gigabyte XML files! I don't envy you this task.

Is there any way that the files could be sent in a better way? E.g. are they being sent over the net to you? If they are, then a more efficient format might be better for all concerned. Reading the file into a database isn't a bad idea but it could be very time consuming indeed.

I wouldn't try and do it all in memory by reading the entire file - unless you have a 64bit OS and lots of memory. What if the file becomes 2, 3, 4GB?

One other approach could be to read in the XML file and use SAX to parse the file and write out smaller XML files according to some logical split. You could then process these with XPath. I've used XPath on 20-30MB files and it is very quick. I was originally going to use SAX but thought I would give XPath a go and was surprised how quick it was. I saved a lot of development time and probably only lost 250ms per query. I was using Java for my parsing but I suspect there would be little difference in .NET.
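
The .NET base class library does not ship a SAX parser, so the sketch below swaps in XmlReader, the pull-based equivalent, to do the splitting. The names `SplitterSketch`, `Split`, `chunkName`, and the `chunk-N.xml` file naming are all hypothetical, and the "logical split" is assumed to be one output file per repeated element:

```csharp
using System.IO;
using System.Xml;

static class SplitterSketch
{
    // Writes every element named chunkName to its own file in
    // outputDirectory and returns the number of files written.
    public static int Split(TextReader xml, string chunkName, string outputDirectory)
    {
        int written = 0;
        using (XmlReader reader = XmlReader.Create(xml))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.LocalName != chunkName)
                    continue;

                string path = Path.Combine(outputDirectory, "chunk-" + written + ".xml");
                using (XmlReader subtree = reader.ReadSubtree())
                using (XmlWriter writer = XmlWriter.Create(path))
                {
                    // Copies the whole subtree (the chunk element and its content) to the file.
                    writer.WriteNode(subtree, true);
                }
                written++;
            }
        }
        return written;
    }
}
```

Each resulting file is a well-formed document with the chunk element as its root, so it can be loaded and queried with XPath independently.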

I did read that XML::Twig (A Perl CPAN module) was written explicitly to handle SAX based XPath parsing. Can you use a different language?

This might also help https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044772.html

Answered by Ahmed Said

I think the best solution is to write your own XML parser that can read small chunks rather than the whole file, or you can split the large file into small files and use the .NET classes with these files. The problem is that you cannot parse some of the data until the whole data is available, so I recommend using your own parser rather than the .NET classes.

Answered by Richard Wolf

XPathReader is the answer. It isn't part of the C# runtime, but it is available for download from Microsoft. Here is an MSDN article.

If you construct an XPathReader with an XmlTextReader you get the efficiency of a streaming read with the convenience of XPath expressions.

I haven't used it on gigabyte sized files, but I have used it on files that are tens of megabytes, which is usually enough to slow down DOM based solutions.

Quoting from the below: "The XPathReader provides the ability to perform XPath over XML documents in a streaming manner".

Download from Microsoft

Answered by StevenzNPaul

Since in your case the data size can run into gigabytes, have you considered using ADO.NET with XML as a database? In addition, the memory footprint would not be huge.

Another approach would be using LINQ to XML with streaming elements like XStreamingElement. Hope this helps.

Answered by Paolo Falabella

http://msdn.microsoft.com/en-us/library/bb387013.aspx has a relevant example leveraging XStreamingElement.
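
A rough sketch of this streaming LINQ to XML pattern follows. It is not the code from the linked page. The helper names `StreamingLinqSketch`, `StreamElements`, and `WriteSkus`, and the `item`, `sku`, and `qty` names, are invented for illustration. Input elements are materialized one at a time with `XNode.ReadFrom`, and the output is produced lazily through XStreamingElement:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class StreamingLinqSketch
{
    // Lazily yields each element with the given name as a standalone XElement.
    public static IEnumerable<XElement> StreamElements(TextReader xml, string name)
    {
        using (XmlReader reader = XmlReader.Create(xml))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == name)
                {
                    // ReadFrom consumes the element and leaves the reader on the node after it,
                    // so no extra Read() is needed in this branch.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    // Writes a <skus> document listing the sku of every item with qty > 2.
    public static void WriteSkus(TextReader xml, TextWriter output)
    {
        var result = new XStreamingElement("skus",
            from item in StreamElements(xml, "item")
            where (int?)item.Attribute("qty") > 2
            select new XElement("sku", (string)item.Attribute("sku")));
        result.Save(output); // the query is only enumerated here, while writing
    }
}
```

Because the query is enumerated during `Save`, neither the full input tree nor the full output tree needs to exist in memory at once. The same per-element caveat applies as with any chunking approach: each `item` is processed without access to the rest of the document.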
