Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/4325823/

Time: 2020-09-06 13:24:25  Source: igfitidea

How do I split a large xml file?

xml windows

Asked by Ian Ringrose

We export “records” to an xml file; one of our customers has complained that the file is too big for their other system to process. Therefore I need to split up the file, while repeating the “header section” in each of the new files.

So I am looking for something that will let me define some xpaths for the section(s) that should always be outputted, and another xpath for the “rows” with a parameter that says how many rows to put in each file and how to name the files.

Before I start writing custom .NET code to do this: is there a standard command-line tool that runs on Windows and does this?

(As I know how to program in C#, I am more inclined to write code than to mess about with complex XSL etc., but an "off the shelf" solution would be better than custom code.)

Accepted answer by bill seacham

"is there a standard command line tool that will work on windows that does it?"

Yes. http://xponentsoftware.com/xmlSplit.aspx

Answer by Robert Rossney

There's no general-purpose solution to this, because there are so many different ways that your source XML could be structured.

It's reasonably straightforward to build an XSLT transform that will output a slice of an XML document. For instance, given this XML:

<header>
  <data rec="1"/>
  <data rec="2"/>
  <data rec="3"/>
  <data rec="4"/>
  <data rec="5"/>
  <data rec="6"/>
</header>

you can output a copy of the file containing only the data elements within a certain range with this XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:param name="startPosition"/>
  <xsl:param name="endPosition"/>

  <xsl:template match="@* | node()">
      <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
      </xsl:copy> 
  </xsl:template>

  <xsl:template match="header">
    <xsl:copy>
      <xsl:apply-templates select="data"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="data">
    <xsl:if test="position() &gt;= $startPosition and position() &lt;= $endPosition">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

(Note, by the way, that because this is based on the identity transform, it works even if header isn't the top-level element.)

You still need to count the data elements in the source XML, and run the transform repeatedly with the values of $startPosition and $endPosition that are appropriate for the situation.
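
The counting step can be sketched in a few lines of Python using only the standard library. This is an illustrative helper, not part of the original answer: the function name slice_ranges and the per_file parameter are made up here, and it assumes all data elements sit under a single header element, so that the count matches what position() sees in the stylesheet. It returns the 1-based, inclusive pairs to pass as startPosition and endPosition on each run of the transform.

```python
import xml.etree.ElementTree as ET


def slice_ranges(xml_text, element_name="data", per_file=2):
    """Return 1-based, inclusive (start, end) pairs covering every element."""
    root = ET.fromstring(xml_text)
    # Count every element with the given name anywhere in the document.
    total = sum(1 for _ in root.iter(element_name))
    return [
        (start, min(start + per_file - 1, total))
        for start in range(1, total + 1, per_file)
    ]


sample = """<header>
  <data rec="1"/>
  <data rec="2"/>
  <data rec="3"/>
  <data rec="4"/>
  <data rec="5"/>
  <data rec="6"/>
</header>"""

# One transform run per pair; how the parameters are handed to the XSLT
# processor depends on the processor you use.
for start, end in slice_ranges(sample, per_file=4):
    print(f"startPosition={start} endPosition={end}")
```

Each printed pair corresponds to one output file. For very large inputs, a streaming count would avoid loading the whole document into memory.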

Answer by ewroman

First, download the foxe XML editor from http://www.firstobject.com/foxe242.zip

Watch the video at http://www.firstobject.com/xml-splitter-script-video.htm. The video explains how the split code works.

That page contains a script (it starts with split()). Copy it, then in the XML editor create a "New Program" from the "File" menu, paste the code, and save it. The code is:

split()
{
  CMarkup xmlInput, xmlOutput;
  xmlInput.Open( "**50MB.xml**", MDF_READFILE );
  int nObjectCount = 0, nFileCount = 0;
  while ( xmlInput.FindElem("//**ACT**") )
  {
    if ( nObjectCount == 0 )
    {
      ++nFileCount;
      xmlOutput.Open( "**piece**" + nFileCount + ".xml", MDF_WRITEFILE );
      xmlOutput.AddElem( "**root**" );
      xmlOutput.IntoElem();
    }
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
    ++nObjectCount;
    if ( nObjectCount == **5** )
    {
      xmlOutput.Close();
      nObjectCount = 0;
    }
  }
  if ( nObjectCount )
    xmlOutput.Close();
  xmlInput.Close();
  return nFileCount;
}

Change the fields marked in bold (or wrapped in ** **) to suit your needs, and remove the ** markers themselves. (This is also explained on the video page.)

In the XML editor window, right-click and choose RUN (or simply press F9). The output bar in the window shows the number of files that were generated.

Note: the input file name can be a full path such as "C:\\Users\\AUser\\Desktop\\a_xml_file.xml" (with doubled backslashes), and the output file can be "C:\\Users\\AUser\\Desktop\\anoutputfolder\\piece" + nFileCount + ".xml"

Answer by loomi

As already mentioned, xml_split from the Perl package XML::Twig does a great job.

Usage

xml_split < bigFile.xml

#or if compressed e.g.
bzcat bigFile.xml.bz2 | xml_split

Without any arguments, xml_split creates a file per top-level child node.

There are parameters to specify the number of elements you want per file (-g) or the approximate size (-s <Kb|Mb|Gb>).
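
For readers without Perl at hand, the idea of grouping a fixed number of elements per output can be approximated with Python's standard library. The sketch below is not xml_split and does not reproduce its output format; the function name iter_groups and its parameters are invented for illustration. It streams the input with iterparse so the records are not all held in memory at once.

```python
import io
import xml.etree.ElementTree as ET


def iter_groups(source, record_tag, group_size):
    """Yield lists of serialized record elements, group_size at a time."""
    batch = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            # Serialize first, then clear the element to release its content.
            batch.append(ET.tostring(elem, encoding="unicode").strip())
            elem.clear()
            if len(batch) == group_size:
                yield batch
                batch = []
    if batch:  # Final, possibly shorter, group.
        yield batch


sample = io.BytesIO(
    b'<header><data rec="1"/><data rec="2"/><data rec="3"/>'
    b'<data rec="4"/><data rec="5"/><data rec="6"/></header>'
)

for number, group in enumerate(iter_groups(sample, "data", 4), start=1):
    print(f"group {number}: {len(group)} element(s)")
```

Each yielded group would still need to be wrapped in a root element before being written out as a well-formed file.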

Installation

Windows

Look here

Linux

sudo apt-get install xml-twig-tools

Answer by Gfy

xml_split - split huge XML documents into smaller chunks

http://www.perlmonks.org/index.pl?node_id=429707

http://metacpan.org/pod/XML::Twig

Answer by Oded

There is nothing built in that can handle this situation easily.

Your approach sounds reasonable, though I would probably start with a "skeleton" document containing the elements that need to be repeated and generate several documents with the "records".
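
A minimal sketch of the "skeleton" idea, written in Python for brevity rather than in the C# the question mentions. Everything specific here is an assumption made for illustration: the sample element names (export, header, customer, row), the function name split_with_skeleton, and the premise that the records are direct children of the root and can be appended after the repeated header section.

```python
import copy
import xml.etree.ElementTree as ET


def split_with_skeleton(xml_text, record_tag, per_file):
    """Return a list of XML strings, each with the header plus per_file records."""
    root = ET.fromstring(xml_text)
    records = [child for child in root if child.tag == record_tag]

    # The skeleton is the document with every record removed.
    skeleton = copy.deepcopy(root)
    for child in list(skeleton):
        if child.tag == record_tag:
            skeleton.remove(child)

    documents = []
    for start in range(0, len(records), per_file):
        doc = copy.deepcopy(skeleton)
        doc.extend([copy.deepcopy(r) for r in records[start:start + per_file]])
        documents.append(ET.tostring(doc, encoding="unicode"))
    return documents


sample = (
    "<export><header><customer>ACME</customer></header>"
    '<row id="1"/><row id="2"/><row id="3"/></export>'
)

for number, text in enumerate(split_with_skeleton(sample, "row", 2), start=1):
    # A real tool would write each string to a file such as part001.xml.
    print(f"part{number:03d}.xml: {len(text)} characters")
```

This loads the whole document into memory, so for files too large for that a streaming reader would be the more suitable choice.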

Update:

After a bit of digging, I found this article describing a way to split files using XSLT.

Answer by Steve Black

Using UltraEdit, based on https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704

All I added was some XML header and footer bits. The first and last files need to be fixed manually (or remove the root element from your source first).

    // from https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704 

var FoundsPerFile = 200;      // Global setting for number of found split strings per file.
var SplitString = "</letter>";  // String where to split. The split occurs after next character.
var xmlHead = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>';
var xmlRootStart = '<letters xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" letterCode="OA01" >';
var xmlRootEnd = '</letters>';

/* Find the tab index of the active document */
// Copied from http://www.ultraedit.com/forums/viewtopic.php?t=4571
function getActiveDocumentIndex () {
   var tabindex = -1; /* start value */

   for (var i = 0; i < UltraEdit.document.length; i++)
   {
      if (UltraEdit.activeDocument.path==UltraEdit.document[i].path) {
         tabindex = i;
         break;
      }
   }
   return tabindex;
}

if (UltraEdit.document.length) { // Is any file open?
   // Set working environment required for this job.
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   UltraEdit.activeDocument.hexOff();
   UltraEdit.ueReOn();

   // Move cursor to top of active file and run the initial search.
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=false;
   // If the string to split is not found in this file, do nothing.
   if (UltraEdit.activeDocument.findReplace.find(SplitString)) {
      // This file is probably the correct file for this script.
      var FileNumber = 1;    // Counts the number of saved files.
      var StringsFound = 1;  // Counts the number of found split strings.
      var NewFileIndex = UltraEdit.document.length;
      /* Get the path of the current file to save the new
         files in the same directory as the current file. */
      var SavePath = "";
      var LastBackSlash = UltraEdit.activeDocument.path.lastIndexOf("\\");
      if (LastBackSlash >= 0) {
         LastBackSlash++;
         SavePath = UltraEdit.activeDocument.path.substring(0,LastBackSlash);
      }
      /* Get active file index in case of more than 1 file is open and the
         current file does not get back the focus after closing the new files. */
      var FileToSplit = getActiveDocumentIndex();
      // Always use clipboard 9 for this script and not the Windows clipboard.
      UltraEdit.selectClipboard(9);
      // Split the file after every x found split strings until source file is empty.
      while (1) {
         while (StringsFound < FoundsPerFile) {
            if (UltraEdit.document[FileToSplit].findReplace.find(SplitString)) StringsFound++;
            else {
               UltraEdit.document[FileToSplit].bottom();
               break;
            }
         }
         // End the selection of the find command.
         UltraEdit.document[FileToSplit].endSelect();
         // Move the cursor right to include the next character and unselect the found string.
         UltraEdit.document[FileToSplit].key("RIGHT ARROW");
         // Select from this cursor position everything to top of the file.
         UltraEdit.document[FileToSplit].selectToTop();
         // Is the file not already empty?
         if (UltraEdit.document[FileToSplit].isSel()) {
            // Cut the selection and paste it into a new file.
            UltraEdit.document[FileToSplit].cut();
            UltraEdit.newFile();
            UltraEdit.document[NewFileIndex].setActive();
            UltraEdit.activeDocument.paste();


            /* Add line termination on the last line and remove automatically added indent
               spaces/tabs if auto-indent is enabled if the last line is not already terminated. */
            if (UltraEdit.activeDocument.isColNumGt(1)) {
               UltraEdit.activeDocument.insertLine();
               if (UltraEdit.activeDocument.isColNumGt(1)) {
                  UltraEdit.activeDocument.deleteToStartOfLine();
               }
            }

            // add headers and footers 

            UltraEdit.activeDocument.top();
            UltraEdit.activeDocument.write(xmlHead);
            UltraEdit.activeDocument.write(xmlRootStart);
            UltraEdit.activeDocument.bottom();
            UltraEdit.activeDocument.write(xmlRootEnd);
            // Build the file name for this new file.
            var SaveFileName = SavePath + "LETTER";
            if (FileNumber < 10) SaveFileName += "0";
            SaveFileName += String(FileNumber) + ".raw.xml";
            // Save the new file and close it.
            UltraEdit.saveAs(SaveFileName);
            UltraEdit.closeFile(SaveFileName,2);
            FileNumber++;
            StringsFound = 0;
            /* Delete the line termination in the source file
               if last found split string was at end of a line. */
            UltraEdit.document[FileToSplit].endSelect();
            UltraEdit.document[FileToSplit].key("END");
            if (UltraEdit.document[FileToSplit].isColNumGt(1)) {
               UltraEdit.document[FileToSplit].top();
            } else {
               UltraEdit.document[FileToSplit].deleteLine();
            }
         } else break;
         // Report progress after each saved file.
         UltraEdit.outputWindow.write("Progress " + SaveFileName);
      }  // Loop executed until source file is empty!

      // Close source file without saving and re-open it.
      var NameOfFileToSplit = UltraEdit.document[FileToSplit].path;
      UltraEdit.closeFile(NameOfFileToSplit,2);
      /* The following code line could be commented if the source
         file is not needed anymore for further actions. */
      UltraEdit.open(NameOfFileToSplit);

      // Free memory and switch back to Windows clipboard.
      UltraEdit.clearClipboard();
      UltraEdit.selectClipboard(0);
   }
}