XML Split of a Large File

Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/700213/

Time: 2020-09-06 12:25:39  Source: igfitidea

XML Split of a Large file

xml

Asked by sameer karjatkar

I have a 15 GB XML file that I want to split. It has approximately 300 million lines. It doesn't have any top nodes which are interdependent. Is there any tool available that readily does this for me?

Accepted answer by Cerebrus

I think you'll have to split manually unless you are interested in doing it programmatically. Here's a sample that does that, though it doesn't mention the maximum size of XML file it can handle. When doing it manually, the first problem that arises is how to open the file itself.

I would recommend a very simple text editor - something like Vim. When handling such large files, it is always useful to turn off all forms of syntax highlighting and/or folding.

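If the goal is only to look at the top of the file before deciding how to split it, an editor may not be needed at all. The following is a minimal Python sketch, not part of the original answer; the function name `peek` and its parameters are made up for illustration, and it assumes the file contains line breaks (the question mentions about 300 million lines).

```python
from itertools import islice


def peek(path, max_lines=20, encoding="utf-8"):
    """Return the first max_lines lines of a text file without reading the rest.

    Hypothetical helper: assumes the file has line breaks, so each line is small.
    """
    with open(path, encoding=encoding, errors="replace") as handle:
        # islice stops after max_lines, so the remainder of the file is never read.
        return [line.rstrip("\n") for line in islice(handle, max_lines)]
```

Calling `peek("15GB.xml")` would return the first 20 lines as a list of strings, which is usually enough to see the root element and the shape of one record.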

Other options worth considering:

  1. EditPadPro: I've never tried it with anything this size, but if it's anything like other JGSoft products, it should work like a breeze. Remember to turn off syntax highlighting.

  2. VEdit: I've used this with files of 1GB in size; it works as if it were nothing at all.

  3. EmEditor

Answer by Gfy

XmlSplit - A Command-line Tool That Splits Large XML Files

xml_split - split huge XML documents into smaller chunks

Split that XML by bhayanakmaut (No source code and I could not get this one working)

A similar question: How do I split a large xml file?

Answer by eleg

QXMLEdit has a dedicated function for that. I used it successfully with a Wikipedia dump: the ~2.7 GiB file became about 1,400,000 files (one per page). It even lets you distribute them into subfolders.

Answer by Ben Bryant

Here is a low memory footprint script to do it in the free firstobject XML editor (foxe) using CMarkup file mode. I am not sure what you mean by no interdependent top nodes, or tag checking, but assuming that under the root element you have millions of top-level elements containing object properties or rows, each of which needs to be kept together as a unit, and you want, say, 1 million per output file, you could do this:

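split_xml_15GB()
{
  int nObjectCount = 0, nFileCount = 0;
  CMarkup xmlInput, xmlOutput;
  // Open the input in read file mode to keep the memory footprint low
  xmlInput.Open( "15GB.xml", MDF_READFILE );
  xmlInput.FindElem(); // root
  str sRootTag = xmlInput.GetTagName();
  xmlInput.IntoElem();
  // Loop over the top level elements under the root
  while ( xmlInput.FindElem() )
  {
    if ( nObjectCount == 0 )
    {
      // Start a new output file with the same root tag as the input
      ++nFileCount;
      xmlOutput.Open( "piece" + nFileCount + ".xml", MDF_WRITEFILE );
      xmlOutput.AddElem( sRootTag );
      xmlOutput.IntoElem();
    }
    // Copy the current element, with everything inside it, to the output
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
    ++nObjectCount;
    if ( nObjectCount == 1000000 )
    {
      // This piece is full; close it so the next element starts a new file
      xmlOutput.Close();
      nObjectCount = 0;
    }
  }
  // Close the last piece if it was left partly filled
  if ( nObjectCount )
    xmlOutput.Close();
  xmlInput.Close();
  return nFileCount;
}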

I posted a YouTube video and article about this here:

http://www.firstobject.com/xml-splitter-script-video.htm

Answer by mat_geek

The open source library comma has several tools to find data in very large XML files and to split those files into smaller files.

https://github.com/acfr/comma/wiki/XML-Utilities

The tools were built using the expat SAX parser so that they do not fill memory with a DOM tree, as xmlstarlet and saxon do.

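To illustrate why a SAX-style parser keeps memory flat, here is a small Python sketch using the standard library's expat binding. It is not taken from the comma tools; the function name `count_top_level_elements` is invented for this example. It only counts the direct children of the root, but it does so with callbacks and never builds a tree.

```python
import xml.parsers.expat


def count_top_level_elements(path):
    """Count the direct children of the root element using streaming callbacks."""
    depth = 0
    count = 0

    def start_element(name, attrs):
        nonlocal depth, count
        depth += 1
        if depth == 2:  # depth 1 is the root, depth 2 is a top-level record
            count += 1

    def end_element(name):
        nonlocal depth
        depth -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    with open(path, "rb") as handle:
        # ParseFile reads the file in chunks and fires the handlers as it goes.
        parser.ParseFile(handle)
    return count
```

Because the handlers keep only two integers, memory use should not grow with the size of the input, which is the property the answer attributes to expat-based tools.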

Answer by Shivendra

I used this to split the Yahoo Q&A dataset:

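    # 'filepath', '</your tag to split>' and 'endTag' are placeholders:
    # substitute the input path, the closing tag of one record, and the root tag.
    count = 0
    file_count = 1
    current_file = ""

    with open('filepath') as f:
        for line in f:
            current_file = current_file + line

            # Count one record each time its closing tag is seen.
            if "</your tag to split>" in line:
                count = count + 1

            if count == 50000:
                # Close the root element and write out the finished chunk.
                current_file = current_file + "</endTag>"
                with open('filepath/Split/file_' + str(file_count) + '.xml', 'w') as split:
                    split.write(current_file)
                file_count = file_count + 1
                # Start the next chunk with a fresh XML declaration and root element.
                current_file = "<?xml version='1.0' encoding='UTF-8'?>\n<endTag>"
                count = 0

    # Write whatever is left over as the final chunk.
    # Note: if the input already ends with the root's closing tag,
    # skip the next line, otherwise the tag is written twice.
    current_file = current_file + "</endTag>"
    with open('filepath/Split/file_' + str(file_count) + '.xml', 'w') as split:
        split.write(current_file)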

Answer by Farid

I used the XmlSplit Wizard tool. It works really nicely, and you can specify the split method, such as by element, rows, number of files, or file size. The only problem is that I had to buy it for $99, because the trial version won't let you split all the data; it only produces the odd-numbered split files. I was able to split a 70GB file!

Answer by user11106941

Perhaps this question is still relevant, and I believe this can help somebody. There is an XML editor, XiMpLe, which contains a tool for splitting big files. Only the fragment size is required. There is also the reverse functionality to join XML files together (!). It's free for non-commercial use, and the license is not expensive either. No installation is required. It worked very well for me (I had a 5GB file).

Answer by John Saunders

In what way do you need to split it? It's pretty easy to write code using XmlReader.ReadSubtree. It returns a new XmlReader instance that reads the current element and all its child elements. So, move to the first child of the root, call ReadSubtree, write all those nodes, call Read() using the original reader, and loop until done.

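The same loop can be sketched outside .NET. The version below uses Python's `xml.etree.ElementTree.iterparse` in place of XmlReader, so it is an analogue of the approach rather than the answer's own code. The function name `split_xml`, its parameters, and the `piece` file prefix are assumptions for illustration, and the sketch assumes the document does not use XML namespaces and that the root's attributes do not need to be carried over.

```python
import xml.etree.ElementTree as ET


def split_xml(source, records_per_file, prefix="piece"):
    """Write the root's children to numbered files, records_per_file at a time."""
    events = ET.iterparse(source, events=("start", "end"))
    _, root = next(events)  # the first "start" event is the root element
    depth = 1
    file_count = 0
    written = 0
    out = None
    for event, elem in events:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 1:  # only act when a direct child of the root has ended
            continue
        if out is None:
            # Start a new piece with the same root tag (no namespaces assumed).
            file_count += 1
            out = open(f"{prefix}{file_count}.xml", "w", encoding="utf-8")
            out.write(f"<{root.tag}>\n")
        elem.tail = None  # drop whitespace that follows the element
        out.write(ET.tostring(elem, encoding="unicode"))
        out.write("\n")
        written += 1
        root.clear()  # release the finished child so memory stays flat
        if written == records_per_file:
            out.write(f"</{root.tag}>\n")
            out.close()
            out, written = None, 0
    if out is not None:
        # Close the last, partly filled piece.
        out.write(f"</{root.tag}>\n")
        out.close()
    return file_count
```

A call such as `split_xml("15GB.xml", 1000000)` would write `piece1.xml`, `piece2.xml`, and so on to the current directory and return the number of pieces. Clearing the root after each child is what keeps the whole document from accumulating in memory.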

Answer by MrTelly

Not an XML tool, but UltraEdit could probably help. I've used it with 2GB files and it didn't mind at all; make sure you turn off the auto-backup feature, though.