Parsing huge XML files in PHP

Question

Asked by Ian

I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (+1GB) XML file in PHP for parsing?

Answer 1

Answered by Emil H

There are only two PHP APIs that are really suited to processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read a continuous stream rather than loading the entire tree into memory (which is what SimpleXML and DOM do).

For an example, you might want to look at this partial parser of the DMOZ-catalog:

<?php

class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;

    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;

        $this->_parser = xml_parser_create("UTF-8");
        xml_set_object($this->_parser, $this);
        xml_set_element_handler($this->_parser, "startTag", "endTag");
    }

    public function startTag($parser, $name, $attribs)
    {
        array_push($this->_stack, $this->_current);

        if ($name == "TOPIC" && count($attribs)) {
            $this->_currentId = $attribs["R:ID"];
        }

        if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }

        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

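    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Epic fail!\n");
        }

        // Feed the parser 4 KB at a time so the whole file is never in memory.
        while (!feof($fh)) {
            $data = fread($fh, 4096);
            if (!xml_parse($this->_parser, $data, feof($fh))) {
                // Stop on malformed input and report where it happened.
                $code = xml_get_error_code($this->_parser);
                $line = xml_get_current_line_number($this->_parser);
                die(xml_error_string($code) . " at line " . $line . "\n");
            }
        }

        fclose($fh);
    }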
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
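
The same filter can also be written against XMLReader, the second streaming API mentioned above. This is only a sketch: the function name linksUnderTopic and the tiny inline sample are made up for illustration, and the sample merely assumes a Topic / link layout with r:id and r:resource attributes, matching what the expat handlers above look for. Unlike expat with its default case folding, XMLReader reports names in their original case.

```php
<?php
// Sketch: collect link targets that appear under topics whose id starts
// with $prefix. Function name and sample data are illustrative assumptions.
function linksUnderTopic(XMLReader $reader, string $prefix): array
{
    $currentId = '';
    $links = [];

    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue; // skip text, whitespace and end tags
        }
        if ($reader->name === 'Topic') {
            // remember the most recent topic id, like $_currentId above
            $id = $reader->getAttribute('r:id');
            if ($id !== null) {
                $currentId = $id;
            }
        } elseif ($reader->name === 'link' && strpos($currentId, $prefix) === 0) {
            $links[] = (string) $reader->getAttribute('r:resource');
        }
    }

    return $links;
}

$sample = <<<'XML'
<RDF xmlns:r="http://www.w3.org/TR/RDF/">
  <Topic r:id="Top/Home/Consumer_Information/Electronics/Audio">
    <link r:resource="http://example.com/audio"/>
  </Topic>
  <Topic r:id="Top/Arts">
    <link r:resource="http://example.com/arts"/>
  </Topic>
</RDF>
XML;

$reader = new XMLReader();
$reader->XML($sample); // for a real file: $reader->open('content.rdf.u8')
$links = linksUnderTopic($reader, 'Top/Home/Consumer_Information/Electronics/');
$reader->close();
print_r($links);
```

Only the current node is held in memory at any time, so the same loop should behave the same way on a multi-gigabyte file opened with open().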

Answer 2

Answered by oskarth

This is a very similar question to "Best way to process large XML in PHP", but with a very good, upvoted answer that addresses the specific problem of DMOZ catalogue parsing. However, since this is a good Google hit for large XML files in general, I will repost my answer from the other question as well:

My take on it:

https://github.com/prewk/XmlStreamer

A simple class that extracts all children of the XML root element while streaming the file. Tested on a 108 MB XML file from pubmed.com.

class SimpleXmlStreamer extends XmlStreamer {
    public function processNode($xmlString, $elementName, $nodeIndex) {
        $xml = simplexml_load_string($xmlString);

        // Do something with your SimpleXML object

        return true;
    }
}

$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();

Answer 3

Answered by oskarth

I've recently had to parse some pretty large XML documents, and needed a method to read one element at a time.

If you have the following file complex-test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Complex>
  <Object>
    <Title>Title 1</Title>
    <Name>It's name goes here</Name>
    <ObjectData>
      <Info1></Info1>
      <Info2></Info2>
      <Info3></Info3>
      <Info4></Info4>
    </ObjectData>
    <Date></Date>
  </Object>
  <Object></Object>
  <Object>
    <AnotherObject></AnotherObject>
    <Data></Data>
  </Object>
  <Object></Object>
  <Object></Object>
</Complex>

And you want to return each <Object/> element:

PHP:

require_once('class.chunk.php');

$file = new Chunk('complex-test.xml', array('element' => 'Object'));

while ($xml = $file->read()) {
  $obj = simplexml_load_string($xml);
  // do some parsing, insert to DB whatever
}
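
The "insert to DB" comment in the loop above could be filled in roughly as follows. This is a sketch under assumptions, not part of the Chunk class: the table name objects, its columns, the helper insertObjects and the batch size are invented for illustration, and an in-memory SQLite database stands in for MySQL so the snippet runs on its own (it needs the pdo_sqlite extension). One prepared statement is reused, and rows are committed in batches rather than one transaction per row.

```php
<?php
// Hypothetical helper: store the Title and Name of each XML fragment.
function insertObjects(PDO $db, iterable $fragments, int $batchSize = 500): int
{
    $stmt = $db->prepare('INSERT INTO objects (title, name) VALUES (?, ?)');
    $count = 0;

    $db->beginTransaction();
    foreach ($fragments as $xml) {
        $obj = simplexml_load_string($xml);
        if ($obj === false) {
            continue; // skip fragments that are not well-formed
        }
        $stmt->execute([(string) $obj->Title, (string) $obj->Name]);

        // commit every $batchSize rows to keep transactions small
        if (++$count % $batchSize === 0) {
            $db->commit();
            $db->beginTransaction();
        }
    }
    $db->commit();

    return $count;
}

// SQLite in memory is an assumption made so the sketch is self-contained.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE objects (title TEXT, name TEXT)');

$fragments = [
    '<Object><Title>Title 1</Title><Name>First</Name></Object>',
    '<Object><Title>Title 2</Title><Name>Second</Name></Object>',
];
$inserted = insertObjects($db, $fragments, 1);
```

With the Chunk class, the fragments would come from the while ($xml = $file->read()) loop instead of an array, and for MySQL only the PDO connection would change.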

###########
Class File
###########

<?php
/**
 * Chunk
 * 
 * Reads a large file in as chunks for easier parsing.
 * 
 * The chunks returned are whole <$this->options['element']/>s found within file.
 * 
 * Each call to read() returns the whole element including start and end tags.
 * 
 * Tested with a 1.8MB file, extracted 500 elements in 0.11s
 * (with no work done, just extracting the elements)
 * 
 * Usage:
 * <code>
 *   // initialize the object
 *   $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
 *   
 *   // loop through the file until all lines are read
 *   while ($xml = $file->read()) {
 *     // do whatever you want with the string
 *     $o = simplexml_load_string($xml);
 *   }
 * </code>
 * 
 * @package default
 * @author Dom Hastings
 */
class Chunk {
  /**
   * options
   *
   * @var array Contains all major options
   * @access public
   */
  public $options = array(
    'path' => './',       // string The path to check for $file in
    'element' => '',      // string The XML element to return
    'chunkSize' => 512    // integer The amount of bytes to retrieve in each chunk
  );

  /**
   * file
   *
   * @var string The filename being read
   * @access public
   */
  public $file = '';
  /**
   * pointer
   *
   * @var integer The current position the file is being read from
   * @access public
   */
  public $pointer = 0;

  /**
   * handle
   *
   * @var resource The fopen() resource
   * @access private
   */
  private $handle = null;
  /**
   * reading
   *
   * @var boolean Whether the script is currently reading the file
   * @access private
   */
  private $reading = false;
  /**
   * readBuffer
   * 
   * @var string Used to make sure start tags aren't missed
   * @access private
   */
  private $readBuffer = '';

  /**
   * __construct
   * 
   * Builds the Chunk object
   *
   * @param string $file The filename to work with
   * @param array $options The options with which to parse the file
   * @author Dom Hastings
   * @access public
   */
  public function __construct($file, $options = array()) {
    // merge the options together
    $this->options = array_merge($this->options, (is_array($options) ? $options : array()));

    // check that the path ends with a /
    if (substr($this->options['path'], -1) != '/') {
      $this->options['path'] .= '/';
    }

    // normalize the filename
    $file = basename($file);

    // make sure chunkSize is an int
    $this->options['chunkSize'] = intval($this->options['chunkSize']);

    // check it's valid
    if ($this->options['chunkSize'] < 64) {
      $this->options['chunkSize'] = 512;
    }

    // set the filename
    $this->file = realpath($this->options['path'].$file);

    // check the file exists
    if (!file_exists($this->file)) {
      throw new Exception('Cannot load file: '.$this->file);
    }

    // open the file
    $this->handle = fopen($this->file, 'r');

    // check the file opened successfully
    if (!$this->handle) {
      throw new Exception('Error opening file for reading');
    }
  }

  /**
   * __destruct
   * 
   * Cleans up
   *
   * @return void
   * @author Dom Hastings
   * @access public
   */
  public function __destruct() {
    // close the file resource
    fclose($this->handle);
  }

  /**
   * read
   * 
   * Reads the first available occurrence of the XML element $this->options['element']
   *
   * @return string The XML string from $this->file
   * @author Dom Hastings
   * @access public
   */
  public function read() {
    // check we have an element specified
    if (!empty($this->options['element'])) {
      // trim it
      $element = trim($this->options['element']);

    } else {
      $element = '';
    }

    // initialize the buffer
    $buffer = false;

    // if the element is empty
    if (empty($element)) {
      // let the script know we're reading
      $this->reading = true;

      // read in the whole doc, cos we don't know what's wanted
      while ($this->reading) {
        $buffer .= fread($this->handle, $this->options['chunkSize']);

        $this->reading = (!feof($this->handle));
      }

      // return it all
      return $buffer;

    // we must be looking for a specific element
    } else {
      // set up the strings to find
      $open = '<'.$element.'>';
      $close = '</'.$element.'>';

      // let the script know we're reading
      $this->reading = true;

      // reset the global buffer
      $this->readBuffer = '';

      // this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
      $store = false;

      // seek to the position we need in the file
      fseek($this->handle, $this->pointer);

      // start reading
      while ($this->reading && !feof($this->handle)) {
        // store the chunk in a temporary variable
        $tmp = fread($this->handle, $this->options['chunkSize']);

        // update the global buffer
        $this->readBuffer .= $tmp;

        // check for the open string
        $checkOpen = strpos($tmp, $open);

        // if it wasn't in the new buffer
        if (!$checkOpen && !($store)) {
          // check the full buffer (in case it was only half in this buffer)
          $checkOpen = strpos($this->readBuffer, $open);

          // if it was in there
          if ($checkOpen) {
            // set it to the remainder
            $checkOpen = $checkOpen % $this->options['chunkSize'];
          }
        }

        // check for the close string
        $checkClose = strpos($tmp, $close);

        // if it wasn't in the new buffer
        if (!$checkClose && ($store)) {
          // check the full buffer (in case it was only half in this buffer)
          $checkClose = strpos($this->readBuffer, $close);

          // if it was in there
          if ($checkClose) {
            // set it to the remainder plus the length of the close string itself
            $checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
          }

        // if it was
        } elseif ($checkClose) {
          // add the length of the close string itself
          $checkClose += strlen($close);
        }

        // if we've found the opening string and we're not already reading another element
        if ($checkOpen !== false && !($store)) {
          // if we've found the end element too
          if ($checkClose !== false) {
            // append the string only between the start and end element
            $buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));

            // update the pointer
            $this->pointer += $checkClose;

            // let the script know we're done
            $this->reading = false;

          } else {
            // append the data we know to be part of this element
            $buffer .= substr($tmp, $checkOpen);

            // update the pointer
            $this->pointer += $this->options['chunkSize'];

            // let the script know we're gonna be storing all the data until we find the close element
            $store = true;
          }

        // if we've found the closing element
        } elseif ($checkClose !== false) {
          // update the buffer with the data up to and including the close tag
          $buffer .= substr($tmp, 0, $checkClose);

          // update the pointer
          $this->pointer += $checkClose;

          // let the script know we're done
          $this->reading = false;

        // if we've found the closing element, but half in the previous chunk
        } elseif ($store) {
          // update the buffer
          $buffer .= $tmp;

          // and the pointer
          $this->pointer += $this->options['chunkSize'];
        }
      }
    }

    // return the element (or the whole file if we're not looking for elements)
    return $buffer;
  }
}

Answer 4

Answered by Tetsujin no Oni

I would suggest using a SAX based parser rather than DOM based parsing.

Info on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm
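
As a minimal sketch of what SAX-style parsing looks like with PHP's built-in XML Parser (expat) functions: the parser calls back on every start tag, end tag and run of text, and the caller feeds it the document piece by piece. The function name collectText and the sample data are assumptions for illustration only.

```php
<?php
// Sketch: gather the text of every <$element> while feeding the parser
// in small pieces. collectText() is a made-up name for illustration.
function collectText(iterable $chunks, string $element): array
{
    $parser = xml_parser_create('UTF-8');
    // keep element names as written instead of upper-casing them
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    $found = [];
    $inside = false;
    $text = '';

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attribs) use ($element, &$inside, &$text) {
            if ($name === $element) {
                $inside = true;
                $text = '';
            }
        },
        function ($p, $name) use ($element, &$inside, &$text, &$found) {
            if ($name === $element) {
                $found[] = $text;
                $inside = false;
            }
        }
    );
    // text may arrive in several calls, so append rather than assign
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$inside, &$text) {
            if ($inside) {
                $text .= $data;
            }
        }
    );

    foreach ($chunks as $chunk) {
        if (xml_parse($parser, $chunk, false) !== 1) {
            $code = xml_get_error_code($parser);
            throw new RuntimeException((string) xml_error_string($code));
        }
    }
    xml_parse($parser, '', true); // signal end of input

    return $found;
}

$xml = '<Complex><Object><Title>Title 1</Title></Object>'
     . '<Object><Title>Title 2</Title></Object></Complex>';
// 16-byte pieces stand in for fread() calls on a large file
$titles = collectText(str_split($xml, 16), 'Title');
```

For a real file the chunks would come from repeated fread() calls, so memory use stays bounded by the chunk size plus whatever the handlers choose to keep.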

Answer 5

Answered by Frank Farmer

This isn't a great solution, but just to throw another option out there:

You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with would be).

e.g., if your doc looks like:

<dmoz>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  ...
</dmoz>

You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in a root-level tag, and then load them via simplexml/domxml (I used domxml when taking this approach).

Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that, you're stuck with either the above chunking strategy, or the old SAX/expat lib. And I don't know about the rest of you, but I HATE writing/maintaining SAX/expat parsers.
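
The chunk-and-wrap idea might look like the following sketch. It assumes a flat list of attribute-less <listing> elements that are never nested and never appear inside comments or CDATA. The function name eachListing, the wrapper element name chunk and the block size are illustrative choices, and SimpleXML is used here rather than domxml.

```php
<?php
// Sketch: read the file in blocks, wrap the complete <listing> elements of
// each block in an artificial root, and hand them to $handle one by one.
function eachListing(string $path, callable $handle, int $blockSize = 1048576): int
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $close = '</listing>';
    $carry = '';   // unprocessed tail of the previous block
    $count = 0;

    while (!feof($fh)) {
        $carry .= fread($fh, $blockSize);

        $start = strpos($carry, '<listing>');
        $end = strrpos($carry, $close);
        if ($start === false || $end === false || $end < $start) {
            continue; // no complete element yet, read more
        }
        $end += strlen($close);

        // wrap the complete elements in an artificial root and parse them
        $doc = simplexml_load_string(
            '<chunk>' . substr($carry, $start, $end - $start) . '</chunk>'
        );
        if ($doc === false) {
            throw new RuntimeException('Chunk is not well-formed XML');
        }
        foreach ($doc->listing as $listing) {
            $handle($listing);
            $count++;
        }

        $carry = substr($carry, $end); // keep the incomplete remainder
    }
    fclose($fh);

    return $count;
}

// Tiny demonstration file; the 16-byte block size forces tags to be split
// across reads, which is the case the $carry buffer exists for.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $path,
    "<dmoz>\n  <listing>a</listing>\n  <listing>b</listing>\n  <listing>c</listing>\n</dmoz>"
);
$seen = [];
$total = eachListing($path, function (SimpleXMLElement $listing) use (&$seen) {
    $seen[] = (string) $listing;
}, 16);
unlink($path);
```

The plain string search is what makes this fragile: it knows nothing about XML, so it only holds up under the assumptions listed above.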

Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works great for any sort of list of files, URLs, etc., but wouldn't make sense for parsing a large HTML document).

Answer 6

Answered by Szekelygobe

This is an old post, but it is the first Google search result, so I thought I'd post another solution, based on this post:

http://drib.tech/programming/parse-large-xml-files-php

This solution uses both XMLReader and SimpleXMLElement:

$xmlFile = 'the_LARGE_xml_file_to_load.xml';
$primEL  = 'the_name_of_your_element';

$xml     = new XMLReader();
$xml->open($xmlFile);

// finding first primary element to work with
while($xml->read() && $xml->name != $primEL){;}

// looping through elements
while($xml->name == $primEL) {
    // loading element data into simpleXML object
    $element = new SimpleXMLElement($xml->readOuterXML());

    // DO STUFF

    // moving pointer   
    $xml->next($primEL);
    // clearing current element
    unset($element);
} // end while

$xml->close();

Answer 7

Answered by ThW

You can combine XMLReader with DOM for this. In PHP both APIs (and SimpleXML) are based on the same library, libxml2. Large XML files are typically a list of records. So you use XMLReader to iterate the records, load a single record into DOM, and use DOM methods and XPath to extract values. The key is the method XMLReader::expand(). It loads the current node of an XMLReader instance and its descendants as DOM nodes.

Example XML:

<books>
  <book>
    <title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title>
  </book>
  <book>
    <title isbn="978-0596100506">XML Pocket Reference</title>
  </book>
  <!-- ... -->
</books>

Example code:

// open the XML file
$reader = new XMLReader();
$reader->open('books.xml');

// prepare a DOM document
$document = new DOMDocument();
$xpath = new DOMXpath($document);

// find the first `book` element node at any depth
while ($reader->read() && $reader->localName !== 'book') {
  continue;
}

// as long as there is a node with the name "book"
while ($reader->localName === 'book') {
  // expand the node into the prepared DOM
  $book = $reader->expand($document);
  // use Xpath expressions to fetch values
  var_dump(
    $xpath->evaluate('string(title/@isbn)', $book),
    $xpath->evaluate('string(title)', $book)
  );
  // move to the next book sibling node
  $reader->next('book');
}
$reader->close();

Note that the expanded node is never appended to the DOM document. This allows the GC to clean it up.

This approach works with XML namespaces as well.

$namespaceURI = 'urn:example-books';

$reader = new XMLReader();
$reader->open('books.xml');

$document = new DOMDocument();
$xpath = new DOMXpath($document);
// register a prefix for the Xpath expressions
$xpath->registerNamespace('b', $namespaceURI);

// compare local node name and namespace URI
while (
  $reader->read() &&
  (
    $reader->localName !== 'book' ||
    $reader->namespaceURI !== $namespaceURI
  )
) {
  continue;
}

// iterate the book elements 
while ($reader->localName === 'book') {
  // validate that they are in the namespace
  if ($reader->namespaceURI === $namespaceURI) {
    $book = $reader->expand($document);
    var_dump(
      $xpath->evaluate('string(b:title/@isbn)', $book),
      $xpath->evaluate('string(b:title)', $book)
    );
  }
  $reader->next('book');
}
$reader->close();
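
Both loops above follow the same pattern, so it can be wrapped in a generator. The sketch below is one possible shape, not part of the original answer: the name expandElements and its parameters are made up here. It matches elements at any depth by local name and, optionally, namespace URI, and yields each one expanded into a shared DOM document.

```php
<?php
// Sketch: stream a file and yield every matching element as a DOM node.
// expandElements() is a hypothetical helper name.
function expandElements(string $file, string $localName, ?string $namespaceURI = null): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Cannot open $file");
    }
    $document = new DOMDocument();

    $more = $reader->read();
    while ($more) {
        $isMatch = $reader->nodeType === XMLReader::ELEMENT
            && $reader->localName === $localName
            && ($namespaceURI === null || $reader->namespaceURI === $namespaceURI);

        if ($isMatch) {
            yield $reader->expand($document);
            $more = $reader->next(); // skip the subtree just expanded
        } else {
            $more = $reader->read();
        }
    }
    $reader->close();
}

// Small namespaced sample written to a temporary file for the demonstration.
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $file,
    '<books xmlns="urn:example-books">'
    . '<book><title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title></book>'
    . '<book><title isbn="978-0596100506">XML Pocket Reference</title></book>'
    . '</books>'
);

$isbns = [];
foreach (expandElements($file, 'book', 'urn:example-books') as $book) {
    $title = $book->getElementsByTagName('title')->item(0);
    $isbns[] = $title->getAttribute('isbn');
}
unlink($file);
```

The calling code then reads like an ordinary foreach over records, while the file is still consumed one node at a time.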

Answer 8

Answered by Nigel Ren

I've written a wrapper for XMLReader to (IMHO) make it easier to get just the bits you're after. The wrapper lets you associate a set of paths to data elements with callbacks that run when a path is found. A path can be a regular expression, and its capture groups can be passed to the callback.

The library is at https://github.com/NigelRel3/XMLReaderReg and can also be installed using composer require nigelrel3/xml-reader-reg.

An example of how to use it...

$inputFile = __DIR__ ."/../tests/data/simpleTest1.xml";
$reader = new XMLReaderReg\XMLReaderReg();
$reader->open($inputFile);

$reader->process([
    '(.*/person(?:\[\d*\])?)' => function (SimpleXMLElement $data, $path): void {
        echo "1) Value for ".$path[1]." is ".PHP_EOL.
            $data->asXML().PHP_EOL;
    },
    '(.*/person3(\[\d*\])?)' => function (DOMElement $data, $path): void {
        echo "2) Value for ".$path[1]." is ".PHP_EOL.
            $data->ownerDocument->saveXML($data).PHP_EOL;
    },
    '/root/person2/firstname' => function (string $data): void {
        echo "3) Value for /root/person2/firstname is ". $data.PHP_EOL;
    }
    ]);

$reader->close();

As the example shows, the data can be passed to the callback as a SimpleXMLElement, a DOMElement or, as in the last callback, a string. It represents only the data that matches the path.

The paths also show how capture groups can be used: (.*/person(?:\[\d*\])?) looks for any person element (including arrays of elements), and $path[1] in the callback displays the path where this particular instance was found.

There is an expanded example in the library as well as unit tests.

Answer 9

Answered by Alex

I tested the following code with a 2 GB XML file:

<?php
set_time_limit(0);
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
    die("Failed to open 'data.xml'");
}
while($reader->read())
{
    $node = $reader->expand();
    // process $node...
}
$reader->close();
?>
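
One thing to watch in the loop above: expand() is called for every node, and expanding a node builds DOM nodes for it and all of its descendants, so expanding the root element would pull in the whole document. A hedged variation is to expand only element nodes at a chosen depth. The sketch assumes the records are direct children of the root (depth 1); the function name eachRecord and the sample data are illustrative.

```php
<?php
// Sketch: expand only the elements found at $depth and skip their subtrees
// afterwards. eachRecord() is a hypothetical helper name.
function eachRecord(string $file, callable $callback, int $depth = 1): int
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Failed to open '$file'");
    }
    $document = new DOMDocument();
    $count = 0;

    $more = $reader->read();
    while ($more) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->depth === $depth) {
            $callback($reader->expand($document));
            $count++;
            $more = $reader->next(); // jump past this record's subtree
        } else {
            $more = $reader->read();
        }
    }
    $reader->close();

    return $count;
}

// Small sample written to a temporary file for the demonstration.
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $file,
    '<data><row>1</row><row>2</row><other><row>nested</row></other></data>'
);

$names = [];
$total = eachRecord($file, function (DOMNode $node) use (&$names) {
    $names[] = $node->nodeName;
});
unlink($file);
```

The nested row is not reported on its own because it sits at depth 2; it is only reachable through the expanded other element.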
