Converting HTML to XML

Question

Asked by bahadir arslan

I have hundreds of HTML files that need to be converted to XML. We have been using these HTML files to serve content to applications, but now we have to serve this content as XML.

The HTML files contain tables, divs, images, p tags, b or strong tags, etc.

I googled and found some applications, but I haven't managed to get a working conversion yet.

Could you suggest a way to convert these file contents to XML?


Answer 1

Answered by Jarekczek

I was successful using the tidy command-line utility. On Linux I installed it quickly with apt-get install tidy. Then the command:

tidy -q -asxml --numeric-entities yes source.html >file.xml

produced an XML file, which I was able to process with an XSLT processor. However, I needed to set up the XHTML1 DTDs correctly.

This is their homepage: html-tidy.org (and the legacy one: HTML Tidy)
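Since the question is about hundreds of files, the single tidy command above can be wrapped in a loop. The following is only a minimal sketch, assuming a POSIX shell with tidy on the PATH. The function name convert_dir, its two directory arguments, and the TIDY override variable are hypothetical names introduced here and are not part of the original answer.

```shell
#!/bin/sh
# Sketch: convert every .html file in a directory to .xml with tidy.
# convert_dir, src_dir, out_dir and TIDY are illustrative names (assumptions).

convert_dir() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1

    for src in "$src_dir"/*.html; do
        # When nothing matches, the pattern stays unexpanded; skip it
        [ -e "$src" ] || continue

        name=$(basename "$src" .html)

        # Same options as the single-file command above
        "${TIDY:-tidy}" -q -asxml --numeric-entities yes "$src" > "$out_dir/$name.xml"
        status=$?

        # tidy exits with 1 when it only printed warnings (common for messy
        # HTML) and with 2 when it hit errors, so only report 2 and above
        if [ "$status" -ge 2 ]; then
            echo "tidy reported errors for $src" >&2
        fi
    done
}

# Example call: convert_dir ./html ./xml
```

Files for which tidy reports errors may end up empty or incomplete, so it is worth checking the messages printed on standard error before feeding the results to an XSLT processor.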

Answer 2

Answered by Bob Siefkes

I did find a way to convert (even bad) HTML into well-formed XML. I started by basing this on the DOM loadHTML function. Over time, however, several issues came up, so I optimized it and added patches to correct side effects.

    // The methods use $this, so they are wrapped in a class here.
    // The class name is a placeholder; the original answer does not name it.
    class HtmlToXmlConverter {

        // Encoding announced to loadHTML() when $needEncoding is true.
        // Assumed default: the original answer does not show where this is set.
        protected $encoding = 'UTF-8';

        public function tryToXml($dom, $content) {
            if (!$content) return false;

            // well-formed XML content can be loaded directly as a node tree;
            // appendXML() adds an XML string straight into the fragment
            $fragment = $dom->createDocumentFragment();

            // appendXML() fails on an XML declaration, so strip it manually when present
            if (substr($content, 0, 5) == '<?xml') {
                $content = substr($content, strpos($content, '>') + 1);
                if (strpos($content, '<')) {
                    $content = substr($content, strpos($content, '<'));
                }
            }

            // if appendXML() fails, fall back to htmlToXml() to repair nasty HTML
            if (!@$fragment->appendXML($content)) {
                return $this->htmlToXml($dom, $content);
            }

            return $fragment;
        }

        // convert content into XML
        // $dom is only needed to create the nodes that will be returned
        public function htmlToXml($dom, $content, $needEncoding = false, $bodyOnly = true) {

            // no XML when the HTML is empty
            if (!$content) return false;

            // real content, and possibly it needs an encoding hint
            if ($needEncoding) {
                // no character conversion needed: loadHTML() respects the content-type meta tag
                $content = '<meta http-equiv="Content-Type" content="text/html;charset=' . $this->encoding . '">' . $content;
            }

            // build a separate DOM from the content
            $domInject = new DOMDocument("1.0", "UTF-8");
            $domInject->preserveWhiteSpace = false;
            $domInject->formatOutput = true;

            // parse as HTML; the @ silences the warnings that nasty HTML normally triggers
            try {
                @$domInject->loadHTML($content);
            } catch (Exception $e) {
                // do nothing and continue
            }

            // nothing was parsed (for example whitespace-only input)
            if (!$domInject->documentElement) return false;

            // to check the encoding: echo $domInject->encoding
            $this->reworkDom($domInject);

            if ($bodyOnly) {
                $fragment = $dom->createDocumentFragment();

                // copy the nodes found within /html/body
                foreach ($domInject->documentElement->childNodes as $elementLevel1) {
                    if ($elementLevel1->nodeName == 'body' and $elementLevel1->nodeType == XML_ELEMENT_NODE) {
                        foreach ($elementLevel1->childNodes as $elementInject) {
                            $fragment->appendChild($dom->importNode($elementInject, true));
                        }
                    }
                }
            } else {
                $fragment = $dom->importNode($domInject->documentElement, true);
            }

            return $fragment;
        }

        protected function reworkDom($node, $level = 0) {

            // start with the first child node
            $nodeChild = $node->firstChild;

            while ($nodeChild) {
                // remember the next sibling first, because the current node may be removed
                $nodeNextChild = $nodeChild->nextSibling;

                switch ($nodeChild->nodeType) {
                    case XML_ELEMENT_NODE:
                        // recurse into child elements
                        $this->reworkDom($nodeChild, $level + 1);
                        break;
                    case XML_TEXT_NODE:
                    case XML_CDATA_SECTION_NODE:
                        // do nothing with text and CDATA
                        break;
                    case XML_COMMENT_NODE:
                        // XML comments must not contain "--", so replace every hyphen
                        $nodeChild->nodeValue = str_replace("-", "_", $nodeChild->nodeValue);
                        break;
                    case XML_DOCUMENT_TYPE_NODE: // 10: needs to be removed
                    case XML_PI_NODE:            // 7: remove processing instructions
                        $node->removeChild($nodeChild);
                        break;
                    case XML_DOCUMENT_NODE:
                    case XML_HTML_DOCUMENT_NODE:
                        // should not appear as a child because it is always the root;
                        // falls through to the exception below
                    default:
                        throw new Exception("Engine: reworkDom type not declared [" . $nodeChild->nodeType . "]");
                }
                $nodeChild = $nodeNextChild;
            }
        }
    }

This also makes it possible to combine several HTML pieces into one XML document, which is what I needed myself. In general it can be used like this:

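    $c = '<p>test<font>two</p>';
    $dom = new DOMDocument('1.0', 'UTF-8');

    // HtmlToXmlConverter stands for the class that holds the methods above
    $converter = new HtmlToXmlConverter();

    $n = $dom->appendChild($dom->createElement('info')); // make a root element

    if ($valueXml = $converter->tryToXml($dom, $c)) {
        $n->appendChild($valueXml);
    }
    echo '<pre>' . htmlentities($dom->saveXML($n)) . '</pre>';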

In this example '<p>test<font>two</p>' will be output as well-formed XML: '<info><p>test<font>two</font></p></info>'. The info root tag is added because it also makes it possible to convert a fragment such as '<p>one</p><p>two</p>', which is not XML on its own as it does not have a single root element. However, if your HTML is certain to have a single root element, the extra <info> root tag can be skipped.

With this I'm getting really nice XML out of unstructured and even corrupted HTML!

I hope this is reasonably clear and that others find it useful.

Answer 3

Answered by Coffee

Remember that HTML and XML are two distinct concepts in the tree of markup languages. You can't exactly replace HTML with XML. XML can be viewed as a generalized form of HTML, but even that is imprecise. You mainly use HTML to display data, and XML to carry (or store) the data.

This link is helpful: How to read HTML as XML?


More here - difference between HTML and XML

