PHP: How to save the HTML of a DOMDocument without the HTML wrapper?

Question

Asked by Scott B

In the function below, I'm struggling to output the DOMDocument without it adding the XML, HTML, body and p tag wrappers before the content. The suggested fix:

$postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));

This only works when the content has no block-level elements inside it. When it does, as in the example below with the h1 element, the output from saveXML is truncated to...

<p>If you like</p>

I've been pointed to this post as a possible workaround, but I can't understand how to implement it into this solution (see commented out attempts below).

Any suggestions?

function rseo_decorate_keyword($postarray) {
    global $post;
    $keyword = "Jasmine Tea";
    $content = "If you like <h1>jasmine tea</h1> you will really like it with Jasmine Tea flavors. This is the last ocurrence of the phrase jasmine tea within the content. If there are other instances of the keyword jasmine tea within the text what happens to jasmine tea.";
    $d = new DOMDocument();
    @$d->loadHTML($content);
    $x = new DOMXpath($d);
    $count = $x->evaluate("count(//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$keyword') and (ancestor::b or ancestor::strong)])");
    if ($count > 0) return $postarray;
    $nodes = $x->query("//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$keyword') and not(ancestor::h1) and not(ancestor::h2) and not(ancestor::h3) and not(ancestor::h4) and not(ancestor::h5) and not(ancestor::h6) and not(ancestor::b) and not(ancestor::strong)]");
    if ($nodes && $nodes->length) {
        $node = $nodes->item(0);
        // Split just before the keyword
        $keynode = $node->splitText(strpos($node->textContent, $keyword));
        // Split after the keyword
        $node->nextSibling->splitText(strlen($keyword));
        // Replace keyword with <b>keyword</b>
        $replacement = $d->createElement('strong', $keynode->textContent);
        $keynode->parentNode->replaceChild($replacement, $keynode);
    }
    $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));
    //  $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->item(1));
    //  $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->childNodes);
    return $postarray;
}

Answer 1

Answered by Alessandro Vendruscolo

All of these answers are now wrong, because as of PHP 5.4 and Libxml 2.6 loadHTML now has an $options parameter which tells Libxml how it should parse the content.

Therefore, if we load the HTML with these options

$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

when doing saveHTML() there will be no doctype, no <html>, and no <body>.

LIBXML_HTML_NOIMPLIED turns off the automatic adding of implied html/body elements. LIBXML_HTML_NODEFDTD prevents a default doctype from being added when one is not found.

Full documentation about Libxml parameters is here

(Note that the loadHTML docs say that Libxml 2.6 is needed, but LIBXML_HTML_NODEFDTD is only available in Libxml 2.7.8 and LIBXML_HTML_NOIMPLIED is available in Libxml 2.7.7.)
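
As a rough illustration of the two options together, the sketch below loads a fragment and saves it back. The sample markup and variable names are assumptions made for the example, and the input deliberately has a single root element so the parser does not have to guess one.

```php
<?php
// Illustrative input (not from the question); one root element wraps everything.
$content = '<div><h1>Jasmine Tea</h1><p>If you like it, read on.</p></div>';

$doc = new DOMDocument();
// NOIMPLIED: do not add implied <html>/<body> elements.
// NODEFDTD: do not add a default doctype when none is present.
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// trim() drops any surrounding whitespace from the serialized string.
$output = trim($doc->saveHTML());
echo $output, "\n";
```

Input with several top-level elements may behave differently, so it is worth checking the output against your own content before relying on it.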

Answer 2

Answered by Alex

Just remove the nodes directly after loading the document with loadHTML():

# remove <!DOCTYPE 
$doc->removeChild($doc->doctype);           

# remove <html><body></body></html> 
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);

Answer 3

Answered by Jonah

Use saveXML() instead, and pass the documentElement as an argument to it.

$innerHTML = '';
foreach ($document->getElementsByTagName('p')->item(0)->childNodes as $child) {
    $innerHTML .= $document->saveXML($child);
}
echo $innerHTML;

http://php.net/domdocument.savexml
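
The loop above only covers the children of the first p element. For the question's input, a variant that walks the children of body instead might look like the sketch below. The helper name body_inner_html is made up for this example, and it uses saveHTML($node) (available since PHP 5.3.6) rather than saveXML so the output follows HTML serialization rules.

```php
<?php
// Hypothetical helper: serialize every child of <body>, not just the first <p>.
function body_inner_html(DOMDocument $doc): string
{
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $html = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) serializes one node (element or text) at a time.
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings quietly
$doc->loadHTML('If you like <h1>jasmine tea</h1> you will really like it.');
libxml_clear_errors();

$result = body_inner_html($doc);
echo $result, "\n";
```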

Answer 4

Answered by Super Cat

The issue with the top answer is that LIBXML_HTML_NOIMPLIED is unstable.

It can reorder elements (particularly, moving the top element's closing tag to the bottom of the document), add random p tags, and perhaps cause a variety of other issues[1]. It may remove the html and body tags for you, but at the cost of unstable behavior. In production, that's a red flag. In short:

Don't use LIBXML_HTML_NOIMPLIED. Instead, use substr.

Think about it. The lengths of <html><body> and </body></html> are fixed, and they sit at the two ends of the document: their sizes never change, and neither do their positions. This allows us to use substr to cut them away:

$dom = new domDocument; 
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);

echo substr($dom->saveHTML(), 12, -15); // the star of this operation

(THIS IS NOT THE FINAL SOLUTION HOWEVER! See below for the complete answer, keep reading for context)

We cut 12 away from the start of the document because <html><body> = 12 characters (<<>> + html + body = 4 + 4 + 4), and we go backwards and cut 15 off the end because </body></html>\n = 15 characters (\n + // + <<>> + body + html = 1 + 2 + 4 + 4 + 4).

Notice that I still use LIBXML_HTML_NODEFDTD to keep the !DOCTYPE from being included. First, this simplifies the substr removal of the HTML/BODY tags. Second, we don't remove the doctype with substr because we don't know if the 'default doctype' will always be something of a fixed length. But, most importantly, LIBXML_HTML_NODEFDTD stops the DOM parser from applying a non-HTML5 doctype to the document, which at least prevents the parser from treating elements it doesn't recognize as loose text.

We know for a fact that the HTML/BODY tags are of fixed lengths and positions, and we know that constants like LIBXML_HTML_NODEFDTD are never removed without some type of deprecation notice, so the above method should roll well into the future, BUT...

...the only caveat is that the DOM implementation could change the way the HTML/BODY tags are placed within the document - for instance, removing the newline at the end of the document, adding spaces between the tags, or adding newlines.

This can be remedied by searching for the positions of the opening and closing tags for body, and using those offsets as our lengths to trim off. We use strpos and strrpos to find the offsets from the front and back, respectively:

$dom = new domDocument; 
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);

$trim_off_front = strpos($dom->saveHTML(),'<body>') + 6;
// PositionOf<body> + 6 = Cutoff offset after '<body>'
// 6 = Length of '<body>'

$trim_off_end = (strrpos($dom->saveHTML(),'</body>')) - strlen($dom->saveHTML());
// ^ PositionOf</body> - LengthOfDocument = Relative-negative cutoff offset before '</body>'

echo substr($dom->saveHTML(), $trim_off_front, $trim_off_end);
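
The same trimming can be wrapped in a function that serializes the document only once. This is a minimal sketch rather than part of the original answer: the name trim_body_wrapper and the fallback of returning the untouched string when no body tags are found are assumptions.

```php
<?php
// Hypothetical helper: return whatever sits between <body> and </body>.
function trim_body_wrapper(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings quietly
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $saved = $dom->saveHTML();          // serialize once and reuse the string
    $start = strpos($saved, '<body>');  // first opening tag
    $end   = strrpos($saved, '</body>'); // last closing tag
    if ($start === false || $end === false) {
        return $saved;                  // no wrapper found; return unchanged
    }
    $start += strlen('<body>');
    // Third argument is a length here, not a negative offset.
    return substr($saved, $start, $end - $start);
}

$trimmed = trim_body_wrapper('<p>Hello</p><p>World</p>');
echo $trimmed, "\n";
```

Using a positive length ($end - $start) instead of a negative offset gives the same cut as the code above, while avoiding three separate saveHTML() calls.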

In closing, a repeat of the final, future-proof answer:

$dom = new domDocument; 
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);

$trim_off_front = strpos($dom->saveHTML(),'<body>') + 6;
$trim_off_end = (strrpos($dom->saveHTML(),'</body>')) - strlen($dom->saveHTML());

echo substr($dom->saveHTML(), $trim_off_front, $trim_off_end);

No doctype, no html tag, no body tag. We can only hope the DOM parser will receive a fresh coat of paint soon and we can more directly eliminate these unwanted tags.

Answer 5

Answered by lonesomeday

A neat trick is to use loadXML and then saveHTML. The html and body tags are inserted at the load stage, not the save stage.

$dom = new DOMDocument;
$dom->loadXML('<p>My DOMDocument contents are here</p>');
echo $dom->saveHTML();

Note that this is a bit hacky, and you should use Jonah's answer if you can get it to work.

Answer 6

Answered by jcp

Use DOMDocumentFragment:

$html = 'what you want';
$doc = new DomDocument();
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($html);
$doc->appendChild($fragment);
echo $doc->saveHTML();

Answer 7

Answered by Vixxs

It's 2017, and for this 2011 Question I don't like any of the answers. Lots of regex, big classes, loadXML etc...

Easy solution which solves the known problems:

$dom = new DOMDocument();
$dom->loadHTML( '<html><body>'.mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8').'</body></html>' , LIBXML_HTML_NODEFDTD);
$html = substr(trim($dom->saveHTML()),12,-14);

Easy, simple, solid, fast. This code handles HTML tags and encodings such as:

$html = '<p>??ü</p><p>?</p>';

If anybody finds an error, please tell me; I will use this myself.

Edit: other valid options that work without errors (very similar to ones already given):

@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$saved_dom = trim($dom->saveHTML());
$start_dom = stripos($saved_dom,'<body>')+6;
$html = substr($saved_dom,$start_dom,strripos($saved_dom,'</body>') - $start_dom );

You could add body yourself to prevent any strange behavior in the future.

Third option:

$mock = new DOMDocument;
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    $mock->appendChild($mock->importNode($child, true));
}
$html = trim($mock->saveHTML());

Answer 8

Answered by hakre

I'm a bit late to the club but didn't want to not share a method I've found out about. First of all, I've got the right versions for loadHTML() to accept these nice options, but LIBXML_HTML_NOIMPLIED didn't work on my system. Users also report problems with the parser (for example here and here).

The solution I created actually is pretty simple.

The HTML to be loaded is put in a <div> element so it has a container containing all nodes to be loaded.

Then this container element is removed from the document (but its DOMElement still exists).

Then all direct children of the document are removed. This includes any added <html>, <head> and <body> tags (effectively the LIBXML_HTML_NOIMPLIED option) as well as the <!DOCTYPE html ... loose.dtd"> declaration (effectively LIBXML_HTML_NODEFDTD).

Then all direct children of the container are added to the document again, and it can be output.

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';

$doc = new DOMDocument();

$doc->loadHTML("<div>$str</div>");

$container = $doc->getElementsByTagName('div')->item(0);

$container = $container->parentNode->removeChild($container);

while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

while ($container->firstChild ) {
    $doc->appendChild($container->firstChild);
}

$htmlFragment = $doc->saveHTML();

XPath works as usual, just take care that there are multiple document elements now, so not a single root node:

$xpath = new DOMXPath($doc);
foreach ($xpath->query('/p') as $element)
{   #                   ^- note the single slash "/"
    # ... each of the two <p> elements
}

PHP 5.4.36-1+deb.sury.org~precise+2 (cli) (built: Dec 21 2014 20:28:53)

Answer 9

Answered by plowman

None of the other solutions at the time of this writing (June, 2012) were able to completely meet my needs, so I wrote one which handles the following cases:

Accepts plain-text content which has no tags, as well as HTML content.
Does not append any tags (including <doctype>, <xml>, <html>, <body>, and <p> tags).
Leaves anything wrapped in <p> alone.
Leaves empty text alone.

So here is a solution which fixes those issues:

class DOMDocumentWorkaround
{
    /**
     * Convert a string which may have HTML components into a DOMDocument instance.
     *
     * @param string $html - The HTML text to turn into a DOMDocument.
     * @return \DOMDocument - A DOMDocument created from the given html.
     */
    public static function getDomDocumentFromHtml($html)
    {
        $domDocument = new DOMDocument();

        // Wrap the HTML in <div> tags because loadXML expects everything to be within some kind of tag.
        // LIBXML_NOERROR and LIBXML_NOWARNING mean this will fail silently and return an empty DOMDocument if it fails.
        $domDocument->loadXML('<div>' . $html . '</div>', LIBXML_NOERROR | LIBXML_NOWARNING);

        return $domDocument;
    }

    /**
     * Convert a DOMDocument back into an HTML string, which is reasonably close to what we started with.
     *
     * @param \DOMDocument $domDocument
     * @return string - The resulting HTML string
     */
    public static function getHtmlFromDomDocument($domDocument)
    {
        // Convert the DOMDocument back to a string.
        $xml = $domDocument->saveXML();

        // Strip out the XML declaration, if one exists
        $xmlDeclaration = "<?xml version=\"1.0\"?>\n";
        if (substr($xml, 0, strlen($xmlDeclaration)) == $xmlDeclaration) {
            $xml = substr($xml, strlen($xmlDeclaration));
        }

        // If the original HTML was empty, loadXML collapses our <div></div> into <div/>. Remove it.
        if ($xml == "<div/>\n") {
            $xml = '';
        }
        else {
            // Remove the opening <div> tag we previously added, if it exists.
            $openDivTag = "<div>";
            if (substr($xml, 0, strlen($openDivTag)) == $openDivTag) {
                $xml = substr($xml, strlen($openDivTag));
            }

            // Remove the closing </div> tag we previously added, if it exists.
            $closeDivTag = "</div>\n";
            $closeChunk = substr($xml, -strlen($closeDivTag));
            if ($closeChunk == $closeDivTag) {
                $xml = substr($xml, 0, -strlen($closeDivTag));
            }
        }

        return $xml;
    }
}

I also wrote some tests which would live in that same class:

public static function testHtmlToDomConversions($content)
{
    // test that converting the $content to a DOMDocument and back does not change the HTML
    if ($content !== self::getHtmlFromDomDocument(self::getDomDocumentFromHtml($content))) {
        echo "Failed\n";
    }
    else {
        echo "Succeeded\n";
    }
}

public static function testAll()
{
    self::testHtmlToDomConversions('<p>Here is some sample text</p>');
    self::testHtmlToDomConversions('<div>Lots of <div>nested <div>divs</div></div></div>');
    self::testHtmlToDomConversions('Normal Text');
    self::testHtmlToDomConversions(''); //empty
}

You can check for yourself that it works. DOMDocumentWorkaround::testAll() prints this:

    Succeeded
    Succeeded
    Succeeded
    Succeeded

Answer 10

Answered by rclai

Okay I found a more elegant solution, but it's just tedious:

$d = new DOMDocument();
@$d->loadHTML($yourcontent);
// ...
// do your manipulation, processing, etc of it blah blah blah
// ...
// then to save, do this
$x = new DOMXPath($d);
$everything = $x->query("body/*"); // retrieves all elements inside body tag
if ($everything->length > 0) { // check if it retrieved anything in there
      $output = '';
      foreach ($everything as $thing) {
           $output .= $d->saveXML($thing);
      }
      echo $output; // voila, no more annoying html wrappers or body tag
}
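
One thing to watch with the query above: in XPath, * selects element nodes only, so any text sitting directly under body would be skipped. A possible variant, sketched here with the question's sample text as assumed input, uses node() instead and serializes each match with saveHTML($node).

```php
<?php
$d = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of using @
$d->loadHTML('If you like <h1>jasmine tea</h1> you will really like it.');
libxml_clear_errors();

$x = new DOMXPath($d);
// node() matches elements, text nodes and comments; * matches elements only.
$everything = $x->query('/html/body/node()');

$output = '';
foreach ($everything as $thing) {
    $output .= $d->saveHTML($thing);
}
echo $output, "\n";
```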

Alright, hopefully this does not omit anything and helps somebody?
