Original question: http://stackoverflow.com/questions/4879946/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to saveHTML of DOMDocument without HTML wrapper?
Asked by Scott B
In the function below, I'm struggling to output the DOMDocument without it adding the XML, HTML, body and p tag wrappers before the content. The suggested fix:
$postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));
Only works when the content has no block level elements inside it. However, when it does, as in the example below with the h1 element, the resulting output from saveXML is truncated to...
<p>If you like</p>
I've been pointed to this post as a possible workaround, but I can't understand how to implement it into this solution (see commented out attempts below).
Any suggestions?
function rseo_decorate_keyword($postarray) {
    global $post;
    $keyword = "Jasmine Tea";
    $content = "If you like <h1>jasmine tea</h1> you will really like it with Jasmine Tea flavors. This is the last ocurrence of the phrase jasmine tea within the content. If there are other instances of the keyword jasmine tea within the text what happens to jasmine tea.";
    $d = new DOMDocument();
    @$d->loadHTML($content);
    $x = new DOMXpath($d);
    // translate() lowercases the text, so compare against a lowercased keyword
    $needle = strtolower($keyword);
    // Do nothing if the keyword already appears inside <b> or <strong>
    $count = $x->evaluate("count(//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$needle') and (ancestor::b or ancestor::strong)])");
    if ($count > 0) return $postarray;
    $nodes = $x->query("//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$needle') and not(ancestor::h1) and not(ancestor::h2) and not(ancestor::h3) and not(ancestor::h4) and not(ancestor::h5) and not(ancestor::h6) and not(ancestor::b) and not(ancestor::strong)]");
    if ($nodes && $nodes->length) {
        $node = $nodes->item(0);
        // Split just before the keyword (case-insensitive, to match the XPath test)
        $keynode = $node->splitText(stripos($node->textContent, $keyword));
        // Split after the keyword
        $node->nextSibling->splitText(strlen($keyword));
        // Replace keyword with <strong>keyword</strong>
        $replacement = $d->createElement('strong', $keynode->textContent);
        $keynode->parentNode->replaceChild($replacement, $keynode);
    }
    $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));
    // $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->item(1));
    // $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->childNodes);
    return $postarray;
}
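The truncation can be reproduced in isolation. The following is a minimal sketch, assuming a libxml build that behaves like the one described in the question: the parser wraps the leading loose text in an implied <p>, and the <h1> then closes that paragraph, so the first <p> only holds the text before the heading. The sample string is a shortened, hypothetical version of the question's content.

```php
<?php
// Shortened, hypothetical version of the question's $content.
$content = 'If you like <h1>jasmine tea</h1> you will really like it.';

$d = new DOMDocument();
$previous = libxml_use_internal_errors(true); // collect parser warnings quietly
$d->loadHTML($content);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Only the first implied paragraph is serialized; the heading and the
// text after it are siblings of that <p>, not children of it.
$firstParagraph = $d->saveXML($d->getElementsByTagName('p')->item(0));
echo $firstParagraph, "\n";
```

Serializing the children of <body> instead of the first <p> avoids the problem, which is what most of the answers below do in one way or another.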
Answered by Alessandro Vendruscolo
All of these answers are now wrong, because as of PHP 5.4 and Libxml 2.6, loadHTML now has an $options parameter which instructs Libxml about how it should parse the content.
Therefore, if we load the HTML with these options
$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
when doing saveHTML() there will be no doctype, no <html>, and no <body>.
LIBXML_HTML_NOIMPLIED turns off the automatic adding of implied html/body elements.
LIBXML_HTML_NODEFDTD prevents a default doctype being added when one is not found.
Full documentation about Libxml parameters is here
(Note that the loadHTML docs say that Libxml 2.6 is needed, but LIBXML_HTML_NODEFDTD is only available in Libxml 2.7.8 and LIBXML_HTML_NOIMPLIED is available in Libxml 2.7.7.)
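As a rough illustration of the two flags together, here is a minimal sketch. It assumes a libxml version that supports both constants (see the version note above) and uses a hypothetical input that already has a single root element, which is the case this approach handles most predictably.

```php
<?php
// Hypothetical input with one root element.
$content = '<div><p>Hello</p><p>World</p></div>';

$doc = new DOMDocument();
// NOIMPLIED: do not add <html>/<body>; NODEFDTD: do not add a default doctype.
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$output = $doc->saveHTML();
echo $output;
```

The serialized string should contain the <div> and its children only, with no doctype, <html> or <body> around it.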
Answered by Alex
Just remove the nodes directly after loading the document with loadHTML():
# remove <!DOCTYPE
$doc->removeChild($doc->doctype);
# remove <html><body></body></html>
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
Answered by Jonah
Use saveXML() instead, and pass the documentElement as an argument to it.
$innerHTML = '';
foreach ($document->getElementsByTagName('p')->item(0)->childNodes as $child) {
$innerHTML .= $document->saveXML($child);
}
echo $innerHTML;
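The same loop can be applied to <body> rather than to the first <p>, so that block-level siblings are kept. This is only a sketch: innerHtml() is a hypothetical helper name, and it uses saveHTML() with a node argument (available since PHP 5.3.6) instead of saveXML() so that the output stays HTML-flavoured.

```php
<?php
// Hypothetical helper: concatenate the serialized children of one node.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes just that child and its subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$document = new DOMDocument();
$document->loadHTML('<p>If you like</p><h1>jasmine tea</h1><p>you will like this</p>');

$body = $document->getElementsByTagName('body')->item(0);
$inner = innerHtml($body);
echo $inner;
```

Whitespace between the serialized elements may differ from the input, so the result is best compared structurally rather than byte for byte.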
Answered by Super Cat
The issue with the top answer is that LIBXML_HTML_NOIMPLIED is unstable.
It can reorder elements (particularly, moving the top element's closing tag to the bottom of the document), add random p tags, and perhaps cause a variety of other issues[1]. It may remove the html and body tags for you, but at the cost of unstable behavior. In production, that's a red flag. In short:
Don't use LIBXML_HTML_NOIMPLIED. Instead, use substr.
Think about it. The lengths of <html><body> and </body></html> are fixed and at both ends of the document - their sizes never change, and neither do their positions. This allows us to use substr to cut them away:
$dom = new domDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
echo substr($dom->saveHTML(), 12, -15); // the star of this operation
(THIS IS NOT THE FINAL SOLUTION HOWEVER! See below for the complete answer, keep reading for context)
We cut 12 away from the start of the document because <html><body> = 12 characters (<<>>+html+body = 4+4+4), and we go backwards and cut 15 off the end because \n</body></html> = 15 characters (\n+//+<<>>+body+html = 1 + 2 + 4 + 4 + 4).
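The arithmetic can be checked directly with strlen(). In the sketch below the $saved string is hand-written to mimic the layout described above rather than produced by the parser, and the newline is assumed to be a single "\n"; where exactly it sits does not change the count.

```php
<?php
$prefix = '<html><body>';
$suffix = "</body></html>\n"; // closing tags plus one newline

$prefixLength = strlen($prefix);
$suffixLength = strlen($suffix);
echo $prefixLength, ' ', $suffixLength, "\n"; // prints "12 15"

// Hypothetical saveHTML() result, written by hand for illustration.
$saved = $prefix . '<p>Hello</p>' . $suffix;
$inner = substr($saved, $prefixLength, -$suffixLength);
echo $inner, "\n"; // prints "<p>Hello</p>"
```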
Notice that I still use LIBXML_HTML_NODEFDTD to keep the !DOCTYPE from being included. First, this simplifies the substr removal of the HTML/BODY tags. Second, we don't remove the doctype with substr because we don't know if the 'default doctype' will always be something of a fixed length. But, most importantly, LIBXML_HTML_NODEFDTD stops the DOM parser from applying a non-HTML5 doctype to the document - which at least prevents the parser from treating elements it doesn't recognize as loose text.
We know for a fact that the HTML/BODY tags are of fixed lengths and positions, and we know that constants like LIBXML_HTML_NODEFDTD are never removed without some type of deprecation notice, so the above method should roll well into the future, BUT...
...the only caveat is that the DOM implementation could change the way in which HTML/BODY tags are placed within the document - for instance, removing the newline at the end of the document, adding spaces between the tags, or adding newlines.
This can be remedied by searching for the positions of the opening and closing tags for body, and using those offsets as our lengths to trim off. We use strpos and strrpos to find the offsets from the front and back, respectively:
$dom = new domDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$trim_off_front = strpos($dom->saveHTML(),'<body>') + 6;
// PositionOf<body> + 6 = Cutoff offset after '<body>'
// 6 = Length of '<body>'
$trim_off_end = (strrpos($dom->saveHTML(),'</body>')) - strlen($dom->saveHTML());
// ^ PositionOf</body> - LengthOfDocument = Relative-negative cutoff offset before '</body>'
echo substr($dom->saveHTML(), $trim_off_front, $trim_off_end);
In closing, a repeat of the final, future-proof answer:
$dom = new domDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$trim_off_front = strpos($dom->saveHTML(),'<body>') + 6;
$trim_off_end = (strrpos($dom->saveHTML(),'</body>')) - strlen($dom->saveHTML());
echo substr($dom->saveHTML(), $trim_off_front, $trim_off_end);
No doctype, no html tag, no body tag. We can only hope the DOM parser will receive a fresh coat of paint soon and we can more directly eliminate these unwanted tags.
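For reuse, the strpos/strrpos approach could be wrapped in a function. The following is a sketch only: bodyInnerHtml() is a hypothetical name, the document is serialized once instead of several times, and the guard for a missing <body> is an addition that is not part of the answer above.

```php
<?php
// Hypothetical helper wrapping the strpos/strrpos trimming approach.
function bodyInnerHtml(string $html): string
{
    $dom = new DOMDocument();
    // Keep parser warnings about imperfect markup out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $saved = $dom->saveHTML(); // serialize once and reuse the string
    $start = strpos($saved, '<body>');
    $end   = strrpos($saved, '</body>');
    if ($start === false || $end === false) {
        return ''; // no <body> pair in the serialized output
    }
    $start += strlen('<body>');
    return substr($saved, $start, $end - $start);
}

$inner = bodyInnerHtml('<p>Hello</p><p>World</p>');
echo $inner;
```

The result may still contain the newlines that the serializer places around block elements; trim() can be applied if those are unwanted.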
Answered by lonesomeday
A neat trick is to use loadXML and then saveHTML. The html and body tags are inserted at the load stage, not the save stage.
$dom = new DOMDocument;
$dom->loadXML('<p>My DOMDocument contents are here</p>');
echo $dom->saveHTML();
NB that this is a bit hacky and you should use Jonah's answer if you can get it to work.
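Part of what makes the trick hacky is that loadXML only accepts well-formed XML with a single root element. The sketch below, with hypothetical sample strings, shows a fragment that loads and one that is rejected because <br> is not closed.

```php
<?php
$previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings

$dom = new DOMDocument();
$wellFormed = $dom->loadXML('<p>My DOMDocument contents are here</p>');
$saved = $wellFormed ? $dom->saveHTML() : '';

$strict = new DOMDocument();
// <br> is valid HTML but not well-formed XML, so the XML parser rejects it.
$rejected = $strict->loadXML('<p>line one<br>line two</p>');

libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($wellFormed); // bool(true)
var_dump($rejected);   // bool(false)
echo $saved;
```

Checking the return value of loadXML before saving avoids silently emitting an empty document for typical hand-written HTML.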
Answered by jcp
Use DOMDocumentFragment:
$html = 'what you want';
$doc = new DomDocument();
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($html);
$doc->appendChild($fragment);
echo $doc->saveHTML();
Answered by Vixxs
It's 2017, and for this 2011 Question I don't like any of the answers. Lots of regex, big classes, loadXML etc...
Easy solution which solves the known problems:
$dom = new DOMDocument();
$dom->loadHTML( '<html><body>'.mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8').'</body></html>' , LIBXML_HTML_NODEFDTD);
$html = substr(trim($dom->saveHTML()),12,-14);
Easy, simple, solid, fast. This code handles HTML tags and encodings such as:
$html = '<p>??ü</p><p>?</p>';
If anybody finds an error, please tell me; I will use this myself.
Edit: other valid options that work without errors (very similar to ones already given):
@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$saved_dom = trim($dom->saveHTML());
$start_dom = stripos($saved_dom,'<body>')+6;
$html = substr($saved_dom,$start_dom,strripos($saved_dom,'</body>') - $start_dom );
You could add the body yourself to prevent anything strange in the future.
Third option:
$mock = new DOMDocument;
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
$mock->appendChild($mock->importNode($child, true));
}
$html = trim($mock->saveHTML());
Answered by hakre
I'm a bit late to the club but didn't want to not share a method I've found out about. First of all, I've got the right versions for loadHTML() to accept these nice options, but LIBXML_HTML_NOIMPLIED didn't work on my system. Also, users report problems with the parser (for example here and here).
The solution I created actually is pretty simple.
HTML to be loaded is put in a <div> element so it has a container containing all nodes to be loaded.
Then this container element is removed from the document (but its DOMElement still exists).
Then all direct children from the document are removed. This includes any added <html>, <head> and <body> tags (effectively the LIBXML_HTML_NOIMPLIED option) as well as the <!DOCTYPE html ... loose.dtd"> declaration (effectively LIBXML_HTML_NODEFDTD).
Then all direct children of the container are added to the document again and it can be output.
$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';
$doc = new DOMDocument();
$doc->loadHTML("<div>$str</div>");
$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);
while ($doc->firstChild) {
$doc->removeChild($doc->firstChild);
}
while ($container->firstChild ) {
$doc->appendChild($container->firstChild);
}
$htmlFragment = $doc->saveHTML();
XPath works as usual, just take care that there are multiple document elements now, so not a single root node:
$xpath = new DOMXPath($doc);
foreach ($xpath->query('/p') as $element)
{   # ^- note the single slash "/" in the query
    # ... each of the two <p> elements
}
- PHP 5.4.36-1+deb.sury.org~precise+2 (cli) (built: Dec 21 2014 20:28:53)
Answered by plowman
None of the other solutions at the time of this writing (June, 2012) were able to completely meet my needs, so I wrote one which handles the following cases:
- Accepts plain-text content which has no tags, as well as HTML content.
- Does not append any tags (including <doctype>, <xml>, <html>, <body>, and <p> tags).
- Leaves anything wrapped in <p> alone.
- Leaves empty text alone.
So here is a solution which fixes those issues:
class DOMDocumentWorkaround
{
/**
* Convert a string which may have HTML components into a DOMDocument instance.
*
* @param string $html - The HTML text to turn into a DOMDocument.
* @return \DOMDocument - A DOMDocument created from the given html.
*/
public static function getDomDocumentFromHtml($html)
{
$domDocument = new DOMDocument();
// Wrap the HTML in <div> tags because loadXML expects everything to be within some kind of tag.
// LIBXML_NOERROR and LIBXML_NOWARNING mean this will fail silently and return an empty DOMDocument if it fails.
$domDocument->loadXML('<div>' . $html . '</div>', LIBXML_NOERROR | LIBXML_NOWARNING);
return $domDocument;
}
/**
* Convert a DOMDocument back into an HTML string, which is reasonably close to what we started with.
*
* @param \DOMDocument $domDocument
* @return string - The resulting HTML string
*/
public static function getHtmlFromDomDocument($domDocument)
{
// Convert the DOMDocument back to a string.
$xml = $domDocument->saveXML();
// Strip out the XML declaration, if one exists
$xmlDeclaration = "<?xml version=\"1.0\"?>\n";
if (substr($xml, 0, strlen($xmlDeclaration)) == $xmlDeclaration) {
$xml = substr($xml, strlen($xmlDeclaration));
}
// If the original HTML was empty, loadXML collapses our <div></div> into <div/>. Remove it.
if ($xml == "<div/>\n") {
$xml = '';
}
else {
// Remove the opening <div> tag we previously added, if it exists.
$openDivTag = "<div>";
if (substr($xml, 0, strlen($openDivTag)) == $openDivTag) {
$xml = substr($xml, strlen($openDivTag));
}
// Remove the closing </div> tag we previously added, if it exists.
$closeDivTag = "</div>\n";
$closeChunk = substr($xml, -strlen($closeDivTag));
if ($closeChunk == $closeDivTag) {
$xml = substr($xml, 0, -strlen($closeDivTag));
}
}
return $xml;
}
}
I also wrote some tests which would live in that same class:
public static function testHtmlToDomConversions($content)
{
// test that converting the $content to a DOMDocument and back does not change the HTML
if ($content !== self::getHtmlFromDomDocument(self::getDomDocumentFromHtml($content))) {
echo "Failed\n";
}
else {
echo "Succeeded\n";
}
}
public static function testAll()
{
self::testHtmlToDomConversions('<p>Here is some sample text</p>');
self::testHtmlToDomConversions('<div>Lots of <div>nested <div>divs</div></div></div>');
self::testHtmlToDomConversions('Normal Text');
self::testHtmlToDomConversions(''); //empty
}
You can check that it works for yourself. DOMDocumentWorkaround::testAll() prints this:
Succeeded
Succeeded
Succeeded
Succeeded
Answered by rclai
Okay I found a more elegant solution, but it's just tedious:
$d = new DOMDocument();
@$d->loadHTML($yourcontent);
...
// do your manipulation, processing, etc of it blah blah blah
...
// then to save, do this
$x = new DOMXPath($d);
$everything = $x->query("body/*"); // retrieves all elements inside body tag
if ($everything->length > 0) { // check if it retrieved anything in there
$output = '';
foreach ($everything as $thing) {
$output .= $d->saveXML($thing);
}
echo $output; // voila, no more annoying html wrappers or body tag
}
Alright, hopefully this does not omit anything and helps somebody?
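One thing an element-only query such as body/* can omit is loose text sitting directly inside <body>, because * matches elements only. A possible variant, sketched here with a hypothetical input string, selects node() instead and uses an absolute path so that it does not depend on the default context node.

```php
<?php
$d = new DOMDocument();
$previous = libxml_use_internal_errors(true); // keep parser warnings quiet
$d->loadHTML('<h1>jasmine tea</h1> you will really like it <p>with flavors</p>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

$x = new DOMXPath($d);
$output = '';
// node() matches text nodes as well as elements.
foreach ($x->query('/html/body/node()') as $node) {
    $output .= $d->saveHTML($node);
}
echo $output;
```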