PHP: Get contents of BODY without DOCTYPE, HTML, HEAD and BODY tags
Original question: http://stackoverflow.com/questions/11254619/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Get contents of BODY without DOCTYPE, HTML, HEAD and BODY tags
Asked by enrico pax
What I am trying to do is include an HTML file within a PHP system (not a problem), but that HTML file also needs to be usable on its own, for various reasons. So I need to know how I can strip the doctype, html, head and body tags in the context of the PHP include, if that's possible.
I'm not particularly good at PHP (doh!), so my searches of the PHP manual and the web haven't helped me figure this out. Any help or reading tips, or both, are much appreciated.
Answer by Jared Farrish
Since the substr() method seemed to be too much for some to swallow, here is a DOM parser method:
$d = new DOMDocument;
$mock = new DOMDocument;
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
// Copy each child of <body> into a second, empty document
foreach ($body->childNodes as $child) {
    $mock->appendChild($mock->importNode($child, true));
}
echo $mock->saveHTML();
Anybody who wishes to see that "other one" can check the revisions.
Answer by Patrick
$site = file_get_contents("http://www.google.com/");
preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches);
echo($matches[1]);
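The regular expression above leaves `$matches[1]` undefined when the input has no `<body>` element. A minimal sketch of the same pattern wrapped in a guard follows; the function name `extractBodyWithRegex` and the sample markup are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): returns the markup
// between <body ...> and </body>, or null when no body element is found.
function extractBodyWithRegex(string $html): ?string
{
    // "i" makes the match case-insensitive, "s" lets "." match newlines.
    if (preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Sample markup, assumed for illustration only.
$page = "<!DOCTYPE html><html><head><title>t</title></head>"
      . "<body class=\"x\">\n<p>Hi</p>\n</body></html>";

echo extractBodyWithRegex($page) ?? 'no <body> found';
```

Returning null rather than an empty string lets the caller tell "no body element" apart from "empty body".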
Answer by Ja͢ck
Use DOMDocument to keep what you need rather than strip what you don't need (PHP >= 5.3.6):
$d = new DOMDocument;
$d->loadHTMLFile($fileLocation);
$body = $d->getElementsByTagName('body')->item(0);
// perform innerhtml on $body by enumerating child nodes
// and saving them individually
foreach ($body->childNodes as $childNode) {
    echo $d->saveHTML($childNode);
}
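When the result has to be embedded in a larger PHP template, it can be handier to get the body markup back as a string instead of echoing it. The following is a sketch under stated assumptions: the function name `bodyInnerHtml` is hypothetical, it takes an HTML string rather than a file path, and it uses `libxml_use_internal_errors()` to keep parser warnings about imperfect markup out of the output.

```php
<?php
// Hypothetical helper (not from the original answers): returns the inner
// HTML of <body> as a string so the caller decides where to print it.
function bodyInnerHtml(string $html): string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> individually (needs PHP >= 5.3.6).
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Sample markup, assumed for illustration only.
$sample = '<!DOCTYPE html><html><head><title>t</title></head>'
        . '<body><p>Hello</p><p>World</p></body></html>';

echo bodyInnerHtml($sample);
```

For a file on disk, the same function could be called as `bodyInnerHtml(file_get_contents($fileLocation))`.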
Answer by tobyodavies
Use a DOM parser. This is not tested, but it ought to do what you want:
$domDoc = new DOMDocument();
$domDoc->loadHTMLFile('/path/to/file');
$body = $domDoc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    // Note: C14N() canonicalizes the node's representation, but that's not necessarily a bad thing
    echo $child->C14N();
}
If you want to avoid canonicalization, you can use this version (thanks to @Jared Farrish)
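The alternative version referred to here is not reproduced in this copy. As a rough illustration of what canonicalization changes, the sketch below serializes the same nodes once with `C14N()` and once with `DOMDocument::saveHTML($node)` (PHP >= 5.3.6); using `saveHTML()` as the non-canonicalizing route is an assumption, not necessarily what the original answer showed, and the sample markup is made up for the demonstration.

```php
<?php
// Sample markup, assumed for illustration only.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>line one<br>line two</p></body></html>');
$body = $doc->getElementsByTagName('body')->item(0);

$canonical = '';
$plain = '';
foreach ($body->childNodes as $child) {
    // Canonical XML: empty elements get an explicit closing tag.
    $canonical .= $child->C14N();
    // HTML serialization: void elements such as <br> stay as they are.
    $plain .= $doc->saveHTML($child);
}

echo $canonical, "\n", $plain, "\n";
```

Canonical XML has no shorthand for empty elements, so the `<br>` comes out as `<br></br>` in the first string, which browsers may interpret differently from a single `<br>`.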
Answer by lubosdz
You may want to use the PHP tidy extension, which can fix invalid XHTML structures (in which case loading into DOMDocument fails) and also extract the body only:
$tidy = new tidy();
$htmlBody = $tidy->repairString($html, array(
    'output-xhtml' => true,
    'show-body-only' => true,
), 'utf8');
Then load the extracted body into DOMDocument:
$xml = new DOMDocument();
$xml->loadHTML($htmlBody);
Then traverse, extract, or move around XML nodes as needed, and save:
$output = $xml->saveXML();
Answer by Luca Vizzi
A solution with only one instance of DOMDocument and without loops:
$d = new DOMDocument();
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
// Note: passing the <body> node itself serializes the element including its <body>...</body> tags
echo $d->saveHTML($body);
Answer by Luca Vizzi
This may be a solution. I tried it and it works fine.
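function parseHTML(string) {
    // Browser JavaScript (DOMParser), not PHP
    var parser = new DOMParser(),
        result = parser.parseFromString(string, "text/html");
    // Assumes the input has no doctype, so firstChild is <html>;
    // returns the first child node of <body>
    return result.firstChild.lastChild.firstChild;
}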

