JavaScript: How to close unclosed HTML tags?
Original question: http://stackoverflow.com/questions/3059398/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to close unclosed HTML Tags?
Asked by Starx
Whenever we fetch user-entered content from a database or a similar source and apply some editing to it, we may end up with a portion that contains an opening tag but no closing tag.
This can break the website's current layout.
Is there a client-side or server-side way of fixing this?
Accepted answer by KJS
Found a great answer for this one:
Use PHP 5 and the loadHTML() method of the DOMDocument object. It automatically parses badly formed HTML, and a subsequent call to saveXML() (or saveHTML(), as in the usage below) outputs valid markup. The DOM functions are documented in the PHP manual.
Usage:
$doc = new DOMDocument();
$doc->loadHTML($yourText);
$yourText = $doc->saveHTML();
Answered by Gordon
You can use Tidy:
Tidy is a binding for the Tidy HTML clean and repair utility which allows you to not only clean and otherwise manipulate HTML documents, but also traverse the document tree.
or HTML Purifier:
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
Answered by kamal
I have a solution for PHP:
<?php
// Close opened HTML tags by appending the missing closing tags to the end of the string.
// Limitations: tag names containing digits (h1 to h6) are not matched, and
// void elements such as <br> are treated as unclosed.
function closetags($html)
{
    // Put all opened tags into an array
    preg_match_all("#<([a-z]+)( .*)?(?!/)>#iU", $html, $result);
    $openedtags = $result[1];
    // Put all closed tags into an array
    preg_match_all("#</([a-z]+)>#iU", $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    // Same number of opening and closing tags: assume everything is closed
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    // Walk the opening tags from last to first
    $openedtags = array_reverse($openedtags);
    for ($i = 0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            // No closing tag left for this name: append one
            $html .= "</" . $openedtags[$i] . ">";
        } else {
            // Consume one matching closing tag
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
}
?>
You can use this function like this:
<?php echo closetags("your content <p>test test"); ?>
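The regex approach above only counts tag names, so it cannot tell which particular element was left open. A minimal sketch of a stack-based variant follows, written in JavaScript so that it runs in a browser or under Node.js. The names closeUnclosedTags and VOID_ELEMENTS are illustrative and do not come from any answer here. It is an approximation rather than a real HTML parser: comments, script contents, and a ">" inside an attribute value are not handled.

```javascript
// Hypothetical helper: appends closing tags for elements that are still open
// at the end of the fragment, innermost first.
// Void elements never take a closing tag, so they are never pushed on the stack.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
  'link', 'meta', 'source', 'track', 'wbr',
]);

function closeUnclosedTags(html) {
  // Group 1: "/" for a closing tag, group 2: tag name, group 3: "/" if self-closing.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  const open = []; // stack of currently open tag names
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === '/';
    const name = match[2].toLowerCase();
    const isSelfClosing = match[3] === '/';
    if (VOID_ELEMENTS.has(name) || isSelfClosing) {
      continue;
    }
    if (!isClosing) {
      open.push(name);
      continue;
    }
    // Closing tag: pop back to the nearest open element with the same name.
    // Elements opened inside it are treated as implicitly closed (a simplification).
    const index = open.lastIndexOf(name);
    if (index !== -1) {
      open.splice(index);
    }
  }
  // Whatever is still on the stack was never closed.
  const closers = open.reverse().map((name) => '</' + name + '>').join('');
  return html + closers;
}

console.log(closeUnclosedTags('your content <p>test test'));
```

Because the stack records nesting, several open elements are closed innermost first, and a trailing `<br>` does not receive a closing tag.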
Answered by Arth
For HTML fragments, and working from KJS's answer, I have had success with the following when the fragment has one root element:
$dom = new DOMDocument();
$dom->loadHTML($string);
$body = $dom->documentElement->firstChild->firstChild;
$string = $dom->saveHTML($body);
Without a single root element, the following works (although, for input such as text <p>para</p> text, it seems to wrap only the first text node in p tags):
$dom = new DOMDocument();
$dom->loadHTML($string);
$bodyChildNodes = $dom->documentElement->firstChild->childNodes;
$string = '';
foreach ($bodyChildNodes as $node) {
    $string .= $dom->saveHTML($node);
}
Or better yet, with PHP >= 5.4 and libxml >= 2.7.8 (2.7.7 for LIBXML_HTML_NOIMPLIED):
$dom = new DOMDocument();
// Load with no html/body tags and do not add a default dtd
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$string = $dom->saveHTML();
Answered by Andrew
In addition to server-side tools like Tidy, you can also use the user's browser to do some of the cleanup for you. One of the really great things about innerHTML is that it applies the same on-the-fly repair to dynamic content as it does to HTML pages. This code works pretty well (with two caveats), and nothing actually gets written to the page:
var divTemp = document.createElement('div');
divTemp.innerHTML = '<p id="myPara">these <i>tags aren\'t <strong> closed';
console.log(divTemp.innerHTML);
The caveats:
Different browsers will return different strings. This isn't so bad, except in the case of IE, which returns capitalized tags and strips the quotes from tag attributes, so its output will not pass validation. The solution here is to do some simple clean-up on the server side. But at least the document will be properly structured XML.
I suspect that you may have to put in a delay before reading the innerHTML (to give the browser a chance to digest the string), or you risk getting back exactly what was put in. I just tried on IE8 and it looks like the string gets parsed immediately, but I'm not so sure about IE6. It would probably be best to read the innerHTML after a delay (or throw it into a setTimeout() to force it to the end of the queue).
I would recommend you take @Gordon's advice and use Tidy if you have access to it (it takes less work to implement). Failing that, use innerHTML and write your own tidy function in PHP.
And though this isn't part of your question, as this is for a CMS, consider also using the YUI 2 Rich Text Editor for stuff like this. It's fairly easy to implement, somewhat easy to customize, the interface is very familiar to most users, and it spits out perfectly valid code. There are several other off-the-shelf rich text editors out there, but YUI has the best license and is the most powerful I've seen.
Answered by Marcus
A better PHP function that deletes unopened/unclosed tags, from webmaster-glossar.de (written by me):
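function closetag($html) {
    $html_new = $html;
    // Collect the names of all opening and all closing tags
    preg_match_all("#<([a-z]+)( .*)?(?!/)>#iU", $html, $result1);
    preg_match_all("#</([a-z]+)>#iU", $html, $result2);
    $results_start = $result1[1];
    $results_end = $result2[1];
    // Remove opening tags whose name never appears as a closing tag
    // (only attribute-less tags such as <strike> are removed)
    foreach ($results_start as $startag) {
        if (!in_array($startag, $results_end)) {
            $html_new = str_replace('<' . $startag . '>', '', $html_new);
        }
    }
    // Remove closing tags whose name never appears as an opening tag
    foreach ($results_end as $endtag) {
        if (!in_array($endtag, $results_start)) {
            $html_new = str_replace('</' . $endtag . '>', '', $html_new);
        }
    }
    return $html_new;
}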
Use this function like this:
closetag('i <b>love</b> my <strike>cat');
#output: i <b>love</b> my cat
closetag('i <b>love</b> my cat</strike>');
#output: i <b>love</b> my cat
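The function above compares tag names only, so an unclosed `<b class="x">` is left in place when some other `<b>` in the string is properly closed, and tags with attributes are never stripped. A possible stack-based sketch of the same idea (removing unmatched tags instead of closing them), in JavaScript, is shown below. The name stripUnmatchedTags is hypothetical, and the same simplifications apply as for any regex-based tag scanner: no handling of comments, scripts, or ">" inside attribute values.

```javascript
// Hypothetical helper: deletes opening tags that are never closed and
// closing tags that were never opened, matching tags by position.
function stripUnmatchedTags(html) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;
  const voidNames = new Set([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
    'link', 'meta', 'source', 'track', 'wbr',
  ]);
  const open = [];   // stack of { name, start, end } for open tags
  const doomed = []; // ranges of tags that will be removed
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const name = match[2].toLowerCase();
    const range = { name: name, start: match.index, end: tagPattern.lastIndex };
    if (voidNames.has(name) || match[0].endsWith('/>')) {
      continue; // void or self-closing: needs no partner
    }
    if (match[1] === '') {
      open.push(range);
      continue;
    }
    // Closing tag: look for the nearest open tag with the same name.
    let i = open.length - 1;
    while (i >= 0 && open[i].name !== name) {
      i--;
    }
    if (i === -1) {
      doomed.push(range); // never opened
    } else {
      doomed.push(...open.splice(i + 1)); // unclosed tags nested inside
      open.pop();                         // the matched opening tag
    }
  }
  doomed.push(...open); // opened but never closed
  // Cut from the end of the string so earlier offsets stay valid.
  doomed.sort((a, b) => b.start - a.start);
  let result = html;
  for (const { start, end } of doomed) {
    result = result.slice(0, start) + result.slice(end);
  }
  return result;
}

console.log(stripUnmatchedTags('i <b>love</b> my <strike>cat'));
```

For the two usage examples above this returns the same strings as closetag(); the difference only shows up with repeated tag names or attributes.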
Answered by Luke Madhanga
I used the native DOMDocument method, but with a few improvements for safety.
Note that other answers that use DOMDocument do not account for HTML fragments that begin with a text node, such as:
This is a <em>HTML</em> strand
The above will actually result in:
<p>This is a <em>HTML</em> strand
My solution is below:
function closeDanglingTags($html) {
    // strpos() returns 0 for a match at the start of the string, so compare against false explicitly
    if (strpos($html, '<') !== false || strpos($html, '>') !== false) {
        // There are definitely HTML tags
        $wrapped = false;
        if (strpos(trim($html), '<') !== 0) {
            // The HTML starts with a text node. Wrap it in an element with an id to prevent the parser wrapping it in a <p>
            // that we know nothing about and cannot safely retrieve.
            // cHE::getDivHtml() is the author's own helper (not shown here); presumably it wraps the content in a <div> with the given id.
            $html = cHE::getDivHtml($html, null, 'closedanglingtagswrapper');
            $wrapped = true;
        }
        $doc = new DOMDocument();
        $doc->encoding = 'utf-8';
        @$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
        if ($doc->firstChild) {
            // Test whether the first child is definitely a DOMDocumentType
            if ($doc->firstChild instanceof DOMDocumentType) {
                // Remove the added doctype
                $doc->removeChild($doc->firstChild);
            }
        }
        if ($wrapped) {
            // The contents originally started with a text node and were wrapped in div#closedanglingtagswrapper.
            // Take the contents out of that div.
            $node = $doc->getElementById('closedanglingtagswrapper');
            $children = $node->childNodes; // The contents of the div. Equivalent to $('selector').children()
            $doc = new DOMDocument(); // Create a new document to add the contents to, equiv. to "var doc = $('<html></html>');"
            foreach ($children as $childnode) {
                $doc->appendChild($doc->importNode($childnode, true)); // E.g. doc.append()
            }
        }
        // Remove the added html and body tags
        return trim(str_replace(array('<html><body>', '</body></html>'), '', html_entity_decode($doc->saveHTML())));
    } else {
        return $html;
    }
}
Answered by Robert
Erik Arvidsson wrote a nice HTML SAX parser in 2004: http://erik.eae.net/archives/2004/11/20/12.18.31/
It keeps track of the open tags, so with a minimalistic SAX handler it's possible to insert closing tags at the correct position:
function tidyHTML(html) {
    var output = '';
    HTMLParser(html, {
        comment: function(text) {
            // filter html comments
        },
        chars: function(text) {
            output += text;
        },
        start: function(tagName, attrs, unary) {
            output += '<' + tagName;
            for (var i = 0; i < attrs.length; i++) {
                output += ' ' + attrs[i].name + '=';
                if (attrs[i].value.indexOf('"') === -1) {
                    output += '"' + attrs[i].value + '"';
                } else if (attrs[i].value.indexOf('\'') === -1) {
                    output += '\'' + attrs[i].value + '\'';
                } else { // value contains " and ' so it cannot contain spaces
                    output += attrs[i].value;
                }
            }
            output += '>';
        },
        end: function(tagName) {
            output += '</' + tagName + '>';
        }
    });
    return output;
}
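In the start handler above, a value that contains both quote characters is written out without quotes, which only works if the value contains no spaces. One possible alternative is to always use double quotes and replace embedded double quotes with `&quot;`. The sketch below assumes that attrs[i].value holds the raw attribute text exactly as it appeared in the source, so existing entities are left untouched. The name serializeAttribute is hypothetical and is not part of the parser.

```javascript
// Hypothetical helper: serializes one attribute with double quotes.
// Assumes `value` is the raw attribute text, so "&" is left alone to avoid
// double-escaping entities that are already present.
function serializeAttribute(name, value) {
  var escaped = String(value).replace(/"/g, '&quot;');
  return ' ' + name + '="' + escaped + '"';
}

console.log(serializeAttribute('title', 'say "hi", it\'s fine'));
```

Inside tidyHTML, the three-way branch in the loop could then become `output += serializeAttribute(attrs[i].name, attrs[i].value);`.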

