原文地址: http://stackoverflow.com/questions/7474710/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Can I load an entire HTML document into a document fragment in Internet Explorer?
Asked by Andy E
Here's something I've been having a little bit of difficulty with. I have a local client-side script that needs to allow a user to fetch a remote web page and search that resulting page for forms. In order to do this (without regex), I need to parse the document into a fully traversable DOM object.
Some limitations I'd like to stress:
- I don't want to use libraries (like jQuery). There's too much bloat for what I need to do here.
- Under no circumstances should scripts from the remote page be executed (for security reasons).
- DOM APIs, such as getElementsByTagName, need to be available.
- It only needs to work in Internet Explorer, but in 7 at the very least.
- Let's pretend I don't have access to a server. I do, but I can't use it for this.
What I've tried
Assuming I have a complete HTML document string (including DOCTYPE declaration) in the variable html, here's what I've tried so far:
var frag = document.createDocumentFragment(),
div = frag.appendChild(document.createElement("div"));
div.outerHTML = html;
//-> results in an empty fragment
div.insertAdjacentHTML("afterEnd", html);
//-> HTML is not added to the fragment
div.innerHTML = html;
//-> Error (expected, but I tried it anyway)
var doc = new ActiveXObject("htmlfile");
doc.write(html);
doc.close();
//-> JavaScript executes
I've also tried extracting the <head> and <body> nodes from the HTML and adding them to an <HTML> element inside the fragment, still no luck.
Does anyone have any ideas?
Answered by Rob W
Fiddle: http://jsfiddle.net/JFSKe/6/
DocumentFragment doesn't implement DOM methods. Using document.createElement in conjunction with innerHTML removes the <head> and <body> tags (even when the created element is a root element, <html>). Therefore, the solution should be sought elsewhere. I have created a cross-browser string-to-DOM function, which makes use of an invisible inline frame.
All external resources and scripts will be disabled. See "Explanation of the code" for more information.
Code
/*
@param String html The string with HTML which has to be converted to a DOM object
@param func callback (optional) Callback(HTMLDocument doc, function destroy)
@returns undefined if callback exists, else: Object
HTMLDocument doc DOM fetched from Parameter:html
function destroy Removes HTMLDocument doc. */
function string2dom(html, callback){
/* Sanitise the string */
html = sanitiseHTML(html); /*Defined at the bottom of the answer*/
/* Create an IFrame */
var iframe = document.createElement("iframe");
iframe.style.display = "none";
document.body.appendChild(iframe);
var doc = iframe.contentDocument || iframe.contentWindow.document;
doc.open();
doc.write(html);
doc.close();
function destroy(){
iframe.parentNode.removeChild(iframe);
}
if(callback) callback(doc, destroy);
else return {"doc": doc, "destroy": destroy};
}
/* @name sanitiseHTML
@param String html A string representing HTML code
@return String A new string, fully stripped of external resources.
All "external" attributes (href, src) are prefixed by data- */
function sanitiseHTML(html){
/* Adds a <!-\"'--> before every matched tag, so that unterminated quotes
aren't preventing the browser from splitting a tag. Test case:
'<input style="foo;b:url(0);><input onclick="<input type=button onclick="too() href=;>">' */
var prefix = "<!--\"'-->";
/*Attributes should not be prefixed by these characters. This list is not
complete, but will be sufficient for this function.
(see http://www.w3.org/TR/REC-xml/#NT-NameChar) */
var att = "[^-a-z0-9:._]";
var tag = "<[a-z]";
var any = "(?:[^<>\"']*(?:\"[^\"]*\"|'[^']*'))*?[^<>]*";
var etag = "(?:>|(?=<))";
/*
@name ae
@description Converts a given string in a sequence of the
original input and the HTML entity
@param String string String to convert
*/
var entityEnd = "(?:;|(?!\\d))";
var ents = {" ":"(?:\\s|&nbsp;?|&#0*32"+entityEnd+"|&#x0*20"+entityEnd+")",
"(":"(?:\\(|&#0*40"+entityEnd+"|&#x0*28"+entityEnd+")",
")":"(?:\\)|&#0*41"+entityEnd+"|&#x0*29"+entityEnd+")",
".":"(?:\\.|&#0*46"+entityEnd+"|&#x0*2e"+entityEnd+")"};
/*Placeholder to avoid tricky filter-circumventing methods*/
var charMap = {};
var s = ents[" "]+"*"; /* Short-hand space */
/* Important: Must be pre- and postfixed by < and >. RE matches a whole tag! */
function ae(string){
var all_chars_lowercase = string.toLowerCase();
if(ents[string]) return ents[string];
var all_chars_uppercase = string.toUpperCase();
var RE_res = "";
for(var i=0; i<string.length; i++){
var char_lowercase = all_chars_lowercase.charAt(i);
if(charMap[char_lowercase]){
RE_res += charMap[char_lowercase];
continue;
}
var char_uppercase = all_chars_uppercase.charAt(i);
var RE_sub = [char_lowercase];
RE_sub.push("&#0*" + char_lowercase.charCodeAt(0) + entityEnd);
RE_sub.push("&#x0*" + char_lowercase.charCodeAt(0).toString(16) + entityEnd);
if(char_lowercase != char_uppercase){
RE_sub.push("&#0*" + char_uppercase.charCodeAt(0) + entityEnd);
RE_sub.push("&#x0*" + char_uppercase.charCodeAt(0).toString(16) + entityEnd);
}
RE_sub = "(?:" + RE_sub.join("|") + ")";
RE_res += (charMap[char_lowercase] = RE_sub);
}
return(ents[string] = RE_res);
}
/*
@name by
@description second argument for the replace function.
*/
function by(match, group1, group2){
/* Adds a data-prefix before every external pointer */
return group1 + "data-" + group2
}
/*
@name cr
@description Selects a HTML element and performs a
search-and-replace on attributes
@param String selector HTML substring to match
@param String attribute RegExp-escaped; HTML element attribute to match
@param String marker Optional RegExp-escaped; marks the prefix
@param String delimiter Optional RegExp escaped; non-quote delimiters
@param String end Optional RegExp-escaped; forces the match to
end before an occurence of <end> when
quotes are missing
*/
function cr(selector, attribute, marker, delimiter, end){
if(typeof selector == "string") selector = new RegExp(selector, "gi");
marker = typeof marker == "string" ? marker : "\\s*=";
delimiter = typeof delimiter == "string" ? delimiter : "";
end = typeof end == "string" ? end : "";
var is_end = end && "?";
var re1 = new RegExp("("+att+")("+attribute+marker+"(?:\\s*\"[^\""+delimiter+"]*\"|\\s*'[^'"+delimiter+"]*'|[^\\s"+delimiter+"]+"+is_end+")"+end+")", "gi");
html = html.replace(selector, function(match){
return prefix + match.replace(re1, by);
});
}
/*
@name cri
@description Selects an attribute of a HTML element, and
performs a search-and-replace on certain values
@param String selector HTML element to match
@param String attribute RegExp-escaped; HTML element attribute to match
@param String front RegExp-escaped; attribute value, prefix to match
@param String flags Optional RegExp flags, default "gi"
@param String delimiter Optional RegExp-escaped; non-quote delimiters
@param String end Optional RegExp-escaped; forces the match to
end before an occurence of <end> when
quotes are missing
*/
function cri(selector, attribute, front, flags, delimiter, end){
if(typeof selector == "string") selector = new RegExp(selector, "gi");
flags = typeof flags == "string" ? flags : "gi";
var re1 = new RegExp("("+att+attribute+"\\s*=)((?:\\s*\"[^\"]*\"|\\s*'[^']*'|[^\\s>]+))", "gi");
end = typeof end == "string" ? end + ")" : ")";
var at1 = new RegExp('(")('+front+'[^"]+")', flags);
var at2 = new RegExp("(')("+front+"[^']+')", flags);
var at3 = new RegExp("()("+front+'(?:"[^"]+"|\'[^\']+\'|(?:(?!'+delimiter+').)+)'+end, flags);
var handleAttr = function(match, g1, g2){
if(g2.charAt(0) == '"') return g1+g2.replace(at1, by);
if(g2.charAt(0) == "'") return g1+g2.replace(at2, by);
return g1+g2.replace(at3, by);
};
html = html.replace(selector, function(match){
return prefix + match.replace(re1, handleAttr);
});
}
/* <meta http-equiv=refresh content=" ; url= " > */
html = html.replace(new RegExp("<meta"+any+att+"http-equiv\\s*=\\s*(?:\""+ae("refresh")+"\""+any+etag+"|'"+ae("refresh")+"'"+any+etag+"|"+ae("refresh")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "gi"), "<!-- meta http-equiv=refresh stripped-->");
/* Stripping all scripts */
html = html.replace(new RegExp("<script"+any+">\\s*//\\s*<\\[CDATA\\[[\\S\\s]*?]]>\\s*</script[^>]*>", "gi"), "<!--CDATA script-->");
html = html.replace(/<script[\S\s]+?<\/script\s*>/gi, "<!--Non-CDATA script-->");
cr(tag+any+att+"on[-a-z0-9:_.]+="+any+etag, "on[-a-z0-9:_.]+"); /* Event listeners */
cr(tag+any+att+"href\\s*="+any+etag, "href"); /* Linked elements */
cr(tag+any+att+"src\\s*="+any+etag, "src"); /* Embedded elements */
cr("<object"+any+att+"data\\s*="+any+etag, "data"); /* <object data= > */
cr("<applet"+any+att+"codebase\\s*="+any+etag, "codebase"); /* <applet codebase= > */
/* <param name=movie value= >*/
cr("<param"+any+att+"name\\s*=\\s*(?:\""+ae("movie")+"\""+any+etag+"|'"+ae("movie")+"'"+any+etag+"|"+ae("movie")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "value");
/* <style> and < style= > url()*/
cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "url", "\\s*\\(\\s*", "", "\\s*\\)");
cri(tag+any+att+"style\\s*="+any+etag, "style", ae("url")+s+ae("(")+s, 0, s+ae(")"), ae(")"));
/* IE7- CSS expression() */
cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "expression", "\\s*\\(\\s*", "", "\\s*\\)");
cri(tag+any+att+"style\\s*="+any+etag, "style", ae("expression")+s+ae("(")+s, 0, s+ae(")"), ae(")"));
return html.replace(new RegExp("(?:"+prefix+")+", "g"), prefix);
}
Explanation of the code
The sanitiseHTML function is based on my replace_all_rel_by_abs function (see this answer). The sanitiseHTML function is completely rewritten though, in order to achieve maximum efficiency and reliability.
Additionally, a new set of RegExps is added to remove all scripts and event handlers (including CSS expression(), IE7-). To make sure that all tags are parsed as expected, the adjusted tags are prefixed by <!--'"-->. This prefix is necessary to correctly parse nested "event handlers" in conjunction with unterminated quotes: <a id="><input onclick="<div onmousemove=evil()>">.
These RegExps are dynamically created using the internal functions cr/cri (Create Replace [Inline]). These functions accept a list of arguments, and create and execute an advanced RE replacement. To make sure that HTML entities aren't breaking a RegExp (refresh in <meta http-equiv=refresh> could be written in various ways), the dynamically created RegExps are partially constructed by function ae (Any Entity).
The actual replacements are done by function by (replace by). In this implementation, by adds data- before all matched attributes.
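As an illustration of the "Any Entity" idea, here is a minimal sketch that builds a pattern in which every character of a word may appear literally or as a decimal or hexadecimal character reference. The names anyEntity and escapeLiteral are hypothetical, and the sketch leaves out the caching and the special cases (space, parentheses, dot) of the answer's ae function.

```javascript
// Escape a single character so it can be used literally inside a RegExp source.
function escapeLiteral(ch) {
  return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical, simplified take on the answer's ae(): returns RegExp source that
// matches `word` with each character written literally or as &#NNN; / &#xHH;.
function anyEntity(word) {
  // A character reference ends with ";" or, leniently, before a non-digit.
  var entityEnd = "(?:;|(?!\\d))";
  var source = "";
  for (var i = 0; i < word.length; i++) {
    var lower = word.charAt(i).toLowerCase();
    var upper = word.charAt(i).toUpperCase();
    var variants = lower === upper ? [lower] : [lower, upper];
    var alternatives = [escapeLiteral(lower)];
    for (var j = 0; j < variants.length; j++) {
      var code = variants[j].charCodeAt(0);
      alternatives.push("&#0*" + code + entityEnd);               // decimal form
      alternatives.push("&#x0*" + code.toString(16) + entityEnd); // hexadecimal form
    }
    source += "(?:" + alternatives.join("|") + ")";
  }
  return source;
}

// The "i" flag covers the literal upper-case spelling.
var refreshPattern = new RegExp("^" + anyEntity("refresh") + "$", "i");
```

With this pattern, refresh, REFRESH and &#114;efresh are all recognised as the same attribute value, which is what the meta refresh filter needs.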
- All <script>//<[CDATA[ .. //]]></script> occurrences are stripped. This step is necessary, because CDATA sections allow </script> strings inside the code. After this replacement has been executed, it's safe to go to the next replacement:
- The remaining <script>...</script> tags are removed.
- The <meta http-equiv=refresh .. > tag is removed.
- All event listeners and external pointers/attributes (href, src, url()) are prefixed by data-, as described previously.
- An IFrame object is created. IFrames are less likely to leak memory (contrary to the htmlfile ActiveXObject). The IFrame becomes invisible, and is appended to the document, so that the DOM can be accessed. document.write() is used to write HTML to the IFrame. document.open() and document.close() are used to empty the previous contents of the document, so that the generated document is an exact copy of the given html string.
- If a callback function has been specified, the function will be called with two arguments. The first argument is a reference to the generated document object. The second argument is a function, which destroys the generated DOM tree when called. This function should be called when you don't need the tree any more.
If the callback function isn't specified, the function returns an object consisting of two properties (doc and destroy), which behave the same as the previously mentioned arguments.
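The order of the two script-stripping passes matters, and a small sketch may make that clearer. The function name stripScripts is hypothetical and the first pattern is simplified compared with the answer's (it uses [^>]* instead of the quote-aware sub-pattern). It also accepts the standard <![CDATA[ marker in addition to the <[CDATA[ spelling matched by the answer's pattern, which is an assumption on my part.

```javascript
// Hypothetical, simplified two-pass script removal.
function stripScripts(html) {
  // Pass 1: scripts whose body is wrapped in a commented-out CDATA section.
  // The body may contain "</script>", so the match must run up to "]]>".
  html = html.replace(
    /<script[^>]*>\s*\/\/\s*<!?\[CDATA\[[\S\s]*?\]\]>\s*<\/script[^>]*>/gi,
    "<!--CDATA script-->"
  );
  // Pass 2: everything else, matched up to the first closing tag.
  html = html.replace(/<script[\S\s]+?<\/script\s*>/gi, "<!--Non-CDATA script-->");
  return html;
}

var cdataSample = '<p>a</p><script>//<![CDATA[\nvar s = "</script>";\n//]]></script><p>b</p>';
var cdataResult = stripScripts(cdataSample);
var plainResult = stripScripts('<script src="x.js"></script><p>c</p>');
```

If only the second pattern were applied to cdataSample, it would stop at the </script> inside the string literal and leave the tail of the script in the markup.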
Additional notes
- Setting the designMode property to "On" will stop a frame from executing scripts (not supported in Chrome). If you have to preserve the <script> tags for a specific reason, you can use iframe.designMode = "On" instead of the script stripping feature.
- I wasn't able to find a reliable source for the htmlfile activeXObject. According to this source, htmlfile is slower than IFrames, and more susceptible to memory leaks.
- All affected attributes (href, src, ...) are prefixed by data-. An example of getting/changing these attributes is shown for data-href: elem.getAttribute("data-href") and elem.setAttribute("data-href", "..."), or elem.dataset.href and elem.dataset.href = "...".
- External resources have been disabled. As a result, the page may look completely different:
<link rel="stylesheet" href="main.css" /> No external styles
<script>document.body.bgColor="red";</script> No scripted styles
<img src="128x128.png" /> No images: the size of the element may be completely different.
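For reading the prefixed attributes back, a small helper can hide the difference between the two access styles mentioned above. This is only a sketch: readDisabledAttribute is a hypothetical name, and as far as I know elem.dataset only exists in Internet Explorer 11, so on the older versions targeted by the question the getAttribute branch is the one that runs.

```javascript
// Hypothetical helper: returns the value of data-<name>, or null if absent.
// Intended for simple names such as "href" or "src" (no hyphens).
function readDisabledAttribute(elem, name) {
  // Prefer dataset where the browser provides it...
  if (elem.dataset && typeof elem.dataset[name] === "string") {
    return elem.dataset[name];
  }
  // ...and fall back to getAttribute, which also works in old IE.
  return elem.getAttribute("data-" + name);
}
```

A typical call would be readDisabledAttribute(form, "action") on a form found in the parsed document, assuming the action attribute had been prefixed as well.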
Examples
sanitiseHTML(html)
Paste this bookmarklet into the location bar. It will offer an option to inject a textarea showing the sanitised HTML string.
javascript:void(function(){var s=document.createElement("script");s.src="http://rob.lekensteyn.nl/html-sanitizer.js";document.body.appendChild(s)})();
Code examples - string2dom(html):
string2dom("<html><head><title>Test</title></head></html>", function(doc, destroy){
alert(doc.title); /* Alert: "Test" */
destroy();
});
var test = string2dom("<div id='secret'></div>");
alert(test.doc.getElementById("secret").tagName); /* Alert: "DIV" */
test.destroy();
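Since destroy() has to be called once the tree is no longer needed, one option is to wrap the parse/use/destroy cycle so the cleanup cannot be forgotten, even if the work throws. The sketch below is an assumption layered on top of the answer, not part of it: withParsedDocument is a hypothetical name, and the parser is passed in as an argument (for example string2dom) so the helper itself has no DOM dependency.

```javascript
// Hypothetical wrapper: parse, run `work` on the document, always clean up.
// `parse` is expected to return an object shaped like { doc: ..., destroy: function }.
function withParsedDocument(parse, html, work) {
  var parsed = parse(html);
  try {
    return work(parsed.doc);
  } finally {
    // Runs whether `work` returned normally or threw.
    parsed.destroy();
  }
}
```

Usage would look like withParsedDocument(string2dom, html, function (doc) { return doc.getElementsByTagName("form").length; }), which returns the number of forms and removes the hidden frame afterwards.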
Notable references
- SO: JS RE to change all relative to absolute URLs - Function sanitiseHTML(html) is based on my previously created replace_all_rel_by_abs(html) function.
- Elements - Embedded content - A full list of standard embedded elements
- Elements - Previous HTML elements - An additional list of (deprecated) elements (such as <applet>)
- The htmlfile ActiveX object - "Slower than iframe sandboxes. Leaks memory if not managed"
Answered by Chris Baker
Not sure why you're messing with documentFragments, you can just set the HTML text as the innerHTML of a new div element. Then you can use that div element for getElementsByTagName etc. without adding the div to the DOM:
var htmlText= '<html><head><title>Test</title></head><body><div id="test_ele1">this is test_ele1 content</div><div id="test_ele2">this is test_ele content2</div></body></html>';
var d = document.createElement('div');
d.innerHTML = htmlText;
console.log(d.getElementsByTagName('div'));
If you're really married to the idea of a documentFragment, you can use this code, but you'll still have to wrap it in a div to get the DOM functions you're after:
function makeDocumentFragment(htmlText) {
var range = document.createRange();
var frag = range.createContextualFragment(htmlText);
var d = document.createElement('div');
d.appendChild(frag);
return d;
}
Answered by Eli Grey
I'm not sure if IE supports document.implementation.createHTMLDocument, but if it does, use this algorithm (adapted from my DOMParser HTML extension). Note that the DOCTYPE will not be preserved:
var
doc = document.implementation.createHTMLDocument("")
, doc_elt = doc.documentElement
, first_elt
;
doc_elt.innerHTML = your_html_here;
first_elt = doc_elt.firstElementChild;
if ( // are we dealing with an entire document or a fragment?
doc_elt.childElementCount === 1
&& first_elt.tagName.toLowerCase() === "html"
) {
doc.replaceChild(first_elt, doc_elt);
}
// doc is an HTML document
// you can now reference stuff like doc.title, etc.
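Because the answer is unsure about IE support, the algorithm above is probably best guarded by a feature test. The sketch below uses a hypothetical helper name, canCreateHTMLDocument, and takes the document as a parameter. As far as I'm aware, createHTMLDocument only arrived in Internet Explorer 9, so on IE 7 and 8 the check would be expected to return false and another approach would be needed.

```javascript
// Hypothetical feature test for document.implementation.createHTMLDocument.
// A truthiness check is used instead of typeof, because old IE versions
// may report host methods as "object" rather than "function".
function canCreateHTMLDocument(doc) {
  return !!(doc && doc.implementation && doc.implementation.createHTMLDocument);
}
```

In a page this would be called as canCreateHTMLDocument(document) before running the algorithm above.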
Answered by Dr.Molle
Answered by Javier Pedemonte
DocumentFragment doesn't support getElementsByTagName -- that's only supported by Document.
You may need to use a library like jsdom, which provides an implementation of the DOM and through which you can search using getElementsByTagName and other DOM APIs. And you can set it to not execute scripts. Yes, it's 'heavy' and I don't know if it works in IE 7.
Answered by Pebbl
Just wandered across this page, am a bit late to be of any use :) but the following should help anyone with a similar problem in future... however IE7/8 should really be ignored by now and there are much better methods supported by the more modern browsers.
The following works across nearly everything I've tested - the only two downsides are:
- I've added bespoke getElementById and getElementsByName functions to the root div element, so these won't appear as expected further down the tree (unless the code is modified to cater for this).
- The doctype will be ignored - however I don't think this will make much difference, as my experience is that the doctype won't affect how the DOM is structured, just how it is rendered (which obviously won't happen with this method).
Basically the system relies on the fact that <tag> and <namespace:tag> are treated differently by user agents. As has been found, certain special tags cannot exist within a div element, and so they are removed. Namespaced elements can be placed anywhere (unless there is a DTD stating otherwise). Whilst these namespaced tags won't actually behave as the real tags in question, considering we are only really using them for their structural position in the document, it doesn't really cause a problem.
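The renaming step described here can be tried in isolation, without a browser. The sketch below is a simplified stand-in rather than the answer's exact code: namespaceSpecialTags is a hypothetical name, and it adds a word-boundary check (my own addition) so that a tag such as <header> is not mistaken for <head>.

```javascript
// Hypothetical helper: prefixes the "special" tags with a namespace so they
// survive being injected into a div via innerHTML.
function namespaceSpecialTags(src, prefix) {
  var special = ["html", "head", "body", "title"];
  // $1 = optional closing slash, $2 = tag name, $3 = attributes (if any).
  var re = new RegExp("<(/?)(" + special.join("|") + ")\\b([^>]*)>", "gi");
  return src.replace(re, "<$1" + prefix + "$2$3>");
}

var namespaced = namespaceSpecialTags(
  '<html><head><title>T</title></head><body id="b"><header>h</header></body></html>',
  "faux:"
);
```

The closing slash and the attributes are carried over unchanged, so only the tag name differs from the original markup.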
markup and code are as follows:
<!DOCTYPE html>
<html>
<head>
<script>
/// function for parsing HTML source to a dom structure
/// Tested in Mac OSX, Win 7, Win XP with FF, IE 7/8/9,
/// Chrome, Safari & Opera.
function parseHTML(src){
/// create a random div, this will be our root
var div = document.createElement('div'),
/// specify our namespace prefix
ns = 'faux:',
/// state which tags we will treat as "special"
stn = ['html','head','body','title'],
/// the reg exp for replacing the special tags
re = new RegExp('<(/?)('+stn.join('|')+')([^>]*)?>','gi'),
/// remember the getElementsByTagName function before we override it
gtn = div.getElementsByTagName;
/// a quick function to namespace certain tag names
var nspace = function(tn){
if ( stn.indexOf ) {
return stn.indexOf(tn) != -1 ? ns + tn : tn;
}
else {
return ('|'+stn.join('|')+'|').indexOf('|'+tn+'|') != -1 ? ns + tn : tn;
}
};
/// search and replace our source so that special tags are namespaced
/// required for IE7/8 to render tags before first text found
/// <faux:check /> tag added so we can test how namespaces work
src = ' <'+ns+'check />' + src.replace(re,'<$1'+ns+'$2$3>');
/// inject to the div
div.innerHTML = src;
/// quick test to see how we support namespaces in TagName searches
if ( !div.getElementsByTagName(ns+'check').length ) {
ns = '';
}
/// create our replacement getByName and getById functions
var createGetElementByAttr = function(attr, collect){
var func = function(a,w){
var i,c,e,f,l,o; w = w||[];
if ( this.nodeType == 1 ) {
if ( this.getAttribute(attr) == a ) {
if ( collect ) {
w.push(this);
}
else {
return this;
}
}
}
else {
return false;
}
if ( (c = this.childNodes) && (l = c.length) ) {
for( i=0; i<l; i++ ){
if( (e = c[i]) && (e.nodeType == 1) ) {
if ( (f = func.call( e, a, w )) && !collect ) {
return f;
}
}
}
}
return (w.length?w:false);
}
return func;
}
/// apply these replacement functions to the div container, obviously
/// you could add these to prototypes for browsers that support element
/// constructors. For other browsers you could step each element and
/// apply the functions through-out the node tree... however this would
/// be quite messy, far better just to always call from the root node -
/// or use div.getElementsByTagName.call( localElement, 'tag' );
div.getElementsByTagName = function(t){return gtn.call(this,nspace(t));}
div.getElementsByName = createGetElementByAttr('name', true);
div.getElementById = createGetElementByAttr('id', false);
/// return the final element
return div;
}
window.onload = function(){
/// parse the HTML source into a node tree
var dom = parseHTML( document.getElementById('source').innerHTML );
/// test some look ups :)
var a = dom.getElementsByTagName('head'),
b = dom.getElementsByTagName('title'),
c = dom.getElementsByTagName('script'),
d = dom.getElementById('body');
/// alert the result
alert(a[0].innerHTML);
alert(b[0].innerHTML);
alert(c[0].innerHTML);
alert(d.innerHTML);
}
</script>
</head>
<body>
<xmp id="source">
<!DOCTYPE html>
<html>
<head>
<!-- Comment //-->
<meta charset="utf-8">
<meta name="robots" content="index, follow">
<title>An example</title>
<link href="test.css" />
<script>alert('of parsing..');</script>
</head>
<body id="body">
<b>in a similar way to createDocumentFragment</b>
</body>
</html>
</xmp>
</body>
</html>
Answered by Jérémy Lal
To use full HTML DOM abilities without triggering requests, without having to deal with incompatibilities:
var doc = document.cloneNode();
if (!doc.documentElement) {
doc.appendChild(doc.createElement('html'));
doc.documentElement.appendChild(doc.createElement('head'));
doc.documentElement.appendChild(doc.createElement('body'));
}
All set! doc is an HTML document, but it is not online.