Attribution: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors on Stack Overflow (not to this page).
Original source: http://stackoverflow.com/questions/4247838/
Best way to parse HTML in Javascript
Asked by elshae
I am having a lot of trouble learning RegExp and coming up with a good algorithm to do this. I have this string of HTML that I need to parse. Note that when I am parsing it, it is still a string object and not yet HTML on the browser as I need to parse it before it gets there. The HTML looks like this:
<html>
<head>
<title>Geoserver GetFeatureInfo output</title>
</head>
<style type="text/css">
table.featureInfo, table.featureInfo td, table.featureInfo th {
border:1px solid #ddd;
border-collapse:collapse;
margin:0;
padding:0;
font-size: 90%;
padding:.2em .1em;
}
table.featureInfo th {
padding:.2em .2em;
font-weight:bold;
background:#eee;
}
table.featureInfo td{
background:#fff;
}
table.featureInfo tr.odd td{
background:#eee;
}
table.featureInfo caption{
text-align:left;
font-size:100%;
font-weight:bold;
text-transform:uppercase;
padding:.2em .2em;
}
</style>
<body>
<table class="featureInfo2">
<tr>
<th class="dataLayer" colspan="5">Tibetan Villages</th>
</tr>
<!-- EOF Data Layer -->
<tr class="dataHeaders">
<th>ID</th>
<th>Latitude</th>
<th>Longitude</th>
<th>Place Name</th>
<th>English Translation</th>
</tr>
<!-- EOF Data Headers -->
<!-- Data -->
<tr>
<!-- Feature Info Data -->
<td>3394</td>
<td>29.1</td>
<td>93.15</td>
<td>??????????????</td>
<td>Dam Drongtso </td>
</tr>
<!-- EOF Feature Info Data -->
<!-- End Data -->
</table>
<br/>
</body>
</html>
and I need to get it like this:
3394,
29.1,
93.15,
??????????????,
Dam Drongtso
Basically an array. Even better if each value is matched to its field header and to the table it came from, which look like this:
Tibetan Villages
ID
Latitude
Longitude
Place Name
English Translation
Finding out JavaScript does not support wonderful mapping was a bummer and I have what I want working already. However it is VERY VERY hard coded and I'm thinking I should probably use RegExp to handle this better. Unfortunately I am having a real tough time :(. Here is my function to parse my string (very ugly IMO):
function parseHTML(html){
//Getting the layer name
alert(html);
//Lousy attempt at RegExp
var somestring = html.replace('/m//\<html\>+\<body\>//m/',' ');
alert(somestring);
var startPos = html.indexOf('<th class="dataLayer" colspan="5">');
var length = ('<th class="dataLayer" colspan="5">').length;
var endPos = html.indexOf('</th></tr><!-- EOF Data Layer -->');
var dataLayer = html.substring(startPos + length, endPos);
//Getting the data headers
startPos = html.indexOf('<tr class="dataHeaders">');
length = ('<tr class="dataHeaders">').length;
endPos = html.indexOf('</tr><!-- EOF Data Headers -->');
var newString = html.substring(startPos + length, endPos);
newString = newString.replace(/<th>/g, '');
newString = newString.substring(0, newString.lastIndexOf('</th>'));
var featureInfoHeaders = new Array();
featureInfoHeaders = newString.split('</th>');
//Getting the data
startPos = html.indexOf('<!-- Data -->');
length = ('<!-- Data -->').length;
endPos = html.indexOf('<!-- End Data -->');
newString = html.substring(startPos + length, endPos);
newString = newString.substring(0, newString.lastIndexOf('</tr><!-- EOF Feature Info Data -->'));
var featureInfoData = new Array();
featureInfoData = newString.split('</tr><!-- EOF Feature Info Data -->');
for(var s = 0; s < featureInfoData.length; s++){
startPos = featureInfoData[s].indexOf('<!-- Feature Info Data -->');
length = ('<!-- Feature Info Data -->').length;
endPos = featureInfoData[s].lastIndexOf('</td>');
featureInfoData[s] = featureInfoData[s].substring(startPos + length, endPos);
featureInfoData[s] = featureInfoData[s].replace(/<td>/g, '');
featureInfoData[s] = featureInfoData[s].split('</td>');
}//end for
alert(featureInfoData);
//Put all the feature info in one array
var featureInfo = new Array();
var len = featureInfoData.length;
for(var j = 0; j < len; j++){
featureInfo[j] = new Object();
featureInfo[j].id = featureInfoData[j][0];
featureInfo[j].latitude = featureInfoData[j][1];
featureInfo[j].longitude = featureInfoData[j][2];
featureInfo[j].placeName = featureInfoData[j][3];
featureInfo[j].translation = featureInfoData[j][4];
}//end for
//This can be ignored for now...
var string = redesignHTML(featureInfoHeaders, featureInfo);
return string;
}//end parseHTML
So, as you can see, if the content in that string ever changes, my code will break horribly. I want to avoid that as much as possible and try to write better code. I appreciate all the help and advice you can give me.
Answered by Ivo Wetzel
Do the following steps:
- Create a new documentFragment
- Put your HTML string in it
- Use selectors to get what you want
Why do all the parsing work yourself (which won't work anyway, since HTML is not parsable via RegExp) when you already have the best HTML parser available: the browser?
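The steps above could be sketched roughly as follows. This is only an illustration, not the answerer's own code: it assumes a browser that supports the `<template>` element (used here because a bare DocumentFragment has no `innerHTML` of its own), and the names `zipRow` and `parseFeatureInfo` are made up for the example. The selectors rely on the `dataHeaders` class from the question's markup.

```javascript
// Pure helper (hypothetical name): pair each header with the cell
// found at the same position in a data row.
function zipRow(headers, cells) {
  const row = {};
  headers.forEach((header, i) => {
    row[header] = cells[i];
  });
  return row;
}

// Browser-only (hypothetical name): let the browser parse the string
// into a DocumentFragment, then use selectors to pull out the data.
function parseFeatureInfo(html) {
  const template = document.createElement('template');
  template.innerHTML = html;          // the browser does the parsing
  const fragment = template.content;  // a DocumentFragment

  const textOf = (el) => el.textContent.trim();
  const headers = Array.from(
    fragment.querySelectorAll('tr.dataHeaders th'), textOf);

  const rows = [];
  fragment.querySelectorAll('tr').forEach((tr) => {
    const cells = Array.from(tr.querySelectorAll('td'), textOf);
    if (cells.length > 0) {           // skip rows that only hold <th> cells
      rows.push(zipRow(headers, cells));
    }
  });
  return rows;
}

// Only runs where a DOM exists (not in plain Node.js).
if (typeof document !== 'undefined') {
  const sample =
    '<table><tr class="dataHeaders"><th>ID</th><th>Latitude</th></tr>' +
    '<tr><td>3394</td><td>29.1</td></tr></table>';
  console.log(parseFeatureInfo(sample));
}
```

Because the lookup goes through selectors rather than string offsets, extra whitespace or comments in the markup should not matter, which is the fragility the question complains about.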
Answered by Gabriele Petrioli
You can use jQuery to easily traverse the DOM and automatically create an object with the same structure.
var $dom = $('<html>').html(the_html_string_variable_goes_here);
var featureInfo = {};
$('table:has(.dataLayer)', $dom).each(function(){
    var $tbl = $(this);
    var section = $tbl.find('.dataLayer').text();
    var obj = [];
    var $structure = $tbl.find('.dataHeaders');
    var structure = $structure.find('th').map(function(){return $(this).text().toLowerCase();});
    var $datarows= $structure.nextAll('tr');
    $datarows.each(function(i){
        obj[i] = {};
        $(this).find('td').each(function(index,element){
            obj[i][structure[index]] = $(element).text();
        });
    });
    featureInfo[section] = obj;
});
The code works with multiple tables that have different structures, and with multiple data rows inside each table.
The featureInfo object holds the final structure and data. Because the header names are lowercased when the keys are built, it can be accessed like this:
alert( featureInfo['Tibetan Villages'][0]['english translation'] );
or
alert( featureInfo['Tibetan Villages'][0].id );
Answered by markasoftware
The "correct" way to do it is with DOMParser. Do it like this:
var parsed = new DOMParser().parseFromString(htmlString, 'text/html');
Or, if you're worried about browser compatibility, use the polyfill from the MDN documentation:
/*
* DOMParser HTML extension
* 2012-09-04
*
* By Eli Grey, http://eligrey.com
* Public domain.
* NO WARRANTY EXPRESSED OR IMPLIED. USE AT YOUR OWN RISK.
*/
/*! @source https://gist.github.com/1129031 */
/*global document, DOMParser*/
(function(DOMParser) {
"use strict";
var
DOMParser_proto = DOMParser.prototype
, real_parseFromString = DOMParser_proto.parseFromString
;
// Firefox/Opera/IE throw errors on unsupported types
try {
// WebKit returns null on unsupported types
if ((new DOMParser).parseFromString("", "text/html")) {
// text/html parsing is natively supported
return;
}
} catch (ex) {}
DOMParser_proto.parseFromString = function(markup, type) {
if (/^\s*text\/html\s*(?:;|$)/i.test(type)) {
var
doc = document.implementation.createHTMLDocument("")
;
if (markup.toLowerCase().indexOf('<!doctype') > -1) {
doc.documentElement.innerHTML = markup;
}
else {
doc.body.innerHTML = markup;
}
return doc;
} else {
return real_parseFromString.apply(this, arguments);
}
};
}(DOMParser));
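As a rough illustration of what to do with the parsed document, the sketch below pulls the layer name, the headers and the cell values out of markup shaped like the question's. It assumes a browser (DOMParser is not available in plain Node.js), and the function name `extractTable` is invented for this example.

```javascript
// Browser-only sketch (hypothetical name): parse the string with
// DOMParser and read the pieces the question asks for.
function extractTable(htmlString) {
  const doc = new DOMParser().parseFromString(htmlString, 'text/html');
  const textOf = (el) => el.textContent.trim();

  const layer = doc.querySelector('th.dataLayer');
  return {
    // null when the markup has no <th class="dataLayer">
    layer: layer ? textOf(layer) : null,
    headers: Array.from(doc.querySelectorAll('tr.dataHeaders th'), textOf),
    values: Array.from(doc.querySelectorAll('td'), textOf),
  };
}

// Only runs where DOMParser exists.
if (typeof DOMParser !== 'undefined') {
  const sample =
    '<table><tr><th class="dataLayer">Tibetan Villages</th></tr>' +
    '<tr class="dataHeaders"><th>ID</th></tr>' +
    '<tr><td>3394</td></tr></table>';
  console.log(extractTable(sample));
}
```

Note that `values` is a flat list of every `<td>` in the document, which matches the simple array the question asks for; with several data rows you would probably want to group the cells per `<tr>` instead.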
Answered by Robert Koritnik
Change server-side code if you can (add JSON)
If you're the one who generates the resulting HTML on the server side, you could just as well generate JSON there and pass it inside the HTML along with the content. You wouldn't have to parse anything on the client side, and all data would be immediately available to your client scripts.
You could easily put JSON in the table element as a data attribute value:
<table class="featureInfo2" data-json='{"ID": 3394, "Latitude": 29.1, "Longitude": 93.15, "PlaceName": "??????????????", "Translation": "Dam Drongtso"}'>
...
</table>
Or you could add data attributes to the TDs that contain data, select only those with jQuery selectors, and build a JavaScript object out of them. No need for RegExp parsing.
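On the client, reading such an attribute back might look like the sketch below. It assumes the server writes strict JSON into the attribute (double-quoted keys and strings, with the attribute itself wrapped in single quotes), since `JSON.parse` rejects unquoted keys. The names `parseDataJson` and `readFeature` are hypothetical.

```javascript
// Pure helper (hypothetical name): decode the JSON text carried in a
// data-json attribute. Returns null instead of throwing on bad input.
function parseDataJson(attributeValue) {
  try {
    return JSON.parse(attributeValue);
  } catch (err) {
    return null;
  }
}

// Browser-only (hypothetical name): read the attribute from the first
// table under `root` that carries one. `root` is a Document or Element.
function readFeature(root) {
  const table = root.querySelector('table[data-json]');
  return table ? parseDataJson(table.getAttribute('data-json')) : null;
}

const feature = parseDataJson(
  '{"ID": 3394, "Latitude": 29.1, "Translation": "Dam Drongtso"}');
console.log(feature.Translation); // prints "Dam Drongtso"

// Unquoted keys are not valid JSON, so this yields null:
console.log(parseDataJson('{ID:3394}')); // prints null
```

In a page you would call `readFeature(document)` once the table is in the DOM, or pass it a document produced by parsing the response string.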
Answered by adardesign
Use John Resig's* pure JavaScript HTML parser.
See demo here
*John Resig is the creator of jQuery
Answered by kelceyp
I had a similar requirement and, not being that experienced with JavaScript, I let jQuery handle it for me using parseHTML and find. In my case I was looking for divs with a particular class name.
function findElementsInHtmlString(document, htmlString, query) {
    var domArray = $.parseHTML(htmlString, document),
        dom = $();
    // create the dom collection from the array
    $.each(domArray, function(i, o) {
        dom = dom.add(o);
    });
    // return a collection of elements that match the query
    // (find() only searches descendants of the top-level nodes)
    return dom.find(query);
}
var elementsWithClassBuild = findElementsInHtmlString(document, htmlString, '.build');