Best way to parse HTML in Javascript

Question

Asked by elshae

I am having a lot of trouble learning RegExp and coming up with a good algorithm to do this. I have this string of HTML that I need to parse. Note that when I am parsing it, it is still a string object and not yet HTML on the browser as I need to parse it before it gets there. The HTML looks like this:

<html>
  <head>
    <title>Geoserver GetFeatureInfo output</title>
  </head>
  <style type="text/css">
    table.featureInfo, table.featureInfo td, table.featureInfo th {
        border:1px solid #ddd;
        border-collapse:collapse;
        margin:0;
        padding:0;
        font-size: 90%;
        padding:.2em .1em;
    }
    table.featureInfo th {
        padding:.2em .2em;
        font-weight:bold;
        background:#eee;
    }
    table.featureInfo td{
        background:#fff;
    }
    table.featureInfo tr.odd td{
        background:#eee;
    }
    table.featureInfo caption{
        text-align:left;
        font-size:100%;
        font-weight:bold;
        text-transform:uppercase;
        padding:.2em .2em;
    }
  </style>

  <body>
    <table class="featureInfo2">
    <tr>
        <th class="dataLayer" colspan="5">Tibetan Villages</th>
    </tr>
    <!-- EOF Data Layer -->
    <tr class="dataHeaders">
        <th>ID</th>
        <th>Latitude</th>
        <th>Longitude</th>
        <th>Place Name</th>
        <th>English Translation</th>
    </tr>
    <!-- EOF Data Headers -->
    <!-- Data -->
    <tr>
    <!-- Feature Info Data -->
        <td>3394</td>
        <td>29.1</td>
        <td>93.15</td>
        <td>??????????????</td>
        <td>Dam Drongtso </td>
    </tr>
    <!-- EOF Feature Info Data -->
    <!-- End Data -->
    </table>
    <br/>
  </body>
</html>

and I need to get it like this:

3394,
29.1,
93.15,
??????????????,
Dam Drongtso

Basically an array...even better if it matches according to its field headers and from which table they are somehow, which look like this:

Tibetan Villages

ID
Latitude
Longitude
Place Name
English Translation

Finding out JavaScript does not support wonderful mapping was a bummer and I have what I want working already. However it is VERY VERY hard coded and I'm thinking I should probably use RegExp to handle this better. Unfortunately I am having a real tough time :(. Here is my function to parse my string (very ugly IMO):

    function parseHTML(html){

    //Getting the layer name
    alert(html);
    //Lousy attempt at RegExp
    var somestring = html.replace('/m//\<html\>+\<body\>//m/',' ');
    alert(somestring);
    var startPos = html.indexOf('<th class="dataLayer" colspan="5">');
    var length = ('<th class="dataLayer" colspan="5">').length;
    var endPos = html.indexOf('</th></tr><!-- EOF Data Layer -->');
    var dataLayer = html.substring(startPos + length, endPos);

    //Getting the data headers
    startPos = html.indexOf('<tr class="dataHeaders">');
    length = ('<tr class="dataHeaders">').length;
    endPos = html.indexOf('</tr><!-- EOF Data Headers -->');
    var newString = html.substring(startPos + length, endPos);
    newString = newString.replace(/<th>/g, '');
    newString = newString.substring(0, newString.lastIndexOf('</th>'));
    var featureInfoHeaders = new Array();
    featureInfoHeaders = newString.split('</th>');

    //Getting the data
    startPos = html.indexOf('<!-- Data -->');
    length = ('<!-- Data -->').length;
    endPos = html.indexOf('<!-- End Data -->');
    newString = html.substring(startPos + length, endPos);
    newString = newString.substring(0, newString.lastIndexOf('</tr><!-- EOF Feature Info Data -->'));
    var featureInfoData = new Array();
    featureInfoData = newString.split('</tr><!-- EOF Feature Info Data -->');

    for(var s = 0; s < featureInfoData.length; s++){
        startPos = featureInfoData[s].indexOf('<!-- Feature Info Data -->');
        length = ('<!-- Feature Info Data -->').length;
        endPos = featureInfoData[s].lastIndexOf('</td>');
        featureInfoData[s] = featureInfoData[s].substring(startPos + length, endPos);
        featureInfoData[s] = featureInfoData[s].replace(/<td>/g, '');
        featureInfoData[s] = featureInfoData[s].split('</td>');
    }//end for

    alert(featureInfoData);

    //Put all the feature info in one array
    var featureInfo = new Array();
    var len = featureInfoData.length;
    for(var j = 0; j < len; j++){
        featureInfo[j] = new Object();
        featureInfo[j].id = featureInfoData[j][0];
        featureInfo[j].latitude = featureInfoData[j][1];
        featureInfo[j].longitude = featureInfoData[j][2];
        featureInfo[j].placeName = featureInfoData[j][3];
        featureInfo[j].translation = featureInfoData[j][4];
        }//end for 

    //This can be ignored for now...
        var string = redesignHTML(featureInfoHeaders, featureInfo);
        return string;

    }//end parseHTML

So as you can see if the content in that string ever changes, my code will be horribly broken. I want to avoid that as much as possible and try to write better code. I appreciate all the help and advice you can give me.

Answer 1

Answered by Ivo Wetzel

Do the following steps:

1. Create a new documentFragment
2. Put your HTML string in it
3. Use selectors to get what you want

Why do all the parsing work yourself (which won't work anyway, since HTML is not parsable via RegExp) when you have the best HTML parser available: the browser?
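
The three steps above can be sketched with a template element, whose content property is a DocumentFragment filled in by the browser's own parser. This is only a minimal sketch and not the answerer's code: the names htmlToFragment and cellTexts are made up for illustration, and the first function assumes a browser environment where document exists.

```javascript
// Steps 1 and 2: let the browser parse the string into a DocumentFragment.
// Assumes a browser; `document` does not exist in plain Node.js.
function htmlToFragment(htmlString) {
  const template = document.createElement('template');
  template.innerHTML = htmlString; // parsed, but never attached to the page
  return template.content;         // a DocumentFragment
}

// Step 3: use selectors. `root` is anything that offers querySelectorAll
// (a DocumentFragment, a Document, or an Element).
function cellTexts(root, selector) {
  return Array.from(root.querySelectorAll(selector), function (cell) {
    return cell.textContent.trim();
  });
}

// Usage, guarded so the snippet only runs where a DOM is available.
if (typeof document !== 'undefined') {
  const fragment = htmlToFragment(
    '<table><tr><th class="dataLayer">Tibetan Villages</th></tr>' +
    '<tr><td>3394</td><td>29.1</td><td>93.15</td></tr></table>'
  );
  console.log(cellTexts(fragment, 'th.dataLayer')); // the layer name
  console.log(cellTexts(fragment, 'td'));           // the data cells
}
```

Because the fragment is never inserted into the page, the string can be inspected before it reaches the visible document, which is what the question asks for.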

Answer 2

Answered by Gabriele Petrioli

You can use jQuery to easily traverse the DOM and create an object with the structure automatically.

// Parse the string into a detached DOM (nothing is added to the page)
var $dom = $('<html>').html(the_html_string_variable_goes_here);
var featureInfo = {};

// One entry per table that contains a .dataLayer header cell
$('table:has(.dataLayer)', $dom).each(function(){
    var $tbl = $(this);
    var section = $tbl.find('.dataLayer').text();
    var obj = [];
    var $structure = $tbl.find('.dataHeaders');
    // Lower-cased header texts become the property names; .get() yields a plain array
    var structure = $structure.find('th').map(function(){return $(this).text().toLowerCase();}).get();
    // Every row after the header row is treated as a data row
    var $datarows = $structure.nextAll('tr');
    $datarows.each(function(i){
        obj[i] = {};
        $(this).find('td').each(function(index,element){
            obj[i][structure[index]] = $(element).text();
        });
    });
    featureInfo[section] = obj;
});

Working Demo

The code works with multiple tables that have different structures, and with multiple data rows inside each table.

The featureInfo object will hold the final structure and data, and can be accessed like this:

alert( featureInfo['Tibetan Villages'][0]['english translation'] ); // keys are the lower-cased header texts

or

alert( featureInfo['Tibetan Villages'][0].id );
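
To get back the flat list of values the question asks for, a row of featureInfo can be read out in header order. The sketch below uses a hand-written featureInfo as a stand-in for the result of the jQuery code; both that sample object and the rowValues helper are assumptions made for illustration.

```javascript
// Hand-written stand-in for the structure the jQuery code builds.
// Keys are the lower-cased header texts.
const featureInfo = {
  'Tibetan Villages': [
    {
      'id': '3394',
      'latitude': '29.1',
      'longitude': '93.15',
      'english translation': 'Dam Drongtso'
    }
  ]
};

// Return the values of one row as a plain array, in insertion order.
function rowValues(row) {
  return Object.keys(row).map(function (key) {
    return row[key];
  });
}

const values = rowValues(featureInfo['Tibetan Villages'][0]);
console.log(values.join(', ')); // 3394, 29.1, 93.15, Dam Drongtso
```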

Answer 3

Answered by markasoftware

The "correct" way to do it is with DOMParser. Do it like this:

var parsed = new DOMParser().parseFromString(htmlString, 'text/html');

Or, if you're worried about browser compatibility, use the polyfill on the MDN documentation:

/*
 * DOMParser HTML extension
 * 2012-09-04
 * 
 * By Eli Grey, http://eligrey.com
 * Public domain.
 * NO WARRANTY EXPRESSED OR IMPLIED. USE AT YOUR OWN RISK.
 */

/*! @source https://gist.github.com/1129031 */
/*global document, DOMParser*/

(function(DOMParser) {
    "use strict";

    var
      DOMParser_proto = DOMParser.prototype
    , real_parseFromString = DOMParser_proto.parseFromString
    ;

    // Firefox/Opera/IE throw errors on unsupported types
    try {
        // WebKit returns null on unsupported types
        if ((new DOMParser).parseFromString("", "text/html")) {
            // text/html parsing is natively supported
            return;
        }
    } catch (ex) {}

    DOMParser_proto.parseFromString = function(markup, type) {
        if (/^\s*text\/html\s*(?:;|$)/i.test(type)) {
            var
              doc = document.implementation.createHTMLDocument("")
            ;
                if (markup.toLowerCase().indexOf('<!doctype') > -1) {
                    doc.documentElement.innerHTML = markup;
                }
                else {
                    doc.body.innerHTML = markup;
                }
            return doc;
        } else {
            return real_parseFromString.apply(this, arguments);
        }
    };
}(DOMParser));
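
With DOMParser available (natively or through the polyfill above), the returned Document can be queried like any other page. The following is a sketch under stated assumptions rather than part of the answer: zipRecord and parseFeatureRows are hypothetical names, the selector relies on the dataHeaders class from the question's markup, and the DOM part needs a browser.

```javascript
// Pair each header with the value found in the same column.
function zipRecord(headers, values) {
  const record = {};
  headers.forEach(function (header, i) {
    record[header] = values[i];
  });
  return record;
}

// Parse the HTML string and return one object per data row.
// Assumes a browser (DOMParser) and the markup shown in the question.
function parseFeatureRows(htmlString) {
  const doc = new DOMParser().parseFromString(htmlString, 'text/html');
  const text = function (node) { return node.textContent.trim(); };
  const headers = Array.from(doc.querySelectorAll('tr.dataHeaders th'), text);
  return Array.from(doc.querySelectorAll('tr'))
    .map(function (row) { return Array.from(row.querySelectorAll('td'), text); })
    .filter(function (cells) { return cells.length > 0; }) // skip rows without td cells
    .map(function (cells) { return zipRecord(headers, cells); });
}
```

In a browser, parseFeatureRows(htmlString)[0]['English Translation'] would then give the translation of the first data row, with no dependence on comments or exact tag positions in the string.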

Answer 4

Answered by Robert Koritnik

Change server-side code if you can (add JSON)

If you're the one who generates the resulting HTML on the server side, you could just as well generate JSON there and pass it inside the HTML along with the content. You wouldn't have to parse anything on the client side, and all data would be immediately available to your client scripts.

You could easily put JSON in the table element as a data attribute value:

<table class="featureInfo2" data-json='{"ID":3394, "Latitude":29.1, "Longitude":93.15, "PlaceName":"??????????????", "Translation":"Dam Drongtso"}'>
    ...
</table>

Or you could add data attributes to the TDs that contain data and parse only those, using jQuery selectors and generating a JavaScript object out of them. No need for RegExp parsing.
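
For the data-json variant, reading the attribute back comes down to a single JSON.parse call, provided the attribute holds strict JSON (keys and strings in double quotes). A hedged sketch: readFeature is a hypothetical helper, and the browser part assumes a table carrying a data-json attribute exists on the page.

```javascript
// Parse the text of a data-json attribute into an object.
// Strict JSON is required: keys and strings must be double-quoted.
function readFeature(attributeText) {
  return JSON.parse(attributeText);
}

const feature = readFeature(
  '{"ID": 3394, "Latitude": 29.1, "Translation": "Dam Drongtso"}'
);
console.log(feature.Translation); // Dam Drongtso

// In a browser the text would come from the element itself.
if (typeof document !== 'undefined') {
  const table = document.querySelector('table[data-json]');
  if (table) {
    console.log(readFeature(table.getAttribute('data-json')));
  }
}
```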

Answer 5

Answered by adardesign

Use John Resig's* pure JavaScript HTML parser

See demo here

*John Resig is the creator of jQuery

Answer 6

Answered by kelceyp

I had a similar requirement and, not being that experienced with JavaScript, I let jQuery handle it for me with parseHTML and find. In my case I was looking for divs with a particular class name.

function findElementsInHtmlString(document, htmlString, query) {
    var domArray = $.parseHTML(htmlString, document),
        dom = $();

    // create the dom collection from the array
    $.each(domArray, function(i, o) {
        dom = dom.add(o);
    });

    // return a collection of elements that match the query
    // (note: .find() only searches descendants of the top-level nodes)
    return dom.find(query);
}

var elementsWithClassBuild = findElementsInHtmlString(document, htmlString, '.build');
