Parse HTML content in VBA
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors on Stack Overflow (not to this page).
Original question: http://stackoverflow.com/questions/25488687/
Asked by Tdev
I have a question about HTML parsing. I have a website listing some products, and I would like to capture text from each product page into my current spreadsheet. The spreadsheet is quite big; it contains the ItemNbr in the 3rd column, I want the text in the 14th column, and each row corresponds to one product (item).
My idea is to fetch the 'Material' value on the webpage, which is the inner text of the cell that follows the td titled "Material". The id number of the row changes from one page to another.
Here is the structure of the website:
<div style="position:relative;">
    <div></div>
    <table id="list-table" width="100%" tabindex="1" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_list-table" class="ui-jqgrid-btable" style="width: 930px;">
        <tbody>
            <tr class="jqgfirstrow" role="row" style="height:auto">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="1" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="2" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="3" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="4" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="5" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="6" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="7" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td role="gridcell" style="padding-left:10px" title="Material" aria-describedby="list-table_">Material</td>
                <td role="gridcell" style="" title="600D polyester." aria-describedby="list-table_">600D polyester.</td>
            </tr>           
            <tr ...>
            </tr>
        </tbody>
    </table> </div>
I would like to get "600D polyester." as a result.
My (non-working) code snippet is as follows:
Sub ParseMaterial()
    Dim Cell As Integer
    Dim ItemNbr As String
    Dim AElement As Object
    Dim AElements As IHTMLElementCollection
    Dim IE As MSXML2.XMLHTTP60
    Set IE = New MSXML2.XMLHTTP60
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim HTMLBody As MSHTML.HTMLBody
    Set HTMLDoc = New MSHTML.HTMLDocument
    Set HTMLBody = HTMLDoc.body
    For Cell = 1 To 5                            'I iterate through the file row by row
        ItemNbr = Cells(Cell, 3).Value           'ItemNbr is in the 3rd column of my spreadsheet
        IE.Open "GET", "http://www.example.com/?item=" & ItemNbr, False
        IE.send
        While IE.ReadyState <> 4
            DoEvents
        Wend
        HTMLBody.innerHTML = IE.responseText
        Set AElements = HTMLDoc.getElementById("list-table").getElementsByTagName("tr")
        For Each AElement In AElements
            If AElement.Title = "Material" Then
                Cells(Cell, 14) = AElement.nextNode.value     'I write the material in the 14th column
            End If
        Next AElement
        Application.Wait (Now + TimeValue("0:00:2"))
    Next Cell
End Sub
Thanks for your help !
Accepted answer by IAmDranged
Just a couple of things that will hopefully point you in the right direction:
- Clean up a bit: remove the loop that tests the readyState property. Because the request is opened synchronously (the last argument to Open is False), the code pauses at the send instruction and resumes only once the server response has been received or the request has failed, so readyState will not change while the loop runs. You should still test the ready state once after send, but the loop is unnecessary.
- Target the right HTML elements: you are searching through the tr elements, while the way you use these elements in your code (testing the title, then reading the next node) actually points to td elements.
- Make sure the properties you call actually exist on the objects you call them on. To help with this, declare your variables as specific object types instead of the generic Object; this activates IntelliSense. If you have a hard time finding the name of the type as defined in the relevant library in the first place, declare the variable as the generic Object, run your code, and then inspect the type of the object, for instance by printing TypeName(your_object) to the Immediate window. This should put you on your way.
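The TypeName tip in the last point can be tried without touching the network. The following is only a minimal sketch: the function name FirstElementTypeName is hypothetical, and it assumes a late-bound document created with CreateObject("htmlfile") instead of the MSHTML reference used elsewhere in this thread.

```vba
' Returns the runtime type name of the first element with the given tag
' in an HTML fragment, or "" if the fragment contains no such element.
' (Hypothetical helper, not part of the original answer.)
Function FirstElementTypeName(ByVal html As String, ByVal tagName As String) As String
    Dim doc As Object
    Dim elements As Object

    Set doc = CreateObject("htmlfile")      ' late-bound HTML document
    doc.body.innerHTML = html
    Set elements = doc.getElementsByTagName(tagName)

    If elements.Length > 0 Then
        FirstElementTypeName = TypeName(elements.Item(0))
    End If
End Function

Sub DemoInspectType()
    ' Prints the type name to the Immediate window (Ctrl+G in the VBA editor)
    Debug.Print FirstElementTypeName( _
        "<table><tr><td title=""Material"">Material</td></tr></table>", "td")
End Sub
```

The name that gets printed is the one to look up in the Object Browser and then use in the Dim statement, so that IntelliSense lists the members that really exist on that object.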
I have also included some code below that may help. If you still can't get this to work and you can share your URLs, please do.
Sub getInfoWeb()
    Dim cell As Integer
    Dim ItemNbr As String
    Dim xhr As MSXML2.XMLHTTP60
    Dim doc As MSHTML.HTMLDocument
    Dim table As MSHTML.HTMLTable
    Dim tableCells As MSHTML.IHTMLElementCollection
    Dim tableCell As MSHTML.HTMLTableCell
    Set xhr = New MSXML2.XMLHTTP60
    For cell = 1 To 5
        ItemNbr = Cells(cell, 3).Value
        Set doc = Nothing   'so that a failed request does not reuse the previous page
        With xhr
            .Open "GET", "http://www.example.com/?item=" & ItemNbr, False
            .send
            If .readyState = 4 And .Status = 200 Then
                Set doc = New MSHTML.HTMLDocument
                doc.body.innerHTML = .responseText
            Else
                MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
                vbNewLine & "HTTP request status: " & .Status
            End If
        End With
        If Not doc Is Nothing Then
            Set table = doc.getElementById("list-table")
            Set tableCells = table.getElementsByTagName("td")
            For Each tableCell In tableCells
                'The value sits in the cell right after the one titled "Material"
                If tableCell.getAttribute("title") = "Material" Then
                    Cells(cell, 14).Value = tableCell.NextSibling.innerHTML
                End If
            Next tableCell
        End If
    Next cell
End Sub
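The lookup itself (find the td titled "Material", then read the cell next to it) can also be pulled out into a function that takes the HTML as a string, which makes it easy to try against a saved copy of the page. This is only a sketch under stated assumptions: CellValueAfterLabel is a hypothetical name, the document is late-bound via CreateObject("htmlfile"), and the sibling loop skips any non-element nodes in case the parser keeps whitespace between cells.

```vba
' Returns the text of the first element that follows the td whose title
' attribute equals label, or "" when no such cell exists.
' (Hypothetical helper, not part of the original answer.)
Function CellValueAfterLabel(ByVal html As String, ByVal label As String) As String
    Dim doc As Object
    Dim td As Object
    Dim node As Object

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = html

    For Each td In doc.getElementsByTagName("td")
        If td.getAttribute("title") = label Then
            ' Walk the siblings until an element node (nodeType 1) is found
            Set node = td.nextSibling
            Do While Not node Is Nothing
                If node.nodeType = 1 Then
                    CellValueAfterLabel = node.innerText
                    Exit Function
                End If
                Set node = node.nextSibling
            Loop
        End If
    Next td
End Function
```

Inside the request loop this would be called as Cells(cell, 14).Value = CellValueAfterLabel(.responseText, "Material"), with the empty string signalling that the label was not found.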
EDIT: as a follow-up to the further information you provided in the comment below, plus the additional comments I have added:
'Determine your product number
    'Open an xhr for your source url, and retrieve the product number from there - search for the tag whose
    'text includes the "productnummer:" substring, and extract the product number from that text
    'OR
    'if the product number consistently consists of the fctkeywords you are entering in your source url
    'with two "0" appended - just build the product number like that
'Open a new xhr for this url "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=" & product_number & "&_search=false&rows=-1&page=1&sidx=&sord=asc"
'Load the response in an XML document, and retrieve the material information
Sub getInfoWeb()
    Dim xhr As MSXML2.XMLHTTP60
    Dim doc As MSXML2.DOMDocument60
    Dim xmlCell As MSXML2.IXMLDOMElement
    Dim xmlCells As MSXML2.IXMLDOMNodeList
    Dim materialValueElement As MSXML2.IXMLDOMElement
    Set xhr = New MSXML2.XMLHTTP60
    With xhr
        .Open "GET", "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=10031700&_search=false&rows=-1&page=1&sidx=&sord=asc", False
        .send
        If .readyState = 4 And .Status = 200 Then
            Set doc = New MSXML2.DOMDocument60
            doc.LoadXML .responseText
        Else
            MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
            vbNewLine & "HTTP request status: " & .Status
            Exit Sub    'nothing to parse
        End If
    End With
    Set xmlCells = doc.getElementsByTagName("cell")
    For Each xmlCell In xmlCells
        'The value is in the cell element right after the "Materiaal" label
        If xmlCell.Text = "Materiaal" Then
            Set materialValueElement = xmlCell.NextSibling
        End If
    Next
    If materialValueElement Is Nothing Then
        MsgBox "No 'Materiaal' cell found"
    Else
        MsgBox materialValueElement.Text
    End If
End Sub
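The second option from the comments above (appending two zeros to the fctkeywords value) and the XML lookup could be sketched as small functions that work on strings. All three function names are hypothetical. The sample XML used to try the sketch is an assumption about the shape of the response; only the "cell" element name comes from the code above.

```vba
' Builds the product number by appending two zeros to the search keyword,
' e.g. "100317" -> "10031700". (Hypothetical helper.)
Function ProductNumberFromKeyword(ByVal keyword As String) As String
    ProductNumberFromKeyword = keyword & "00"
End Function

' Builds the product specification url quoted in the comments above.
Function SpecUrl(ByVal productNumber As String) As String
    SpecUrl = "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=" & _
              productNumber & "&_search=false&rows=-1&page=1&sidx=&sord=asc"
End Function

' Returns the text of the <cell> element that follows the <cell> whose text
' equals label, or "" if the XML cannot be parsed or the label is missing.
Function CellTextAfter(ByVal xmlText As String, ByVal label As String) As String
    Dim doc As Object
    Dim cell As Object
    Dim nextCell As Object

    Set doc = CreateObject("MSXML2.DOMDocument.6.0")    ' late-bound MSXML 6
    If Not doc.loadXML(xmlText) Then Exit Function      ' parse error

    For Each cell In doc.getElementsByTagName("cell")
        If cell.Text = label Then
            Set nextCell = cell.nextSibling
            If Not nextCell Is Nothing Then CellTextAfter = nextCell.Text
            Exit Function
        End If
    Next cell
End Function
```

With these, the request url becomes SpecUrl(ProductNumberFromKeyword("100317")), and the material is CellTextAfter(.responseText, "Materiaal"). Whether the two-zero rule holds for every product is something to verify against the site.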
EDIT2: an alternative that automates IE
Sub searchWebViaIE()
    Dim ie As SHDocVw.InternetExplorer
    Dim doc As MSHTML.HTMLDocument
    Dim anchors As MSHTML.IHTMLElementCollection
    Dim anchor As MSHTML.HTMLAnchorElement
    Dim prodSpec As MSHTML.HTMLAnchorElement
    Dim tableCells As MSHTML.IHTMLElementCollection
    Dim materialValueElement As MSHTML.HTMLTableCell
    Dim tableCell As MSHTML.HTMLTableCell
    Set ie = New SHDocVw.InternetExplorer
    With ie
        .navigate "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2facetmain.p?fctkeywords=100317&world=general#tabs-4"
        .Visible = True
        Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
            DoEvents
        Loop
        Set doc = .document
        Set anchors = doc.getElementsByTagName("a")
        For Each anchor In anchors
            If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
                anchor.Click
                Exit For
            End If
        Next anchor
        Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
            DoEvents
        Loop
    End With
    For Each anchor In anchors
        If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
            Set prodSpec = anchor
        End If
    Next anchor
    Set tableCells = doc.getElementById("list-table").getElementsByTagName("td")
    If Not tableCells Is Nothing Then
        For Each tableCell In tableCells
            If tableCell.innerHTML = "Materiaal" Then
                Set materialValueElement = tableCell.NextSibling
            End If
        Next tableCell
    End If
    MsgBox materialValueElement.innerHTML
End Sub
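The two wait loops in the IE example run until the page reports that it is complete, so they would spin forever if the page never finishes loading. One possible variation, sketched here with hypothetical names (HasTimedOut, WaitForPage) and a caller-chosen timeout, gives up after a number of seconds. It assumes the ie argument exposes readyState and Busy the way SHDocVw.InternetExplorer does, and uses 4 for the complete state.

```vba
' True when at least timeoutSeconds have passed between two Timer readings.
' Timer restarts at 0 at midnight, so a negative difference is wrapped.
Function HasTimedOut(ByVal startTime As Single, ByVal nowTime As Single, _
                     ByVal timeoutSeconds As Single) As Boolean
    Dim elapsed As Single
    elapsed = nowTime - startTime
    If elapsed < 0 Then elapsed = elapsed + 86400   ' seconds in a day
    HasTimedOut = (elapsed >= timeoutSeconds)
End Function

' Waits until the browser has finished loading.
' Returns True on success and False if the timeout expired first.
Function WaitForPage(ByVal ie As Object, ByVal timeoutSeconds As Single) As Boolean
    Const STATE_COMPLETE As Long = 4    ' same value as READYSTATE_COMPLETE
    Dim startTime As Single

    startTime = Timer
    Do While ie.readyState <> STATE_COMPLETE Or ie.Busy
        If HasTimedOut(startTime, Timer, timeoutSeconds) Then Exit Function
        DoEvents
    Loop
    WaitForPage = True
End Function
```

Each Do While loop in the example could then be replaced by something like If Not WaitForPage(ie, 30) Then Exit Sub, where 30 seconds is an arbitrary choice.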
Answer by Jean-Marc
Not related to tables or Excel (I use MS Access 2013), but directly related to the topic title. My solution is:
Private Sub Sample(urlSource)
    Dim httpRequest As New WinHttpRequest
    Dim doc As MSHTML.HTMLDocument
    Dim tags As MSHTML.IHTMLElementCollection
    Dim tag As MSHTML.HTMLAnchorElement
    Dim i As Long
    httpRequest.Option(WinHttpRequestOption_UserAgentString) = "Mozilla/4.0 (compatible;MSIE 7.0; Windows NT 6.0)"
    httpRequest.Open "GET", urlSource
    httpRequest.send ' fetching webpage
    Set doc = New MSHTML.HTMLDocument
    doc.body.innerHTML = httpRequest.responseText
    Set tags = doc.getElementsByTagName("a")
    i = 1
    For Each tag In tags
        Debug.Print i
        Debug.Print tag.href
        Debug.Print tag.innerText
        'Debug.Print tag.Attributes("any other attributes you need")() ' may return an object
        i = i + 1
        If i Mod 50 = 0 Then Stop ' pause every 50 links to inspect the output
        ' or code to store results in a table
    Next
End Sub
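For the "code to store results" comment in the loop above, one simple possibility is to collect the link texts in a Collection and hand that back to the caller, who can then write it to a table or a worksheet. This is only a sketch: CollectLinkTexts is a hypothetical name, and it uses a late-bound CreateObject("htmlfile") document so that it can be tried on a plain HTML string without any request.

```vba
' Returns the inner text of every <a> element in an HTML fragment,
' in document order. (Hypothetical helper, not part of the original answer.)
Function CollectLinkTexts(ByVal html As String) As Collection
    Dim doc As Object
    Dim anchor As Object
    Dim result As New Collection

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = html

    For Each anchor In doc.getElementsByTagName("a")
        result.Add anchor.innerText
    Next anchor

    Set CollectLinkTexts = result
End Function
```

Calling CollectLinkTexts(httpRequest.responseText) after the send would give a Collection to loop over; the same pattern works for anchor.href if the addresses are needed as well.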

