Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/27786717/

Time: 2020-09-12 05:42:35  Source: igfitidea

Scrape html data within a <tr> or <td> tag in VBA

Tags: html, excel-vba, web-scraping, vba, excel

Asked by kamelkid2

<tr>
    <td>Tanks:<br /><i>Lost:<br />Destroyed:</i></td>
    <td>750<br /><i>6<br />18</i></td>
</tr>
<tr>
    <td>Tanks:<br /><i>Lost:<br />Destroyed:</i></td>
    <td>750<br /><i>6<br />18</i></td>
</tr>

I am trying to scrape data, in VBA, from a website whose HTML is structured like this. The value I want is "750"; however, it can sometimes be 0, 1,000,000, or any number in between, so extracting a fixed number of characters won't work.

Can anyone give some insight into the best way to scrape this? Below is my code, which imports all of the text as is, but the logic to post-process it and trim out the data of interest is proving very difficult, so I am looking for a nice, clean way to scrape just the 750 slot.

Set elems = IE.document.getElementsByTagName("tr")
For Each e In elems

    If e.innerText Like "Tanks:*" Then
        MsgBox e.innerText 'show the text of the row (e itself is an object)
    End If

Next e

Answered by Matteo NNZ

Within the row (tr), the content you want seems to always be in the second td, and it is the first content before the line break <br />. The stable structure of your HTML seems to be:

<tr>
    <td>
    </td>

    <td> 'we look for the first stuff inside here, before the <br /> comes
    </td>
</tr>

So, starting from your code:

Set elems = IE.document.getElementsByTagName("tr")
For Each e In elems

    If e.innerText Like "Tanks:*" Then 'finding the right <tr>

        'get full HTML inside the <tr></tr>
        fullHTML = e.innerHTML

        'first step: find the second <td> (vbTextCompare also matches <TD>)
        lookFor = "<td>"
        firstPos = InStr(1, fullHTML, lookFor, vbTextCompare)
        startPos = 0
        If firstPos > 0 Then startPos = InStr(firstPos + Len(lookFor), fullHTML, lookFor, vbTextCompare)

        If startPos > 0 Then
            'once found, we can take the string starting from your 750 until the end
            remainingHTML = Mid(fullHTML, startPos + Len(lookFor))
            'so now we read until we encounter the "<" of the line break tag
            endPos = InStr(1, remainingHTML, "<")
            If endPos > 0 Then
                myValue = Trim(Left(remainingHTML, endPos - 1))
                MsgBox myValue 'here is your 750, 1,000,000 or whatever else
            End If
        End If

    End If

Next e
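
The same two steps (find the second <td>, then read up to the next "<") can also be wrapped in a function, so the logic can be checked against the sample markup from the question without opening a browser. This is only a sketch: the names SecondCellFirstValue and DemoSecondCellFirstValue are made up for illustration, and it assumes the cells are written as plain <td> with no attributes.

```vba
' Hypothetical helper: returns the text that follows the second <td> in a
' row's HTML, up to the next tag. Returns "" when the pattern is not found.
' Assumption: the cells are plain <td> tags without attributes.
Function SecondCellFirstValue(ByVal rowHtml As String) As String
    Const CELL_TAG As String = "<td>"
    Dim firstPos As Long, secondPos As Long, endPos As Long
    Dim rest As String

    ' vbTextCompare makes the search case-insensitive, so <TD> matches too.
    firstPos = InStr(1, rowHtml, CELL_TAG, vbTextCompare)
    If firstPos = 0 Then Exit Function
    secondPos = InStr(firstPos + Len(CELL_TAG), rowHtml, CELL_TAG, vbTextCompare)
    If secondPos = 0 Then Exit Function

    ' Everything after the second <td>, then cut at the next tag.
    rest = Mid$(rowHtml, secondPos + Len(CELL_TAG))
    endPos = InStr(1, rest, "<")
    If endPos = 0 Then Exit Function

    ' Trim$ removes surrounding spaces only, not line breaks.
    SecondCellFirstValue = Trim$(Left$(rest, endPos - 1))
End Function

' Feeds the row from the question (with a larger value) to the helper.
Sub DemoSecondCellFirstValue()
    Dim sample As String
    sample = "<td>Tanks:<br /><i>Lost:<br />Destroyed:</i></td>" & _
             "<td>1,000,000<br /><i>6<br />18</i></td>"
    Debug.Print SecondCellFirstValue(sample) ' prints 1,000,000
End Sub
```

Because the value is cut at the next tag rather than at a fixed length, it does not matter whether the cell holds 0, 750 or 1,000,000.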

Please note that the parsing would be much easier through the DOM than through string functions. No JavaScript library is needed for this: the element objects returned by IE already expose their child elements, so you could just take the children of the row:

If e.innerText Like "Tanks:*" Then
    Set puppies = e.children 'object collections must be assigned with Set
    'puppies holds the two <td> elements: puppies(0) and puppies(1)
End If

Like this, you could directly parse the second element of the collection. NOTE: the code is not tested and might need to be revised while debugging to make it work properly. This is just an idea of how you can structure your parsing.
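
As a rough sketch of that idea, the second cell can be read through the collection and its text split at the first line break. The names FirstLineOfSecondCell and ShowTanks are hypothetical, the code is late-bound (everything is declared As Object), and it assumes that IE reports each <br /> in innerText as a line break; like the code above, it has not been run against the real page.

```vba
' Hypothetical helper: reads the second cell of a table row through the DOM
' and returns the text that comes before the first line break.
Function FirstLineOfSecondCell(ByVal row As Object) As String
    Dim cellText As String
    Dim parts() As String

    ' Rows with fewer than two cells yield "".
    If row.children.Length < 2 Then Exit Function

    ' children is zero-based, so index 1 is the second <td>.
    cellText = row.children(1).innerText

    ' Normalise line endings, then keep what comes before the first break.
    cellText = Replace(cellText, vbCrLf, vbLf)
    cellText = Replace(cellText, vbCr, vbLf)
    parts = Split(cellText, vbLf)
    If UBound(parts) >= 0 Then FirstLineOfSecondCell = Trim$(parts(0))
End Function

' Usage: pass IE.document once the page has finished loading.
Sub ShowTanks(ByVal doc As Object)
    Dim e As Object
    For Each e In doc.getElementsByTagName("tr")
        If e.innerText Like "Tanks:*" Then MsgBox FirstLineOfSecondCell(e)
    Next e
End Sub
```

Reading innerText avoids searching the raw HTML for tags at all, so the result does not depend on how the browser serialises <td> or <br />.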
