Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/27786717/
Scrape html data within a <tr> or <td> tag in VBA
Asked by kamelkid2
<tr>
<td>Tanks:<br /><i>Lost:<br />Destroyed:</i></td>
<td>750<br /><i>6<br />18</i></td>
</tr>
I am trying to scrape data in VBA from a website whose HTML is structured like this. The value I want is "750"; however, it can sometimes be 0, 1,000,000, or any number in between, so extracting a fixed number of characters won't work.
Can anyone give some insight on the best way to scrape this? This is my code, which imports all of the text as is, but the logic to post-process and trim the data of interest is proving very difficult, so I am looking for a nice, clean way to scrape the 750 slot as is.
Set elems = IE.document.getElementsByTagName("tr")
For Each e In elems
    If e.innerText Like "Tanks:*" Then
        MsgBox e.innerText 'shows the whole text of the row
    End If
Next e
Answered by Matteo NNZ
Within the row (tr), the content you want seems to always be in the second td, and it is the first content before the line break <br />.
The stable structure of your HTML seems to be:
<tr>
    <td>
    </td>
    <td> 'we look for the first stuff inside here, before the <br /> comes
    </td>
</tr>
So, starting from your code:
Set elems = IE.document.getElementsByTagName("tr")
For Each e In elems
    If e.innerText Like "Tanks:*" Then 'finding the right <tr>
        'get full HTML inside the <tr></tr>
        fullHTML = e.innerHTML
        'first step: parsing until the second <td> comes out...
        lookFor = "<td>"
        startPos = 8 'we can ignore the first 4, we know that <td> is not the one we look for
        foundThis = Mid(fullHTML, startPos - 3, 4) 'store current 4 characters
        'LCase because IE may return the tags in upper case; the length check avoids an endless loop
        Do While LCase(foundThis) <> lookFor And startPos < Len(fullHTML)
            startPos = startPos + 1
            foundThis = Mid(fullHTML, startPos - 3, 4)
        Loop
        'once out, we can take the string starting from your 750 until the end
        remainingHTML = Mid(fullHTML, startPos + 1)
        'so now we parse until we encounter the "<" of the line break tag
        myValue = ""
        startPos = 1
        newParse = Mid(remainingHTML, startPos, 1)
        Do While newParse <> "<" And startPos <= Len(remainingHTML)
            myValue = myValue & newParse
            startPos = startPos + 1
            newParse = Mid(remainingHTML, startPos, 1)
        Loop
        MsgBox myValue 'here is your 750, 1,000,000 or whatever else
    End If
Next e
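The same two-step idea (locate the second <td>, then read up to the next "<") can also be sketched with VBA's built-in InStr and Mid functions instead of character-by-character loops. The function name SecondCellFirstLine is hypothetical and not part of the original answer. It assumes the row HTML contains at least two <td> cells and that the value is plain text directly after the second opening tag. Like the code above, it is a sketch rather than a tested solution for the real page.

```vba
'Hypothetical helper (not from the original answer): returns the text that
'sits between the second <td> of a row and the first tag that follows it.
Function SecondCellFirstLine(ByVal rowHTML As String) As String
    Dim firstTd As Long, secondTd As Long
    Dim valueStart As Long, valueEnd As Long

    'vbTextCompare makes the search case-insensitive (<td> or <TD>)
    firstTd = InStr(1, rowHTML, "<td", vbTextCompare)
    If firstTd = 0 Then Exit Function
    secondTd = InStr(firstTd + 3, rowHTML, "<td", vbTextCompare)
    If secondTd = 0 Then Exit Function

    'the value starts right after the ">" that closes the second <td ...>
    valueStart = InStr(secondTd, rowHTML, ">") + 1
    If valueStart = 1 Then Exit Function

    'and it ends right before the next "<" (the line break tag)
    valueEnd = InStr(valueStart, rowHTML, "<")
    If valueEnd = 0 Then valueEnd = Len(rowHTML) + 1

    SecondCellFirstLine = Trim$(Mid$(rowHTML, valueStart, valueEnd - valueStart))
End Function
```

Inside the loop it could be called as MsgBox SecondCellFirstLine(e.innerHTML). Because the end of the value is found by searching for "<" rather than by counting characters, the length of the number does not matter.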
Please note that the parsing would be much easier if you could reference a JavaScript library in your VBA project. In that case, you could just create a list of children:
If e.innerText Like "Tanks:*" Then
    Set puppies = e.children 'Set is needed because children is an object collection
    'puppies = ["<td></td>", "<td></td>"]
End If
Like this, you could directly parse the second element of the collection. Note: the code is not tested and might need to be revised while debugging to make it work properly. This is just an idea of how you can structure your parsing.
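As a hedged sketch of that idea: the Internet Explorer object model exposes a children collection on each element, so the second cell can be read directly and only its first line kept. Both function names below (FirstLine and ValueFromRow) are hypothetical. The sketch assumes that row is the matching <tr> element from the loop above, and that innerText renders each <br /> as a line break.

```vba
'Hypothetical helper: returns the first line of a multi-line string.
Function FirstLine(ByVal cellText As String) As String
    'normalise Windows line breaks so a single delimiter can be used
    cellText = Replace(cellText, vbCrLf, vbLf)
    If Len(cellText) = 0 Then Exit Function
    FirstLine = Trim$(Split(cellText, vbLf)(0))
End Function

'Sketch only: row is assumed to be the <tr> element found in the loop.
Function ValueFromRow(ByVal row As Object) As String
    'the children collection is zero-based, so index 1 is the second <td>
    If row.children.Length < 2 Then Exit Function
    ValueFromRow = FirstLine(row.children.Item(1).innerText)
End Function
```

With this in place, the body of the If block could shrink to MsgBox ValueFromRow(e), with no string searching on the raw HTML.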