使用 VBA excel 多个类名从网站中提取数据
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/18709911/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
pull data from website using VBA excel multiple classname
提问by gepex
I know this has been asked many times, but haven't seen a clear answer for looping thru a div and findind tags with the same classname.
我知道这已经被问过很多次了,但是还没有看到通过具有相同类名的 div 和 findind 标签循环的明确答案。
My first question:
我的第一个问题:
If I have something like this:
如果我有这样的事情:
<div id="carousel">
<div id="images">
<div class="imageElement">
<img src="img/image1.jpg">
</div>
<div class="imageElement">
<img src="img/image2.jpg">
</div>
<div class="imageElement">
<img src="img/image3.jpg">
</div>
</div>
</div>
So I want to get all the img Src in the div "images" along with other stuff in the imageElement classnames and copy them to some cells in excel.
因此,我想获取 div“图像”中的所有 img Src 以及 imageElement 类名中的其他内容,并将它们复制到 excel 中的某些单元格中。
Second question:I've seen two ways in pulling web content with VBA, one using IE and another code using something but a browser.
第二个问题:我见过两种使用 VBA 提取网页内容的方法,一种使用 IE,另一种使用浏览器以外的代码。
Private Sub pullData_Click()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "http://website.html", False
.send
htm.body.innerHTML = .responsetext
End With
End Sub
And second way:
第二种方式:
Set ie = New InternetExplorer
With ie
.navigate "http://eoddata.com/stockquote/NASDAQ/AAPL.htm"
.Visible = False
While .Busy Or .readyState <> READYSTATE_COMPLETE
DoEvents
Wend
Set objHTML = .document
DoEvents
End With
Set elementONE = objHTML.getElementsByTagName("TD")
For i = 1 To elementONE.Length
elementTWO = elementONE.Item(i).innerText
If elementTWO = "08/10/12" Then
MsgBox (elementONE.Item(i + 1).innerText)
Exit For
End If
Next i
DoEvents
ie.Quit
DoEvents
Set ie = Nothing
Which one is better and why?
哪个更好,为什么?
So if you can help me I'd appreciate.
所以如果你能帮助我,我会很感激。
Thank you in advance.
先感谢您。
回答by SWa
Your first option is usually preferable since it is much faster than the second method, it sends a request directly to the web server and returns the response. This is much more efficient than automating Internet Explorer (the second option); automating IE is very slow, since you are effectively just browsing the site - it will inevitably result in more downloads as it must load all the resources in the page - images, scripts, css files etc. It will also run any Javascript on the page - all of this is usually not useful and you have to wait for it to finish before parsing the page.
您的第一个选项通常更可取,因为它比第二种方法快得多,它将请求直接发送到 Web 服务器并返回响应。这比自动化 Internet Explorer(第二个选项)要高效得多;自动化 IE 非常慢,因为您实际上只是在浏览站点 - 这将不可避免地导致更多下载,因为它必须加载页面中的所有资源 - 图像、脚本、css 文件等。它还将在页面上运行任何 Javascript - 所有这些通常都没有用,您必须等待它完成才能解析页面。
This however is a bit of a double edged sword - whilst much slower, if you are not familiar with html requests, automating Internet Explorer is substantially easier than the first method, especially when elements are generated dynamically or the page has a reliance on AJAX. It is also easier to automate IE when you need to access data in a site that requires you to log in since it will handle the relevant cookies for you. This is not to say that web scraping cannot be done with the first method, rather than it requires a deeper understanding of web technologies and the architecture of the site.
然而,这有点像一把双刃剑——虽然慢得多,但如果您不熟悉 html 请求,那么自动化 Internet Explorer 比第一种方法要容易得多,尤其是当元素是动态生成的或者页面依赖于 AJAX 时。当您需要访问需要登录的站点中的数据时,自动化 IE 也更容易,因为它会为您处理相关的 cookie。这并不是说第一种方法不能完成网络抓取,而是需要对网络技术和网站架构有更深入的了解。
A better option to the first method would be to use a different object to handle the request and response, using the WinHTTP library offers more resilience than the MSXML library and will generally handle any cookies automatically as well.
第一种方法的更好选择是使用不同的对象来处理请求和响应,使用 WinHTTP 库比 MSXML 库提供更大的弹性,并且通常也会自动处理任何 cookie。
As for parsing the data, in your first approach you have used late binding to create the HTML Object (htmlfile), whilst this reduces the need for a reference, it also reduces functionality. For example, when using late binding, you are missing out on the features added if the user has IE9 installed, specifically in this case the getElementsByClass name function.
至于解析数据,在您的第一种方法中,您使用后期绑定来创建 HTML 对象 (htmlfile),虽然这减少了对引用的需求,但也减少了功能。例如,当使用后期绑定时,如果用户安装了 IE9,您就会错过添加的功能,特别是在这种情况下 getElementsByClass 名称函数。
As such a third option (and my preferred method):
作为第三种选择(也是我的首选方法):
Dim oHtml As HTMLDocument
Dim oElement As Object
Set oHtml = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "http://www.someurl.com", False
.send
oHtml.body.innerHTML = .responseText
End With
For Each oElement In oHtml.getElementsByClassName("imageElement")
Debug.Print oElement.Children(0).src
Next oElement
'IE 8 alternative
'For Each oElement In oHtml.getElementsByTagName("div")
' If oElement.className = "imageElement" Then
' Debug.Print oElement.Children(0).src
' End If
'Next oElement
This will require a reference setting to the Microsoft HTML Object Library
- it will fail if the user does not have IE9 installed, but this can be handled and is becoming increasingly less relevant
这将需要一个参考设置Microsoft HTML Object Library
- 如果用户没有安装 IE9,它将失败,但这可以处理并且变得越来越不相关
回答by JS20'07'11
To print elements to cells replace:
要将元素打印到单元格替换:
For Each oElement In oHtml.getElementsByClassName("imageElement")
Debug.Print oElement.Children(0).src
Next oElement
With:
和:
Dim wsTarget as Worksheet
dim i as Integer
i=1
set wsTarget=activeworkbook.worksheets("SomeSheet")
For Each oElement In oHtml.getElementsByClassName("imageElement")
wstarget.range("A" & i)=oElement.Children(0).src
i=i+1
Next
'Corrected the syntax error on For
'更正了 For 的语法错误
回答by QHarr
CSS selector:
CSS 选择器:
You could also use a CSS selectorof #images img[src^='img/']
.
你也可以使用一个CSS选择器的#images img[src^='img/']
。
This says elements with id of images
that contain tagname img
with attribute src
having value starting with 'img/'
.
这表示具有 id 的元素images
包含标记名img
,其属性src
值以'img/'
.
The #
is for id; []
for attribute; ^
for starts with; #images img
, img
within images
.
的#
是ID; []
对于属性;^
开始于; #images img
,img
内images
。
CSS query:
CSS查询:
As more than one element will be matched you would use the .querySelectorAll
method of document
and then loop the length of the returned nodeList
.
由于将匹配多个元素,您将使用 的.querySelectorAll
方法,document
然后循环返回的nodeList
.
VBA Code:
VBA 代码:
Option Explicit
Public Sub test()
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "http://www.someurl.com", False
.send
html.body.innerHTML = .responseText
End With
Dim aNodeList As Object, iItem As Long
Set aNodeList = html.querySelectorAll("#images img[src^='img/']")
With ActiveSheet
For iItem = 0 To aNodeList.Length - 1
.Cells(iItem + 1, 1) = aNodeList.item(iItem).innerText
'.Cells(iItem + 1, 1) = aNodeList(iItem).innerText '<== or potentially this syntax
Next iItem
End With
End Sub