使用 VBA excel 多个类名从网站中提取数据

Question

提问by gepex

I know this has been asked many times, but haven't seen a clear answer for looping thru a div and findind tags with the same classname.

我知道这已经被问过很多次了，但是还没有看到通过具有相同类名的 div 和 findind 标签循环的明确答案。

My first question:

我的第一个问题：

If I have something like this:

如果我有这样的事情：

<div id="carousel">
   <div id="images">

       <div class="imageElement">
          <img src="img/image1.jpg">
       </div>

       <div class="imageElement">
          <img src="img/image2.jpg">
       </div>

       <div class="imageElement">
           <img src="img/image3.jpg">
       </div>

   </div>

</div>

So I want to get all the img Src in the div "images" along with other stuff in the imageElement classnames and copy them to some cells in excel.

因此，我想获取 div“图像”中的所有 img Src 以及 imageElement 类名中的其他内容，并将它们复制到 excel 中的某些单元格中。

Second question:I've seen two ways in pulling web content with VBA, one using IE and another code using something but a browser.

第二个问题：我见过两种使用 VBA 提取网页内容的方法，一种使用 IE，另一种使用浏览器以外的代码。

Private Sub pullData_Click()

    Dim x As Long, y As Long
    Dim htm As Object

    Set htm = CreateObject("htmlFile")

    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "http://website.html", False
        .send
        htm.body.innerHTML = .responsetext
    End With

End Sub

And second way:

第二种方式：

Set ie = New InternetExplorer
    With ie
        .navigate "http://eoddata.com/stockquote/NASDAQ/AAPL.htm"
        .Visible = False
        While .Busy Or .readyState <> READYSTATE_COMPLETE
           DoEvents
        Wend
        Set objHTML = .document
        DoEvents
    End With
    Set elementONE = objHTML.getElementsByTagName("TD")
    For i = 1 To elementONE.Length
        elementTWO = elementONE.Item(i).innerText           
        If elementTWO = "08/10/12" Then
            MsgBox (elementONE.Item(i + 1).innerText)
            Exit For
        End If
    Next i
    DoEvents
    ie.Quit
    DoEvents
    Set ie = Nothing

Which one is better and why?

哪个更好，为什么？

So if you can help me I'd appreciate.

所以如果你能帮助我，我会很感激。

Thank you in advance.

先感谢您。

Answer 1

回答by SWa

Your first option is usually preferable since it is much faster than the second method, it sends a request directly to the web server and returns the response. This is much more efficient than automating Internet Explorer (the second option); automating IE is very slow, since you are effectively just browsing the site - it will inevitably result in more downloads as it must load all the resources in the page - images, scripts, css files etc. It will also run any Javascript on the page - all of this is usually not useful and you have to wait for it to finish before parsing the page.

您的第一个选项通常更可取，因为它比第二种方法快得多，它将请求直接发送到 Web 服务器并返回响应。这比自动化 Internet Explorer（第二个选项）要高效得多；自动化 IE 非常慢，因为您实际上只是在浏览站点 - 这将不可避免地导致更多下载，因为它必须加载页面中的所有资源 - 图像、脚本、css 文件等。它还将在页面上运行任何 Javascript - 所有这些通常都没有用，您必须等待它完成才能解析页面。

This however is a bit of a double edged sword - whilst much slower, if you are not familiar with html requests, automating Internet Explorer is substantially easier than the first method, especially when elements are generated dynamically or the page has a reliance on AJAX. It is also easier to automate IE when you need to access data in a site that requires you to log in since it will handle the relevant cookies for you. This is not to say that web scraping cannot be done with the first method, rather than it requires a deeper understanding of web technologies and the architecture of the site.

然而，这有点像一把双刃剑——虽然慢得多，但如果您不熟悉 html 请求，那么自动化 Internet Explorer 比第一种方法要容易得多，尤其是当元素是动态生成的或者页面依赖于 AJAX 时。当您需要访问需要登录的站点中的数据时，自动化 IE 也更容易，因为它会为您处理相关的 cookie。这并不是说第一种方法不能完成网络抓取，而是需要对网络技术和网站架构有更深入的了解。

A better option to the first method would be to use a different object to handle the request and response, using the WinHTTP library offers more resilience than the MSXML library and will generally handle any cookies automatically as well.

第一种方法的更好选择是使用不同的对象来处理请求和响应，使用 WinHTTP 库比 MSXML 库提供更大的弹性，并且通常也会自动处理任何 cookie。

As for parsing the data, in your first approach you have used late binding to create the HTML Object (htmlfile), whilst this reduces the need for a reference, it also reduces functionality. For example, when using late binding, you are missing out on the features added if the user has IE9 installed, specifically in this case the getElementsByClass name function.

至于解析数据，在您的第一种方法中，您使用后期绑定来创建 HTML 对象 (htmlfile)，虽然这减少了对引用的需求，但也减少了功能。例如，当使用后期绑定时，如果用户安装了 IE9，您就会错过添加的功能，特别是在这种情况下 getElementsByClass 名称函数。

As such a third option (and my preferred method):

作为第三种选择（也是我的首选方法）：

Dim oHtml       As HTMLDocument
Dim oElement    As Object

Set oHtml = New HTMLDocument


With CreateObject("WINHTTP.WinHTTPRequest.5.1")
    .Open "GET", "http://www.someurl.com", False
    .send
    oHtml.body.innerHTML = .responseText
End With

For Each oElement In oHtml.getElementsByClassName("imageElement")
    Debug.Print oElement.Children(0).src
Next oElement

'IE 8 alternative
'For Each oElement In oHtml.getElementsByTagName("div")
'    If oElement.className = "imageElement" Then
'        Debug.Print oElement.Children(0).src
'    End If
'Next oElement

This will require a reference setting to the Microsoft HTML Object Library- it will fail if the user does not have IE9 installed, but this can be handled and is becoming increasingly less relevant

这将需要一个参考设置Microsoft HTML Object Library- 如果用户没有安装 IE9，它将失败，但这可以处理并且变得越来越不相关

Answer 2

回答by JS20'07'11

To print elements to cells replace:

要将元素打印到单元格替换：

For Each oElement In oHtml.getElementsByClassName("imageElement")
    Debug.Print oElement.Children(0).src
Next oElement

With:

和：

Dim wsTarget as Worksheet
dim i as Integer
i=1
set wsTarget=activeworkbook.worksheets("SomeSheet")

For Each oElement In oHtml.getElementsByClassName("imageElement")
    wstarget.range("A" & i)=oElement.Children(0).src
    i=i+1
Next

'Corrected the syntax error on For

'更正了 For 的语法错误

Answer 3

回答by QHarr

CSS selector:

CSS 选择器：

You could also use a CSS selectorof #images img[src^='img/'].

你也可以使用一个CSS选择器的#images img[src^='img/']。

This says elements with id of imagesthat contain tagname imgwith attribute srchaving value starting with 'img/'.

这表示具有 id 的元素images包含标记名img，其属性src值以'img/'.

The #is for id; []for attribute; ^for starts with; #images img, imgwithin images.

的#是ID; []对于属性；^开始于; #images img，img内images。

CSS query:

CSS查询：

As more than one element will be matched you would use the .querySelectorAllmethod of documentand then loop the length of the returned nodeList.

由于将匹配多个元素，您将使用的.querySelectorAll方法，document然后循环返回的nodeList.

VBA Code:

VBA 代码：

Option Explicit
Public Sub test()
    Dim html As HTMLDocument
    Set html = New HTMLDocument

    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", "http://www.someurl.com", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim aNodeList As Object, iItem As Long
    Set aNodeList = html.querySelectorAll("#images img[src^='img/']")
    With ActiveSheet
        For iItem = 0 To aNodeList.Length - 1
            .Cells(iItem + 1, 1) = aNodeList.item(iItem).innerText
            '.Cells(iItem + 1, 1) = aNodeList(iItem).innerText '<== or potentially this syntax
        Next iItem
    End With
End Sub

使用 VBA excel 多个类名从网站中提取数据

提问by gepex

回答by SWa

回答by JS20'07'11

回答by QHarr

相关推荐

最近更新

标签

使用 VBA excel 多个类名从网站中提取数据

提问by gepex

回答by SWa

回答by JS20'07'11

回答by QHarr

相关推荐

string bash：如何在字符串中添加空格？

vba 如何使用vb/ssis获取每张excel的名称

string ResponseWriter.Write 和 io.WriteString 有什么区别？

如果在 VBA 中出现错误？

相关推荐

最近更新

标签