Original source: http://stackoverflow.com/questions/27066963/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Scraping data from a website using VBA
Asked by rockNroll
I'm trying to scrape data, such as the real-time prices of the German 5 YR Bobl and the US 30Y T-Bond, from the website http://uk.investing.com/rates-bonds/financial-futures via VBA. I have tried an Excel web query, but it pulls in the whole page, whereas I only want the rate. Is there a way of doing this?
Answered by Matteo NNZ
There are several ways of doing this. I'm writing this answer in the hope that anyone searching for the keywords "scraping data from website" will find all the basics of Internet Explorer automation here. Remember, though, that nothing is worth as much as your own research (unless you want to stick to pre-written code that you're not able to customize).
Please note that this is only one way, and not the one I prefer in terms of performance (since it depends on the browser's speed), but it is a good way to understand the rationale behind Internet automation.
1) If I need to browse the web, I need a browser! So I create an Internet Explorer browser:
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
2) I ask the browser to navigate to the target webpage. Through the ".Visible" property, I decide whether I want to see the browser doing its job. While building the code it is nice to have Visible = True, but once the code is working and scraping data it is nicer not to see the browser every time, so Visible = False.
With appIE
    .Navigate "http://uk.investing.com/rates-bonds/financial-futures"
    .Visible = True
End With
3) The webpage will need some time to load. So, I will wait while it's busy...
Do While appIE.Busy
    DoEvents
Loop
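The Busy flag alone can drop to False before the document has finished loading, so a slightly more defensive wait also checks ReadyState and gives up after a timeout. The following is only a minimal sketch and not part of the original answer: the helper name WaitForPage and the 30-second default are assumptions chosen for illustration.

```vba
' Hypothetical helper: waits until the browser reports the page as loaded,
' or until timeoutSeconds have elapsed. Returns True if the page loaded in time.
Public Function WaitForPage(ByVal ie As Object, Optional ByVal timeoutSeconds As Double = 30) As Boolean
    Const READYSTATE_COMPLETE As Long = 4
    Dim startTime As Double
    startTime = Timer

    Do While ie.Busy Or ie.ReadyState <> READYSTATE_COMPLETE
        DoEvents
        ' Timer resets at midnight; this simple check ignores that edge case
        If Timer - startTime > timeoutSeconds Then Exit Function
    Loop

    WaitForPage = True
End Function
```

It could be called as `If Not WaitForPage(appIE) Then MsgBox "Page did not load in time"` in place of the plain loop.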
4) Well, now the page is loaded. Let's say that I want to scrape the change of the US30Y T-Bond. I just press F12 in Internet Explorer to see the webpage's code and then, using the element-selection pointer, click on the element that I want to scrape to see how I can reach it.
5) What I should do is straightforward. First of all, I use the ID property to get the tr element that contains the value:
Set allRowOfData = appIE.document.getElementById("pair_8907")
Here I get a collection of td elements (specifically, tr is a row of data, and the td elements are its cells). We are looking for the 8th, so I write:
Dim myValue As String: myValue = allRowOfData.Cells(7).innerHTML
Why did I write 7 instead of 8? Because the collection of cells starts from 0, so the index of the 8th element is 7 (8-1). Briefly analysing this line of code:
.Cells() gives me access to the td elements; innerHTML is the property of the cell containing the value we are looking for.
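If it is not obvious which index holds the value, one option is to print every cell of the row together with its zero-based index and read the right one off the Immediate window. This is a small illustrative sketch; the name ListRowCells is hypothetical, and rowElement is assumed to be a tr element such as the one returned by getElementById above.

```vba
' Hypothetical helper: prints each cell of a table row with its zero-based index
' to the Immediate window (Ctrl+G in the VBA editor).
Public Sub ListRowCells(ByVal rowElement As Object)
    Dim i As Long
    For i = 0 To rowElement.Cells.Length - 1
        Debug.Print i, rowElement.Cells(i).innerText
    Next i
End Sub
```

Calling `ListRowCells allRowOfData` right after the getElementById line would list the candidates.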
Once we have our value, which is now stored in the myValue variable, we can close the IE browser and release the memory by setting the object to Nothing:
appIE.Quit
Set appIE = Nothing
Well, now you have your value and you can do whatever you want with it: put it into a cell (Range("A1").Value = myValue), or into a label of a form (Me.Label1.Caption = myValue).
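Put together, the steps above might look like the routine below. It is a sketch under several assumptions: Internet Explorer is still available on the machine, the row ID pair_8907 and cell index 7 are taken from the answer and may no longer match the live page, and the routine name ScrapeBondChange and the Nothing check are additions for illustration.

```vba
Public Sub ScrapeBondChange()
    Dim appIE As Object
    Dim rowOfData As Object
    Dim myValue As String

    ' Steps 1 and 2: create the browser and navigate without showing the window
    Set appIE = CreateObject("InternetExplorer.Application")
    appIE.Visible = False
    appIE.Navigate "http://uk.investing.com/rates-bonds/financial-futures"

    ' Step 3: wait for the page to load (ReadyState 4 = complete)
    Do While appIE.Busy Or appIE.ReadyState <> 4
        DoEvents
    Loop

    ' Step 5: fetch the row by ID and read its 8th cell (index 7)
    Set rowOfData = appIE.document.getElementById("pair_8907")
    If rowOfData Is Nothing Then
        MsgBox "Row 'pair_8907' was not found; the page layout may have changed."
    Else
        myValue = rowOfData.Cells(7).innerHTML
        ActiveSheet.Range("A1").Value = myValue
    End If

    ' Close the browser and release the object
    appIE.Quit
    Set appIE = Nothing
End Sub
```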
I'd just like to point out that this is not how StackOverflow works: here you post questions about specific coding problems, but you should do your own research first. The only reason I'm answering a question that doesn't show much research effort is that I have seen it asked several times, and I remember that, back when I was learning how to do this, I would have liked some better support to get started. So I hope that this answer, which is just a "study input" and not at all the best or most complete solution, can help the next user with the same problem. I learned how to program thanks to this community, and I like to think that you and other beginners might use my input to discover the beautiful world of programming.
Enjoy your practice ;)
Answered by QHarr
Other methods were mentioned, so let us please acknowledge that, at the time of writing, we are in the 21st century. Let's park the local bus (opening a browser) and fly with an XMLHTTP GET request (XHR GET for short).
XHR is an API in the form of an object whose methods transfer data between a web browser and a web server. The object is provided by the browser's JavaScript environment.
It's a fast method for retrieving data that doesn't require opening a browser. The server response can be read into an HTMLDocument, and the process of grabbing the table continues from there.
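As a minimal illustration of that idea, the sketch below wraps the request in a function that checks the HTTP status before the response is parsed. The names FetchHtml and DemoFetch are hypothetical, and the late-bound "htmlfile" document is used here only so that the example runs without setting a library reference; a site may still refuse or alter responses to non-browser requests.

```vba
' Hypothetical helper: returns the response body of a GET request,
' or an empty string if the server does not answer with HTTP 200.
Public Function FetchHtml(ByVal url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False          ' False = synchronous request
        .send
        If .Status = 200 Then FetchHtml = .responseText
    End With
End Function

Public Sub DemoFetch()
    Dim html As Object
    Dim pageSource As String

    pageSource = FetchHtml("https://uk.investing.com/rates-bonds/financial-futures")
    If Len(pageSource) = 0 Then
        Debug.Print "Request failed or returned no content"
        Exit Sub
    End If

    ' Late-bound HTML document; no reference to the HTML Object Library is needed
    Set html = CreateObject("htmlfile")
    html.body.innerHTML = pageSource
    Debug.Print html.getElementsByTagName("table").Length & " table(s) found"
End Sub
```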
In the code below, the table is grabbed by its id, cr1.
In the helper sub, WriteTable, we first loop over the column headers (th tags), then over the table rows (tr tags), and finally traverse each table row cell by cell (td tags). As we only want data from two of the columns (the code picks positions 2 and 8), a Select Case statement is used to specify what is written out to the sheet.
VBA:
Option Explicit

Public Sub GetRates()
    Dim html As HTMLDocument, hTable As HTMLTable '<== Tools > References > Microsoft HTML Object Library
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://uk.investing.com/rates-bonds/financial-futures", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
        .send
        html.body.innerHTML = .responseText
    End With
    Application.ScreenUpdating = False
    Set hTable = html.getElementById("cr1")
    WriteTable hTable, 1, ThisWorkbook.Worksheets("Sheet1")
    Application.ScreenUpdating = True
End Sub

Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
    Dim tSection As Object, tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, C As Long, tBody As Object
    r = startRow: If ws Is Nothing Then Set ws = ActiveSheet
    With ws
        'Write the two wanted header captions into the first output row
        Dim headers As Object, header As Object, columnCounter As Long
        Set headers = hTable.getElementsByTagName("th")
        For Each header In headers
            columnCounter = columnCounter + 1
            Select Case columnCounter
            Case 2
                .Cells(startRow, 1) = header.innerText
            Case 8
                .Cells(startRow, 2) = header.innerText
            End Select
        Next header
        startRow = startRow + 1
        'Walk every row of every tbody and copy cells 2 and 8 to the sheet
        Set tBody = hTable.getElementsByTagName("tbody")
        For Each tSection In tBody
            Set tRow = tSection.getElementsByTagName("tr")
            For Each tr In tRow
                r = r + 1
                Set tCell = tr.getElementsByTagName("td")
                C = 1
                For Each td In tCell
                    Select Case C
                    Case 2
                        .Cells(r, 1).Value = td.innerText
                    Case 8
                        .Cells(r, 2).Value = td.innerText
                    End Select
                    C = C + 1
                Next td
            Next tr
        Next tSection
    End With
End Sub
Answered by Nitin k tomar
You can use the WinHttpRequest object instead of Internet Explorer. It loads only the page data, without pictures and advertisements, whereas Internet Explorer downloads the full webpage including them, which makes the Internet Explorer object heavy compared with the WinHttpRequest object.
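The answer gives no code, so here is a rough sketch of what a WinHttpRequest call could look like. The function name FetchWithWinHttp and the User-Agent value are assumptions for illustration only, and some sites may still block or alter responses to non-browser clients.

```vba
' Hypothetical helper: fetches a page with WinHttpRequest and returns its HTML,
' or an empty string if the server does not answer with HTTP 200.
Public Function FetchWithWinHttp(ByVal url As String) As String
    Dim request As Object
    Set request = CreateObject("WinHttp.WinHttpRequest.5.1")

    request.Open "GET", url, False       ' False = synchronous request
    ' Some sites reject requests without a browser-like User-Agent; this value is an arbitrary example
    request.SetRequestHeader "User-Agent", "Mozilla/5.0"
    request.Send

    If request.Status = 200 Then FetchWithWinHttp = request.ResponseText
End Function
```

The returned string can then be loaded into an HTML document object and queried in the same way as in the XMLHTTP answer.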
Answered by Roshantha De Mel
This question was asked a long time ago, but I thought the following information would be useful for newbies. You can easily get the values by class name, like this:
Sub ExtractLastValue()
    Dim objIE As Object
    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Top = 0
    objIE.Left = 0
    objIE.Width = 800
    objIE.Height = 600
    objIE.Visible = True
    objIE.Navigate "https://uk.investing.com/rates-bonds/financial-futures/"
    ' Wait until the page has finished loading (readyState 4 = complete)
    Do
        DoEvents
    Loop Until objIE.readystate = 4
    ' Show the text of the first element carrying the class "pid-8907-last"
    MsgBox objIE.document.getElementsByClassName("pid-8907-last")(0).innerText
End Sub
And if you are new to web scraping, please read this blog post.
There are also various techniques for extracting data from web pages. This article explains a few of them with examples.
Answered by Roshantha De Mel
I modified a few things that were raising errors for me and ended up with this, which worked great for extracting the data I needed:
Sub get_data_web()
    Dim appIE As Object
    Dim allRowofData As Object, itm As Object
    Dim i As Long
    Dim Count As Long
    Dim myValue As String

    Set appIE = CreateObject("internetexplorer.application")
    With appIE
        .navigate "https://finance.yahoo.com/quote/NQ%3DF/futures?p=NQ%3DF"
        .Visible = True
    End With
    Do While appIE.Busy
        DoEvents
    Loop

    ' Each matching element is a table row; copy its first five cells to the sheet
    Set allRowofData = appIE.document.getElementsByClassName("Ta(end) BdT Bdc($c-fuji-grey-c) H(36px)")
    Count = 1
    For Each itm In allRowofData
        For i = 0 To 4
            myValue = itm.Cells(i).innerText
            ActiveSheet.Cells(Count, i + 1).Value = myValue
        Next
        Count = Count + 1
    Next

    appIE.Quit
    Set appIE = Nothing
End Sub