Original source: http://stackoverflow.com/questions/23303551/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Parse HTML in VBA to extract information from description list?
Asked by MostlyHarmless
I want to extract information from a website with Excel XP.
I found some example code (http://www.wiseowl.co.uk/blog/s393/scrape-website-html.htm) and tried the following:
Function strHtmlElementValue(htmldoc As HTMLDocument, id As String) As String
    Dim HtmlElement As IHTMLElement
    Set HtmlElement = htmldoc.getElementById(id)
    strHtmlElementValue = id & ": " & HtmlElement.innerText
End Function
I tried it with the following URL (loaded as the htmldoc): http://www.immobilienscout24.de/expose/73940554
If I use the string "expose-title" for the id, the function returns the title of the page, which is fine.
But how can I access information such as the price?
In the HTML source it looks like this. There is no ID, and if I try to pass the class name "is24qa-kaufpreis" to getElementById, I get an error message.
<dl>
<dt>
<strong class="is24qa-kaufpreis-label">
Kaufpreis:
</strong>
</dt>
<dd class="is24qa-kaufpreis">
2.190.000,00 EUR
</dd>
</dl>
So, is there a way to access fields like this "is24qa-kaufpreis" directly and read out the inner text (in this case, 2.190.000,00 EUR)?
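For context, getElementById looks elements up by their id attribute only, so passing a class name returns Nothing, and reading .innerText from Nothing raises a run-time error (most likely the error described above, typically error 91). A minimal sketch of a guarded variant follows. The name strHtmlElementValueSafe and the placeholder message are hypothetical, and the code assumes a reference to the Microsoft HTML Object Library, as the original function does.

```vba
' Sketch only: strHtmlElementValueSafe is a hypothetical name, not part of the original post.
' Requires a reference to "Microsoft HTML Object Library" (Tools > References).
Function strHtmlElementValueSafe(htmldoc As HTMLDocument, id As String) As String
    Dim HtmlElement As IHTMLElement
    Set HtmlElement = htmldoc.getElementById(id)
    If HtmlElement Is Nothing Then
        ' No element carries this id (for example, a class name was passed in)
        strHtmlElementValueSafe = id & ": (no element with this id)"
    Else
        strHtmlElementValueSafe = id & ": " & HtmlElement.innerText
    End If
End Function
```

This only avoids the crash; it does not find the price, which needs a lookup by class or tag name instead.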
Accepted answer by ron
There are a number of different ways you could go about it. The following code shows two approaches based on "getElementsByTagName". If you can count, in the source code of the web page, which occurrence of the "dd" tag holds the Kaufpreis, you can use the first method. A more general approach is shown after it.
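Sub test()
    Dim my_url As String, k_price As String
    Dim html_doc As Object, xml_obj As Object, Results As Object, itm As Object

    my_url = "http://www.immobilienscout24.de/expose/73940554"
    Set html_doc = CreateObject("htmlfile")
    Set xml_obj = CreateObject("MSXML2.XMLHTTP")

    ' Fetch the page synchronously and load its markup into the HTML document
    xml_obj.Open "GET", my_url, False
    xml_obj.send
    html_doc.body.innerHTML = xml_obj.responseText
    Set xml_obj = Nothing

    ' Method 1: take the <dd> at a known position (index 0 is the first one)
    k_price = html_doc.body.getElementsByTagName("dd")(0).innerText

    ' Or, method 2: take the first <dd> whose markup contains "EUR"
    Set Results = html_doc.body.getElementsByTagName("dd")
    For Each itm In Results
        If InStr(1, itm.outerHTML, "EUR", vbTextCompare) > 0 Then
            k_price = itm.innerText
            Exit For
        End If
    Next
End Sub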
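A variation on the second method, offered as a sketch rather than as part of the answer: instead of searching each dd for the text "EUR", compare its class attribute with the one from the question. The helper name DdTextByClass is hypothetical, and the demo uses an inline HTML string in place of the live page.

```vba
' Sketch only: DdTextByClass is a hypothetical helper name.
' Returns the text of the first <dd> whose class attribute equals class_name,
' or an empty string if there is none. The comparison is exact, so it will
' not match elements that carry several classes.
Function DdTextByClass(ByVal html_doc As Object, ByVal class_name As String) As String
    Dim Results As Object, itm As Object
    Set Results = html_doc.getElementsByTagName("dd")
    For Each itm In Results
        If itm.className = class_name Then
            DdTextByClass = Trim$(itm.innerText)
            Exit Function
        End If
    Next
End Function

Sub DemoDdTextByClass()
    Dim html_doc As Object
    Set html_doc = CreateObject("htmlfile")
    html_doc.body.innerHTML = "<dl><dt><strong>Kaufpreis:</strong></dt>" & _
        "<dd class=""is24qa-kaufpreis"">2.190.000,00 EUR</dd></dl>"
    Debug.Print DdTextByClass(html_doc, "is24qa-kaufpreis")
End Sub
```

Matching on the class name should keep working if another EUR amount appears earlier on the page, which the text search would pick up instead.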
Answer by Tim Williams
This worked for me with IE11 installed, and it should work with IE9 or later.
Sub TestGEBCN()
    ' Early binding: needs a reference to "Microsoft HTML Object Library"
    Dim doc As New MSHTML.HTMLDocument, html, els
    html = "<dl><dt><strong class=""is24qa-kaufpreis-label"">Kaufpreis:" & _
           "</strong></dt><dd class=""is24qa-kaufpreis"">" & _
           "2.190.000,00 EUR</dd></dl>"
    doc.body.innerHTML = html
    Set els = doc.getElementsByClassName("is24qa-kaufpreis")
    Debug.Print els(0).innerText
End Sub
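If the class is missing from the page, els(0) has nothing to return and the last line fails. A minimal sketch of a guarded lookup, assuming the same early-bound MSHTML.HTMLDocument as above; the helper name FirstTextByClass is hypothetical.

```vba
' Sketch only: FirstTextByClass is a hypothetical helper name.
' Requires a reference to "Microsoft HTML Object Library" (Tools > References).
Function FirstTextByClass(ByVal doc As MSHTML.HTMLDocument, ByVal class_name As String) As String
    Dim els As Object
    Set els = doc.getElementsByClassName(class_name)
    ' Length is 0 when no element carries the class, so skip the lookup in that case
    If els.Length > 0 Then FirstTextByClass = els(0).innerText
End Function

Sub DemoFirstTextByClass()
    Dim doc As New MSHTML.HTMLDocument
    doc.body.innerHTML = "<dl><dd class=""is24qa-kaufpreis"">2.190.000,00 EUR</dd></dl>"
    Debug.Print FirstTextByClass(doc, "is24qa-kaufpreis")
    Debug.Print "[" & FirstTextByClass(doc, "no-such-class") & "]"
End Sub
```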
Answer by tony bd
Also remember that Excel can do its own web queries, via the Data > Import External Data > New Web Query menu (Alt + D, D, W). You would then refer to the imported value as Sheet2!A22 or wherever it lands. This is no good for a page that constantly changes its layout.
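The same kind of web query can also be created from code through the worksheet's QueryTables collection. The following is a rough sketch, not something taken from the answer: the helper name AddWebQuery is hypothetical, and the choice to import the entire page into cell A1 is an assumption made for illustration.

```vba
' Sketch only: AddWebQuery is a hypothetical helper name.
' Creates (but does not yet refresh) a web query that imports the whole page,
' placing the result at cell A1 of the given sheet.
Function AddWebQuery(ByVal ws As Worksheet, ByVal page_url As String) As QueryTable
    Dim qt As QueryTable
    Set qt = ws.QueryTables.Add(Connection:="URL;" & page_url, _
                                Destination:=ws.Range("A1"))
    qt.WebSelectionType = xlEntirePage      ' import the entire page, not selected tables
    qt.WebFormatting = xlWebFormattingNone  ' plain values, no HTML formatting
    Set AddWebQuery = qt
End Function

Sub DemoWebQuery()
    Dim qt As QueryTable
    Set qt = AddWebQuery(ActiveSheet, "http://www.immobilienscout24.de/expose/73940554")
    qt.Refresh BackgroundQuery:=False       ' wait until the data has been fetched
End Sub
```

The imported cells land at and below A1 of the target sheet, so it is safer to run the demo on an empty sheet.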
Answer by QHarr
Use the .querySelector method of HTMLDocument to apply the CSS selector dd[class='is24qa-kaufpreis'].
This says: get the first element with tag name dd that has a class attribute equal to is24qa-kaufpreis. The "[]" denotes an attribute selector.
VBA:
htmldocument.querySelector("dd[class='is24qa-kaufpreis']").innerText
You need to obtain the HTMLDocument object, but the other answers already show methods for this.
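A self-contained sketch of the selector approach, using an inline HTML string so it can be run without fetching the page. It assumes an early-bound MSHTML.HTMLDocument (reference to the Microsoft HTML Object Library) and an installed Internet Explorer version recent enough to support querySelector. The helper name TextBySelector is hypothetical.

```vba
' Sketch only: TextBySelector is a hypothetical helper name.
' Requires a reference to "Microsoft HTML Object Library" (Tools > References).
Function TextBySelector(ByVal doc As MSHTML.HTMLDocument, ByVal css As String) As String
    Dim el As Object
    Set el = doc.querySelector(css)
    ' querySelector yields Nothing when no element matches the selector
    If Not el Is Nothing Then TextBySelector = el.innerText
End Function

Sub DemoTextBySelector()
    Dim doc As New MSHTML.HTMLDocument
    doc.body.innerHTML = "<dl><dt><strong class=""is24qa-kaufpreis-label"">Kaufpreis:" & _
        "</strong></dt><dd class=""is24qa-kaufpreis"">2.190.000,00 EUR</dd></dl>"
    Debug.Print TextBySelector(doc, "dd[class='is24qa-kaufpreis']")
End Sub
```

In standard CSS, the class selector dd.is24qa-kaufpreis is a looser alternative: it also matches elements that carry additional classes, whereas the attribute form requires the class attribute to equal the string exactly.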
Answer by StandardDeviation
Use
getElementsByTagName("strong")(0).innerText
for the "Kaufpreis:" label, and use
getElementsByTagName("dd")(0).innerText
for the value "2.190.000,00 EUR".
(0) is the zero-based index of the element among all elements with the same tag name. The page can contain many elements with the same tag; to retrieve them, use ("tag")(0), ("tag")(1), ..., ("tag")(n).
For automation purposes, I suggest reading up on how to work with child (nested) elements.
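To find out which index holds the value you want, it can help to list every element with a given tag together with its position. A minimal sketch, assuming a late-bound "htmlfile" document and an inline HTML string; the helper name TagTexts is hypothetical.

```vba
' Sketch only: TagTexts is a hypothetical helper name.
' Collects the innerText of every element with the given tag name, in document order.
Function TagTexts(ByVal doc As Object, ByVal tag_name As String) As Collection
    Dim els As Object, i As Long
    Set TagTexts = New Collection
    Set els = doc.getElementsByTagName(tag_name)
    For i = 0 To els.Length - 1        ' element indexes run from 0 to Length - 1
        TagTexts.Add els.Item(i).innerText
    Next i
End Function

Sub DemoTagTexts()
    Dim doc As Object, texts As Collection, i As Long
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = "<dl><dt>Kaufpreis:</dt><dd>2.190.000,00 EUR</dd>" & _
        "<dt>Zimmer:</dt><dd>5</dd></dl>"
    Set texts = TagTexts(doc, "dd")
    For i = 1 To texts.Count           ' a VBA Collection is 1-based
        Debug.Print i - 1, texts(i)    ' print the element index next to its text
    Next i
End Sub
```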