Parsing HTML in VBA to extract information from a description list?

Question

Asked by MostlyHarmless

I want to extract information from a website with Excel XP.

I found some example code (http://www.wiseowl.co.uk/blog/s393/scrape-website-html.htm) and tried the following:

Function strHtmlElementValue(htmldoc As HTMLDocument, id As String) As String
    Dim HtmlElement As IHTMLElement
    Set HtmlElement = htmldoc.getElementById(id)
    strHtmlElementValue = id & ": " & HtmlElement.innerText
End Function

I tried it with the following URL (loaded as the htmldoc): http://www.immobilienscout24.de/expose/73940554

If I use the string "expose-title" for the id, the function returns the title of the page, which is fine.

But how can I access other information, such as the price?

In the HTML source, it looks like this. There is no ID, and if I try to pass the class name "is24qa-kaufpreis" to getElementById, I get an error message.

<dl>
  <dt>
    <strong class="is24qa-kaufpreis-label">
      Kaufpreis:
    </strong>
  </dt>
  <dd class="is24qa-kaufpreis">
    2.190.000,00 EUR
  </dd>
</dl>

So, is there a way to access fields like "is24qa-kaufpreis" directly and read out the inner text (in this case, 2.190.000,00 EUR)?

Answer 1

Accepted answer by ron

There are a number of different ways you could go about it. The following code shows two approaches based on "getElementsByTagName". If you can count, in the source code for the web page, which instance of the "dd" element holds the Kaufpreis, then you could use the first method. A more general approach is shown after it.

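Sub test()
    Dim my_url As String, k_price As String
    Dim html_doc As Object, xml_obj As Object
    Dim Results As Object, itm As Object

    my_url = "http://www.immobilienscout24.de/expose/73940554"
    Set html_doc = CreateObject("htmlfile")
    Set xml_obj = CreateObject("MSXML2.XMLHTTP")

    ' Fetch the page synchronously and load its markup into the document
    xml_obj.Open "GET", my_url, False
    xml_obj.send
    html_doc.body.innerHTML = xml_obj.responseText
    Set xml_obj = Nothing

    ' Method 1: take the "dd" element by position (0 = first instance)
    k_price = html_doc.body.getElementsByTagName("dd")(0).innerText

' Or

    ' Method 2: scan every "dd" element and keep the first one that mentions EUR
    Set Results = html_doc.body.getElementsByTagName("dd")
    For Each itm In Results
        If InStr(1, itm.outerHTML, "EUR", vbTextCompare) > 0 Then
            k_price = itm.innerText
            Exit For
        End If
    Next
End Sub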

Answer 2

Answered by Tim Williams

This worked for me with IE11, but it should work with IE9+.

Sub TestGEBCN()

Dim doc As New MSHTML.HTMLDocument, html, els

    html = "<dl><dt><strong class=""is24qa-kaufpreis-label"">Kaufpreis:" & _
           "</strong></dt><dd class=""is24qa-kaufpreis"">" & _
           "2.190.000,00 EUR</dd></dl>"

    doc.body.innerHTML = html

    Set els = doc.getElementsByClassName("is24qa-kaufpreis")

    Debug.Print els(0).innerText

End Sub

Answer 3

Answered by tony bd

Also remember that Excel can do its own web queries, via the Data - Import External Data - New Web Query menu (Alt + D, D, W). You would then refer to the imported data as sheet2!a22 or whatever cell it lands in. It is no good for a page that constantly changes its layout.
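
The same kind of web query can also be created from code. The following is only a sketch: it assumes the whole page should be imported into the active sheet starting at cell A1, and the Sub name AddWebQuery is made up for illustration.

```vba
Sub AddWebQuery()
    Dim qt As QueryTable

    ' Assumption: import into the active sheet, starting at A1
    Set qt = ActiveSheet.QueryTables.Add( _
        Connection:="URL;http://www.immobilienscout24.de/expose/73940554", _
        Destination:=ActiveSheet.Range("A1"))

    With qt
        .WebSelectionType = xlEntirePage   ' import the entire page, not selected tables
        .RefreshStyle = xlOverwriteCells   ' overwrite existing cells instead of inserting
        .Refresh BackgroundQuery:=False    ' wait until the import has finished
    End With
End Sub
```

The cell that ends up holding the price still has to be looked up by hand after the first import, which is why this breaks when the page layout changes.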

Answer 4

Answered by QHarr

CSS selector:

Use the .querySelector method of HTMLDocument to apply a CSS selector of dd[class='is24qa-kaufpreis'].

This says: get the first element with tag name dd that has a class attribute of 'is24qa-kaufpreis'. The "[]" denotes an attribute selector.

VBA:

htmldocument.querySelector("dd[class='is24qa-kaufpreis']").innerText

You need to obtain the HTMLDocument object first, but the other answers already show methods for this.
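
For completeness, here is a minimal sketch that puts the selector to work on the markup from the question. It assumes a reference to the Microsoft HTML Object Library (Tools > References) and a reasonably recent Internet Explorer installation; the Sub name TestQuerySelector is made up for illustration.

```vba
' Requires a reference to "Microsoft HTML Object Library"
Sub TestQuerySelector()
    Dim doc As New MSHTML.HTMLDocument
    Dim el As Object

    ' Sample markup copied from the question
    doc.body.innerHTML = "<dl><dt><strong class=""is24qa-kaufpreis-label"">Kaufpreis:" & _
                         "</strong></dt><dd class=""is24qa-kaufpreis"">" & _
                         "2.190.000,00 EUR</dd></dl>"

    ' querySelector returns the first matching element, or Nothing if none matches
    Set el = doc.querySelector("dd[class='is24qa-kaufpreis']")

    If el Is Nothing Then
        Debug.Print "No matching element found"
    Else
        Debug.Print Trim$(el.innerText)
    End If
End Sub
```

Checking for Nothing before reading innerText avoids the run-time error that the function in the question would raise when no element matches.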

Answer 5

Answered by StandardDeviation

Use

getElementsByTagName("strong")(0).InnerText

for the label (Kaufpreis:), and use

getElementsByTagName("dd")(0).InnerText

for the value (2.190.000,00 EUR).

(0) is the zero-based index of the element among those with the same tag name. The code can contain many elements with the same tag name; to retrieve them, use ("tag")(0), ("tag")(1), ..., ("tag")(n).
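
To find out which index a given value sits at, one option is to print every element with its position. This is a rough sketch, assuming a reference to the Microsoft HTML Object Library; the Sub name ListDdElements and the second dt/dd pair ("Zimmer") are made-up sample data, not taken from the real page.

```vba
' Requires a reference to "Microsoft HTML Object Library"
Sub ListDdElements()
    Dim doc As New MSHTML.HTMLDocument
    Dim els As Object
    Dim i As Long

    ' Hypothetical sample markup with two dt/dd pairs
    doc.body.innerHTML = "<dl><dt>Kaufpreis:</dt><dd>2.190.000,00 EUR</dd>" & _
                         "<dt>Zimmer:</dt><dd>5</dd></dl>"

    Set els = doc.getElementsByTagName("dd")

    ' Indices run from 0 to Length - 1
    For i = 0 To els.Length - 1
        Debug.Print i, els(i).innerText
    Next i
End Sub
```

Once the index of the wanted element is known, it can be hard-coded as in the snippets above, with the caveat that it shifts whenever the page adds or removes elements of that tag.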

I suggest researching the topics regarding child or sub elements for automation purposes.
