Reading an HTML file in VB.Net

Question

Asked by rheitzman

I have some files that were displayed in a browser, and I then used File, Save As... to save the text to a local file. The page has some scripting, and it will not display properly in a WebBrowser control on a WinForm. The problem appears to be the scripts, as the control displays "script error" dialogs. I don't really need to view the file, just to retrieve a few elements by ID.

The first block of code below does load the file into a local object, but only the first 4096 bytes. (Same happens if I use a WebBrowser resident on the form.)

The second block doesn't complain, but GetElementById fails because the desired element lies beyond the first 4096 bytes.

    Dim web As New WebBrowser
    web.AllowWebBrowserDrop = False
    web.ScriptErrorsSuppressed = True
    web.Url = New Uri(sFile)

    Dim doc As HtmlDocument
    Dim elem As HtmlElement
    doc = web.Document
    elem = doc.GetElementById("userParts")

What am I doing wrong?

Is there a better approach for a VB.Net WinForm project for loading an HTML document from which I can read elements?

I just went with string functions for the simple task at hand:

    Function GetInnerTextByID(html As String, elemID As String) As String
        Try
            Dim s As String = html.Substring(html.IndexOf("<body>"))
            s = s.Substring(s.IndexOf(elemID))
            s = s.Substring(s.IndexOf(">") + 1)
            s = s.Substring(0, s.IndexOf("<"))
            s = s.Replace(vbCr, "").Replace(vbLf, "").Trim
            Return s
        Catch ex As Exception
            Return ""
        End Try
    End Function

I'd still be interested in a native VB.Net (non-ASP) approach, or in why the code in the original post only loads 4096 bytes.
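
One possible explanation for the truncation, offered as an assumption rather than something confirmed in this thread: WebBrowser navigates asynchronously, so reading Document immediately after setting Url may see only the part of the page parsed so far. A minimal sketch of waiting for the DocumentCompleted event instead (the class name PartsReader and the handler name are hypothetical, and it has to run on a thread with a WinForms message loop):

```vbnet
Imports System
Imports System.Windows.Forms

Public Class PartsReader
    Private ReadOnly web As New WebBrowser()

    ' Starts navigation; the document is read later, in the event handler.
    Public Sub LoadFile(sFile As String)
        web.AllowWebBrowserDrop = False
        web.ScriptErrorsSuppressed = True
        AddHandler web.DocumentCompleted, AddressOf OnDocumentCompleted
        web.Navigate(New Uri(sFile))
    End Sub

    ' Raised when a document has finished loading.
    ' Note: it can be raised more than once for pages that contain frames.
    Private Sub OnDocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs)
        Dim elem As HtmlElement = web.Document.GetElementById("userParts")
        If elem IsNot Nothing Then
            Console.WriteLine(elem.InnerText)
        End If
    End Sub
End Class
```

Whether this removes the 4096-byte limit for these particular files is untested here; it only moves the read to a point where the control reports the page as loaded.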

Answer 1

Answered by Tim Schmelter

I would use HtmlAgilityPack instead.

You: "True - but overly complex for my simple task of extracting a few elements by ID."

It also has a document.GetElementbyId method, which is rather simple, and it has no strange issues with scripts or bytes. Just load the document from the web, a stream, a file, or a plain string.

For example (web):

    ' Requires: Imports System.Net
    Dim document As New HtmlAgilityPack.HtmlDocument
    Dim myHttpWebRequest = CType(WebRequest.Create("URL"), HttpWebRequest)
    myHttpWebRequest.UserAgent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
    ' Request the page once; True = detect the encoding from byte order marks
    Using res As HttpWebResponse = CType(myHttpWebRequest.GetResponse(), HttpWebResponse)
        document.Load(res.GetResponseStream(), True)
    End Using

    Dim node As HtmlAgilityPack.HtmlNode = document.GetElementbyId("userParts")

or from file:

    document.Load("Path")

or from a string (e.g. a whole web page from an HTML file, read with File.ReadAllText):

    document.LoadHtml("HTML")
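
Putting the string variant together for the saved-file case from the question, a minimal sketch might look like the following. It assumes the HtmlAgilityPack package is referenced by the project; the module name, the helper GetInnerTextById, and the path "saved-page.html" are placeholders for illustration, not part of the library or of the original answer.

```vbnet
Imports System
Imports System.IO

Module ReadLocalHtml
    ' Returns the trimmed inner text of the element with the given id,
    ' or an empty string when no such element exists.
    Function GetInnerTextById(path As String, elemId As String) As String
        Dim document As New HtmlAgilityPack.HtmlDocument()
        ' Read the whole file and parse it; scripts are not executed.
        document.LoadHtml(File.ReadAllText(path))

        Dim node As HtmlAgilityPack.HtmlNode = document.GetElementbyId(elemId)
        If node Is Nothing Then Return ""
        Return node.InnerText.Trim()
    End Function

    Sub Main()
        ' "saved-page.html" is a placeholder for the locally saved page.
        Console.WriteLine(GetInnerTextById("saved-page.html", "userParts"))
    End Sub
End Module
```

The type names are fully qualified so that, inside a WinForms project, they are not confused with System.Windows.Forms.HtmlDocument.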
