Read HTML File in VB.Net

Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/26130654/

Date: 2020-09-17 18:17:53  Source: igfitidea

Tags: vb.net, winforms, dom

Asked by rheitzman

I have some files that were displayed in a browser and that I then saved to local files using File, Save As... The page has some scripting and will not display properly in a WebBrowser control on a WinForm. The problem appears to be the scripts, as the control displays "script error" dialogs. I don't really need to view the file, just to retrieve a few elements by ID.

The first block of code below does load the file into a local object, but only the first 4096 bytes. (The same happens if I use a WebBrowser placed on the form.)

The second block doesn't complain, but GetElementById fails because the desired element lies beyond the first 4096 bytes.

    Dim web As New WebBrowser
    web.AllowWebBrowserDrop = False
    web.ScriptErrorsSuppressed = True
    web.Url = New Uri(sFile)

    Dim doc As HtmlDocument
    Dim elem As HtmlElement
    doc = web.Document
    elem = doc.GetElementById("userParts")

What am I doing wrong?

Is there a better approach for a VB.Net WinForm project for loading an HTML document from which I can read elements?

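One possible explanation, offered here as an assumption and not confirmed anywhere in this thread: setting web.Url starts navigation asynchronously, so web.Document may be read before the page has finished loading. A minimal sketch that waits for the DocumentCompleted event before querying the element could look like this (the class name HtmlFileLoader and the ElementTextReady event are hypothetical names, not part of the original code):

```vbnet
Imports System
Imports System.Windows.Forms

' Hypothetical helper; assumes it is created on the UI thread of a WinForms app.
Public Class HtmlFileLoader
    Private ReadOnly web As New WebBrowser()
    Private ReadOnly targetId As String

    ' Raised with the element's inner text (Nothing if the ID is not found).
    Public Event ElementTextReady(text As String)

    Public Sub New(elemId As String)
        targetId = elemId
        web.AllowWebBrowserDrop = False
        web.ScriptErrorsSuppressed = True
        AddHandler web.DocumentCompleted, AddressOf OnDocumentCompleted
    End Sub

    Public Sub Load(sFile As String)
        ' Navigation is asynchronous; the document is not ready when this returns.
        web.Navigate(New Uri(sFile))
    End Sub

    Private Sub OnDocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs)
        ' Note: this event can fire more than once for pages that contain frames.
        Dim elem As HtmlElement = web.Document.GetElementById(targetId)
        RaiseEvent ElementTextReady(If(elem Is Nothing, Nothing, elem.InnerText))
    End Sub
End Class
```

A form would create the loader, subscribe to ElementTextReady with AddHandler, and then call Load(sFile). Whether this removes the 4096-byte truncation described above is untested here.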


I just went with string functions for the simple task at hand:

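    ' Returns the text between the tag carrying elemID and the next "<".
    ' Relies on plain string searches, so it only suits simple, predictable markup.
    Function GetInnerTextByID(html As String, elemID As String) As String
        Try
            ' Skip everything before the literal "<body>" tag.
            Dim s As String = html.Substring(html.IndexOf("<body>"))
            ' Jump to the first occurrence of the ID, then past the end of that opening tag.
            s = s.Substring(s.IndexOf(elemID))
            s = s.Substring(s.IndexOf(">") + 1)
            ' Keep the text up to the next tag and strip line breaks.
            s = s.Substring(0, s.IndexOf("<"))
            s = s.Replace(vbCr, "").Replace(vbLf, "").Trim()
            Return s
        Catch ex As Exception
            ' A missing "<body>" tag or ID makes Substring throw, which lands here.
            Return ""
        End Try
    End Function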

I'd still be interested in a native VB.Net (non-ASP) approach, or in why the code in the original post only loads 4096 bytes.

Answered by Tim Schmelter

I would use HtmlAgilityPack instead.

You: "True - but overly complex for my simple task of extracting a few elements by ID."

It also has a document.GetElementbyId method, which is rather simple, and it has no strange issues with scripts or bytes. Just load the document from the web, a stream, a file, or a plain string.

For example (web):

    ' Requires Imports System.Net and Imports HtmlAgilityPack at the top of the file.
    Dim document As New HtmlAgilityPack.HtmlDocument
    Dim myHttpWebRequest = CType(WebRequest.Create("URL"), HttpWebRequest)
    myHttpWebRequest.UserAgent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
    ' Request the page once; True lets HtmlAgilityPack detect the encoding from byte order marks.
    Using res = CType(myHttpWebRequest.GetResponse(), HttpWebResponse)
        document.Load(res.GetResponseStream(), True)
    End Using

    Dim node As HtmlNode = document.GetElementbyId("userParts")

or from file:

    document.Load("Path")

or from a string (e.g. a whole web page in an HTML file read by File.ReadAllText):

    document.LoadHtml("HTML")
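
Putting the pieces together for the original task (reading a few elements by ID from a saved page), a minimal sketch could look like the following. It assumes the HtmlAgilityPack package is referenced by the project; the module name, the helper GetInnerTextById, and the file name saved_page.html are hypothetical and only serve as illustration.

```vbnet
Imports System
Imports System.IO
Imports HtmlAgilityPack

Module HtmlFileReader
    ' Hypothetical helper: returns the trimmed inner text of the element with the
    ' given ID, or an empty string when no such element exists.
    Function GetInnerTextById(html As String, elemId As String) As String
        Dim document As New HtmlAgilityPack.HtmlDocument()
        document.LoadHtml(html)

        Dim node As HtmlNode = document.GetElementbyId(elemId)
        If node Is Nothing Then Return ""

        Return node.InnerText.Trim()
    End Function

    Sub Main()
        ' Assumed file name; replace it with the path of the saved page.
        Dim html As String = File.ReadAllText("saved_page.html")
        Console.WriteLine(GetInnerTextById(html, "userParts"))
    End Sub
End Module
```

Unlike the string-function version earlier in the thread, this parses the markup and does not depend on a literal "<body>" tag or on where the ID text first appears.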