How to get dynamically generated HTML code using .NET's WebBrowser or mshtml.HTMLDocument?

Question

Asked by J Smith

Most of the answers I have read concerning this subject point to either the System.Windows.Forms.WebBrowser class or the COM interface mshtml.HTMLDocument from the Microsoft HTML Object Library assembly.

The WebBrowser class did not lead me anywhere. The following code fails to retrieve the HTML code as rendered by my web browser:

[STAThread]
public static void Main()
{
    WebBrowser wb = new WebBrowser();
    wb.Navigate("https://www.google.com/#q=where+am+i");

    wb.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)wb.Document.DomDocument;
        foreach (mshtml.IHTMLElement element in doc.all)
        {
            System.Diagnostics.Debug.WriteLine(element.outerHTML);
        }
    };
    Form f = new Form();
    f.Controls.Add(wb);
    Application.Run(f);
}

The above is just an example. I'm not really interested in finding a workaround for figuring out the name of the town where I am located. I simply need to understand how to retrieve that kind of dynamically generated data programmatically.

(Call new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"), save the resulting text somewhere, search for the name of the town where you are currently located, and let me know if you were able to find it.)

Yet when I access "https://www.google.com/#q=where+am+i" from my web browser (IE or Firefox), I see the name of my town written on the web page. In Firefox, if I right-click on the name of the town and select "Inspect Element (Q)", I clearly see the name of the town written in the HTML code, which looks quite different from the raw HTML returned by WebClient.

After I got tired of playing with System.Windows.Forms.WebBrowser, I decided to give mshtml.HTMLDocument a shot, only to end up with the same useless raw HTML:

public static void Main()
{
    mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)new mshtml.HTMLDocument();
    doc.write(new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i"));

    foreach (mshtml.IHTMLElement e in doc.all)
    {
        System.Diagnostics.Debug.WriteLine(e.outerHTML);
    }
}

I suppose there must be an elegant way to obtain this kind of information. Right now all I can think of is to add a WebBrowser control to a form, have it navigate to the URL in question, send the keys "CTRL+A", copy whatever happens to be displayed on the page to the clipboard, and attempt to parse it. That's a horrible solution, though.

Answer 1

Answered by noseratio

I'd like to contribute some code to Alexei's answer. A few points:

Strictly speaking, it may not always be possible to determine with 100% certainty when the page has finished rendering. Some pages are quite complex and use continuous AJAX updates. But we can get quite close by polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That's what LoadDynamicPage does below.
Some time-out logic has to be present on top of the above, in case the page rendering never ends (note CancellationTokenSource).
Async/await is a great tool for coding this, as it gives a linear code flow to our asynchronous polling logic, which greatly simplifies it.
It's important to enable HTML5 rendering using Browser Feature Control, as WebBrowser runs in IE7 emulation mode by default. That's what SetFeatureBrowserEmulation does below.
This is a WinForms app, but the concept can be easily converted into a console app.
This logic works well on the URL you've specifically mentioned: https://www.google.com/#q=where+am+i.

using Microsoft.Win32;
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace WbFetchPage
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            SetFeatureBrowserEmulation();
            InitializeComponent();
            this.Load += MainForm_Load;
        }

        // start the task
        async void MainForm_Load(object sender, EventArgs e)
        {
            try
            {
                var cts = new CancellationTokenSource(10000); // cancel in 10s
                var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token);
                // show only the first 1024 characters (or fewer if the page is shorter)
                MessageBox.Show(html.Substring(0, Math.Min(html.Length, 1024)) + "...");
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
        }

        // navigate and download 
        async Task<string> LoadDynamicPage(string url, CancellationToken token)
        {
            // navigate and await DocumentCompleted
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
                tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                this.webBrowser.DocumentCompleted += handler;
                try 
                {           
                    this.webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    this.webBrowser.DocumentCompleted -= handler;
                }
            }

            // get the root element
            var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];

            // poll the current HTML for changes asynchronously
            var html = documentElement.OuterHtml;
            while (true)
            {
                // wait asynchronously, this will throw if cancellation requested
                await Task.Delay(500, token); 

                // continue polling if the WebBrowser is still busy
                if (this.webBrowser.IsBusy)
                    continue; 

                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop

                html = htmlNow;
            }

            // consider the page fully rendered 
            token.ThrowIfCancellationRequested();
            return html;
        }

        // enable HTML5 (assuming we're running IE10+)
        // more info: https://stackoverflow.com/a/18333982/1768303
        static void SetFeatureBrowserEmulation()
        {
            if (LicenseManager.UsageMode != LicenseUsageMode.Runtime)
                return;
            var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
            Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
                appName, 10000, RegistryValueKind.DWord);
        }
    }
}
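
The polling step inside LoadDynamicPage does not depend on WebBrowser itself, so it can be pulled out into a small helper. The following is only a sketch of that idea: the StablePoller class and the WaitUntilStableAsync name are hypothetical, not part of the code above or of any library. It takes a snapshot function and a busy check as delegates and returns once two consecutive snapshots are equal while the busy check is false.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (illustration only): generic "poll until unchanged" logic.
public static class StablePoller
{
    // Calls takeSnapshot every `interval` and returns the snapshot once it equals
    // the previous one while isBusy() is false. Task.Delay throws
    // TaskCanceledException if the token is cancelled, which ends the loop.
    public static async Task<string> WaitUntilStableAsync(
        Func<string> takeSnapshot,
        Func<bool> isBusy,
        TimeSpan interval,
        CancellationToken token)
    {
        var previous = takeSnapshot();
        while (true)
        {
            // No ConfigureAwait(false): when called from the UI thread, the loop
            // resumes on the UI thread, where WebBrowser must be accessed.
            await Task.Delay(interval, token);

            if (isBusy())
                continue; // still loading, keep polling

            var current = takeSnapshot();
            if (current == previous)
                return current; // no change since the last poll

            previous = current;
        }
    }
}
```

Inside LoadDynamicPage, the while loop could then be replaced by a call such as StablePoller.WaitUntilStableAsync(() => documentElement.OuterHtml, () => this.webBrowser.IsBusy, TimeSpan.FromMilliseconds(500), token). As with the original loop, "unchanged for one interval" is a heuristic, not proof that the page is done.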

Answer 2

Answered by Alexei Levenkov

Your web-browser code looks reasonable: wait for something, then grab the current content. Unfortunately, there is no official "I'm done executing JavaScript, feel free to steal content" notification from either the browser or JavaScript.

Some sort of active wait (not Sleep but Timer) may be necessary, and it will be page-specific. Even if you use a headless browser (e.g. PhantomJS), you'll have the same issue.
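
As a rough illustration of a Timer-based wait, here is a minimal sketch using System.Windows.Forms.Timer, which ticks on the UI thread and so does not block the message loop the way Sleep would. The TimerWait class, the WhenHtmlSettles name, and the "same HTML on two consecutive ticks" criterion are assumptions made for this example; a real page may need a different, page-specific check.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical helper (illustration only). Must be called on the UI thread
// of an application that is running a message loop.
public static class TimerWait
{
    // Invokes onStable(html) once the <html> element's OuterHtml is identical on
    // two consecutive ticks while the browser is not busy; invokes onTimeout()
    // if that has not happened after maxTicks ticks.
    public static void WhenHtmlSettles(
        WebBrowser browser, int intervalMs, int maxTicks,
        Action<string> onStable, Action onTimeout)
    {
        string last = null;
        int ticks = 0;
        var timer = new Timer { Interval = intervalMs };

        timer.Tick += (s, e) =>
        {
            ticks++;

            // Take a snapshot only when the browser is idle and has a document.
            string now = null;
            if (!browser.IsBusy && browser.Document != null)
            {
                var roots = browser.Document.GetElementsByTagName("html");
                if (roots.Count > 0)
                    now = roots[0].OuterHtml;
            }

            if (now != null && now == last)
            {
                timer.Stop();
                timer.Dispose();
                onStable(now);
            }
            else if (ticks >= maxTicks)
            {
                timer.Stop();
                timer.Dispose();
                onTimeout();
            }
            else
            {
                last = now; // remember this snapshot for the next tick
            }
        };

        timer.Start();
    }
}
```

It could be started from a DocumentCompleted handler, for example TimerWait.WhenHtmlSettles(wb, 500, 20, html => Debug.WriteLine(html), () => Debug.WriteLine("timed out")), where the interval and tick limit are arbitrary values chosen for illustration.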
