C#: Running Scripts in HtmlAgilityPack

Notice: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/11393075/

Date: 2020-08-09 17:42:13  Source: igfitidea

Running Scripts in HtmlAgilityPack

Tags: c#, javascript, html-agility-pack

Asked by Aabela

I'm trying to scrape a particular webpage which works as follows.

First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.

If I get the page with HtmlAgilityPack, the script doesn't run, so what I get is essentially a mostly blank page.

Is there a way to force it to run a script, so I can get the data?

Accepted answer by Jamie Treworgy

You are getting exactly what the server returns, the same as a web browser does. The difference is that a web browser then runs the scripts. Html Agility Pack is only an HTML parser: it has no way to interpret the JavaScript or bind it to its internal representation of the document. To run the script you need a web browser.

The perfect answer to your problem would be a complete "headless" web browser, that is, something that combines an HTML parser, a JavaScript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser without the rendering part. At this time there isn't such a thing that works entirely within the .NET environment.
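
To make the "parser only" point concrete, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced, and the HTML string is a made-up stand-in for the real page: a `div` that is empty in the markup and is only filled in by a script.

```csharp
using System;
using HtmlAgilityPack;

class ParserOnlyDemo
{
    static void Main()
    {
        // Made-up page for illustration: the <div> is empty in the markup
        // and would only be populated once a browser runs the <script>.
        const string html = @"<html><body>
<div id='data'></div>
<script>document.getElementById('data').innerText = 'loaded by script';</script>
</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);   // parses the markup; nothing is executed

        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='data']");
        HtmlNode script = doc.DocumentNode.SelectSingleNode("//script");

        // The div is still empty because the script never ran.
        Console.WriteLine("div text: '" + div.InnerText + "'");

        // The script is present in the tree, but only as unexecuted text.
        Console.WriteLine("script source: " + script.InnerHtml);
    }
}
```

The script source is still reachable as plain text, so if the page's data is embedded in the script or fetched from a URL visible in it, reading it from there may be an alternative to running the script at all.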

Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need.
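
A rough sketch of that approach follows, assuming a Windows project that references System.Windows.Forms. The `RenderedPage` class and its `Fetch` method are hypothetical names chosen for illustration. The control needs an STA thread with a message loop, so the sketch runs it on a dedicated thread and returns the DOM as an HTML string once the top-level document has finished loading.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper, not part of any library.
static class RenderedPage
{
    // Loads the URL in an off-screen WebBrowser control and returns the
    // current DOM serialized as HTML, after the page's scripts have started.
    public static string Fetch(string url)
    {
        string html = null;

        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (sender, e) =>
                {
                    // DocumentCompleted also fires for frames; wait for the main document.
                    if (!e.Url.Equals(browser.Url)) return;

                    html = browser.Document.GetElementsByTagName("html")[0].OuterHtml;
                    Application.ExitThread();   // ends the message loop started below
                };

                browser.Navigate(url);
                Application.Run();              // pump messages until ExitThread is called
            }
        });

        // The WebBrowser control requires a single-threaded apartment.
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();

        return html;
    }
}
```

The returned string can then be passed to `HtmlDocument.LoadHtml` and queried with Html Agility Pack as usual. One caveat: `DocumentCompleted` can fire before a script's asynchronous requests have returned, so a page like the one in the question may need an extra wait (for example, polling until the expected element appears) before the HTML is captured.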

Also see my answer to a similar question, "Load a DOM and Execute javascript, server side, with .Net", which discusses the technology available in .NET to do this. Unfortunately, most of the pieces exist right now but either aren't quite there yet or haven't been integrated in the right way.

Answer by Måns Tånneryd

You can use Awesomium for this: http://www.awesomium.com/. It works fairly well, but it has no support for x64 and is not thread safe. I'm using it to scan some web sites 24x7; it runs fine for at least a couple of days in a row, but then it usually crashes.
