C# Running scripts in HtmlAgilityPack

Question

Asked by Aabela

I'm trying to scrape a particular webpage which works as follows.

First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.

If I get the page with HtmlAgilityPack, the script doesn't run, so what I get is essentially a mostly blank page.

Is there a way to force it to run a script, so I can get the data?

Answer 1

Accepted answer by Jamie Treworgy

You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the javascript or bind it to its internal representation of the document. If you wanted to run the script you would need a web browser. The perfect answer to your problem would be a complete "headless" web browser. That is something that incorporates an HTML parser, a javascript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part of it. At this time there isn't such a thing that works entirely within the .NET environment.
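A minimal sketch of what "parser only" means in practice, assuming the HtmlAgilityPack NuGet package is referenced. The inline page and the `data` id are made up for illustration; the point is that the script element is parsed as ordinary text and never executed, so the element it would have filled stays empty.

```csharp
using System;
using HtmlAgilityPack;

class ParserOnlyDemo
{
    static void Main()
    {
        // Hypothetical page: a browser would run the script and fill the div.
        const string html =
            "<html><body><div id=\"data\"></div>" +
            "<script>document.getElementById('data').innerText = 'loaded';</script>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The div is still empty: nothing executed the script.
        HtmlNode div = doc.GetElementbyId("data");
        Console.WriteLine("div is empty: " + (div.InnerText.Length == 0));

        // The script is present in the tree, but only as inert source text.
        HtmlNode script = doc.DocumentNode.SelectSingleNode("//script");
        Console.WriteLine("script source: " + script.InnerText);
    }
}
```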

Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
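A rough sketch of that approach, not a definitive implementation. It assumes a Windows project that references System.Windows.Forms and the HtmlAgilityPack package. `RenderedPage.GetHtml`, the `settleDelayMs` parameter, and the example URL are hypothetical names chosen for illustration. The fixed delay after the page completes is a crude assumption about how long the page's scripts need to fetch their data; polling for a specific element would be more reliable.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

static class RenderedPage
{
    // Hypothetical helper: loads url in a WebBrowser control, waits
    // settleDelayMs after the page completes so its scripts can populate
    // the DOM, then returns the live DOM serialized as HTML (or null).
    public static string GetHtml(string url, int settleDelayMs)
    {
        string html = null;

        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            using (var timer = new System.Windows.Forms.Timer { Interval = settleDelayMs })
            {
                timer.Tick += (s, e) =>
                {
                    timer.Stop();
                    if (browser.Document != null)
                    {
                        var roots = browser.Document.GetElementsByTagName("html");
                        if (roots.Count > 0) html = roots[0].OuterHtml;
                    }
                    Application.ExitThread(); // ends the message loop below
                };

                browser.DocumentCompleted += (s, e) =>
                {
                    // Fires once per frame; start the delay when the page is complete.
                    if (browser.ReadyState == WebBrowserReadyState.Complete)
                        timer.Start();
                };

                browser.Navigate(url);
                Application.Run(); // the control needs a running message loop
            }
        });

        thread.SetApartmentState(ApartmentState.STA); // required by the control
        thread.Start();
        thread.Join();
        return html;
    }
}

static class Program
{
    static void Main()
    {
        // Placeholder URL and delay; substitute the page being scraped.
        string html = RenderedPage.GetHtml("http://example.com/", 3000);

        // Hand the script-populated markup to HtmlAgilityPack as usual.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);
        Console.WriteLine(doc.DocumentNode.OuterHtml.Length);
    }
}
```

The sketch has no overall timeout, so a page that never finishes loading would block the caller; a production version would need one.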

Also see my answer to a similar question, "Load a DOM and Execute javascript, server side, with .Net", which discusses the technology available in .NET to do this. Unfortunately, most of the pieces exist right now but either aren't quite there yet or haven't been integrated in the right way.

Answer 2

Answer by Måns Tånneryd

You can use Awesomium for this, http://www.awesomium.com/. It works fairly well but has no support for x64 and is not thread safe. I'm using it to scan some web sites 24x7 and it's running fine for at least a couple of days in a row but then it usually crashes.
