A web crawler that can interpret JavaScript

Question

Asked by user320662

I want to write a web crawler that can interpret JavaScript. Basically, it's a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to the output in the Firebug HTML window. The best example is Kayak.com, where you cannot see the resulting DOM displayed in the browser when you 'view source', but you can save the resulting HTML through Firebug.

How would I go about doing this? What tools exist that would help me?

Answer 1

Answered by tokland

Ruby's Capybara is an integration testing library, but it can also be used to write stand-alone web crawlers. Because it uses backends such as Selenium or headless WebKit, it interprets JavaScript out of the box:

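require 'capybara/dsl'
require 'capybara-webkit'

include Capybara::DSL
# Use the headless WebKit driver so that page scripts are executed.
Capybara.current_driver = :webkit
Capybara.app_host = "http://www.google.com"
page.visit("/")
# Print the current DOM, including changes made by scripts so far.
puts(page.html)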

Answer 2

Answered by Jeff

I've been using HtmlUnit (Java). It was originally designed for unit testing web pages. Its JavaScript support isn't perfect, but it hasn't failed me in my limited usage. According to the site, it can run the following JS frameworks to a reasonable degree:

jQuery 1.2.6
MochiKit 1.4.1
GWT 2.0.0
Sarissa 0.9.9.3
MooTools 1.2.1
Prototype 1.6.0
Ext JS 2.2
Dojo 1.0.2
YUI 2.3.0

Answer 3

Answered by thomasrutter

You are more likely to have success in Java than in PHP. There is a pre-existing JavaScript interpreter for Java called Rhino. It's a reference implementation and is well documented.

Rhino is used in lots of existing Java apps to provide JavaScript scripting ability within the app. I have also heard of it being used to assist with automated testing in JavaScript.
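
To get a feel for embedded scripting without adding a library, the standard javax.script API can be tried first. This is only a minimal sketch, assuming the JDK in use exposes a JavaScript engine: older JDKs bundled one (Rhino-based in Java 6 and 7, Nashorn in Java 8 to 14), while newer JDKs return null from the lookup unless a JSR-223 engine is added to the classpath. The class name JsEval and the helper evalOrNull are illustrative names, not part of any library.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

final class JsEval {
    // Evaluates a script with whatever JavaScript engine the JDK exposes.
    // Returns null when no engine is registered (assumption: the caller
    // treats null as "no engine available").
    static Object evalOrNull(String script) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        if (engine == null) {
            return null;
        }
        try {
            return engine.eval(script);
        } catch (ScriptException e) {
            throw new IllegalArgumentException("Script failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        Object result = evalOrNull("1 + 2");
        System.out.println(result == null
                ? "No JavaScript engine available on this JDK"
                : "1 + 2 = " + result);
    }
}
```

Note that an engine used this way evaluates plain JavaScript only. There is no `document` or `window` object, so page scripts that touch the DOM will not run without a DOM implementation on top.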

I also know that Java includes code that can parse and render HTML, though someone who knows more about Java than me can probably advise more on that. I am not denying it would be very difficult to achieve something like this; you'd essentially be re-implementing a lot of what a browser does.
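
The HTML parsing mentioned above can be tried with the parser that ships in the JDK's Swing packages. What follows is a minimal sketch, assuming static markup only: this parser targets an old HTML dialect (HTML 3.2) and does not execute scripts, so it covers the parsing half of the problem, not the JavaScript half. The class LinkLister and the method extractLinks are illustrative names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class LinkLister {
    // Collects the href value of every <a> start tag in the given markup.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/flights\">Flights</a></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

Links that a page adds through JavaScript after loading would not appear in this output, which is the gap the browser-backed tools in the other answers are meant to close.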

Answer 4

Answered by RoToRa

You could use Mozilla's rendering engine Gecko:

https://developer.mozilla.org/en/Gecko

Answer 5

Answered by rollsappletree

Take a look here: http://snippets.scrapy.org/snippets/22/ It's a Python screen scraping and web crawling framework used with webdrivers that open a page, render everything you need, and give you the possibility to "capture" anything you want in the page via
