Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/260540/
How do you scrape AJAX pages?
Asked by xxxxxxx
Please advise how to scrape AJAX pages.
Answered by Brian R. Bondy
Overview:
All screen scraping starts with a manual review of the page you want to extract resources from. When dealing with AJAX you usually need to analyze a bit more than just the HTML.
When dealing with AJAX, this just means that the value you want is not in the initial HTML document that you requested; instead, JavaScript will be executed which asks the server for the extra information you want.
You can therefore usually just analyze the JavaScript, see which request it makes, and call that URL yourself from the start.
Example:
As an example, assume the page you want to scrape contains the following script:
<script type="text/javascript">
function ajaxFunction()
{
  var xmlHttp;
  try
  {
    // Firefox, Opera 8.0+, Safari
    xmlHttp = new XMLHttpRequest();
  }
  catch (e)
  {
    // Internet Explorer
    try
    {
      xmlHttp = new ActiveXObject("Msxml2.XMLHTTP");
    }
    catch (e)
    {
      try
      {
        xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
      }
      catch (e)
      {
        alert("Your browser does not support AJAX!");
        return false;
      }
    }
  }
  xmlHttp.onreadystatechange = function()
  {
    if (xmlHttp.readyState == 4)
    {
      document.myForm.time.value = xmlHttp.responseText;
    }
  };
  xmlHttp.open("GET", "time.asp", true);
  xmlHttp.send(null);
}
</script>
Then all you need to do is issue an HTTP request for time.asp on the same server instead. (Example from w3schools.)
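If you would rather locate that request programmatically, a rough heuristic is to search the script source for the open(...) call. A sketch of this idea (a regex, not a real JavaScript parser, so it will miss dynamically built URLs):

```javascript
// Heuristic: find the first XMLHttpRequest open() call in a script's source
// and capture the HTTP method and request URL.
function findAjaxUrl(scriptSource) {
  const match = scriptSource.match(
    /\.open\(\s*["'](GET|POST)["']\s*,\s*["']([^"']+)["']/i
  );
  return match ? { method: match[1].toUpperCase(), url: match[2] } : null;
}

// Applied to the script above, this yields { method: "GET", url: "time.asp" },
// which you can then request directly with your HTTP client of choice.
```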
Advanced scraping with C++:
For complex usage, and if you're using C++, you could also consider using SpiderMonkey, the Firefox JavaScript engine, to execute the JavaScript on a page.
Advanced scraping with Java:
For complex usage, and if you're using Java, you could also consider using Rhino, Mozilla's JavaScript engine for Java.
Advanced scraping with .NET:
For complex usage, and if you're using .NET, you could also consider using the Microsoft.Vsa assembly, which has since been replaced by ICodeCompiler/CodeDOM.
Answered by mattspain
In my opinion the simplest solution is to use CasperJS, a framework based on the WebKit headless browser PhantomJS.
The whole page is loaded, so it's very easy to scrape any AJAX-related data. You can check this basic tutorial to learn about Automating & Scraping with PhantomJS and CasperJS.
You can also take a look at this example code, which scrapes Google Suggest keywords:
/*global casper:true*/
var casper = require('casper').create();
var suggestions = [];
var word = casper.cli.get(0);

if (!word) {
    casper.echo('please provide a word').exit(1);
}

casper.start('http://www.google.com/', function() {
    this.sendKeys('input[name=q]', word);
});

casper.waitFor(function() {
    return this.fetchText('.gsq_a table span').indexOf(word) === 0;
}, function() {
    suggestions = this.evaluate(function() {
        var nodes = document.querySelectorAll('.gsq_a table span');
        return [].map.call(nodes, function(node) {
            return node.textContent;
        });
    });
});

casper.run(function() {
    this.echo(suggestions.join('\n')).exit();
});
Answered by sw.
The best way to scrape web pages that use AJAX, or JavaScript in general, is with a browser itself or a headless browser (a browser without a GUI). Currently PhantomJS is a well-promoted headless browser using WebKit. An alternative that I have used with success is HtmlUnit (in Java, or in .NET via IKVM), which is a simulated browser. Another known alternative is using a web automation tool like Selenium.
I wrote many articles about this subject, like web scraping Ajax and JavaScript sites and automated browserless OAuth authentication for Twitter. At the end of the first article there are a lot of extra resources that I have been compiling since 2011.
Answered by yxc
I think Brian R. Bondy's answer is useful when the source code is easy to read. I prefer an easier way: use tools like Wireshark or HttpAnalyzer to capture the packet and get the URL from the "Host" field and the "GET" request line.
For example, I captured a packet like the following:
GET /hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330 HTTP/1.1
Accept: */*
Referer: http://quote.hexun.com/stock/default.aspx
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Host: quote.tool.hexun.com
Connection: Keep-Alive
Then the URL is:
http://quote.tool.hexun.com/hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330
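That reconstruction can also be done programmatically from the captured raw request. A minimal sketch that joins the path from the request line with the Host header (assuming plain HTTP, as in the capture above):

```javascript
// Rebuild the full request URL from a captured raw HTTP request: the path
// comes from the request line and the host from the Host header.
function urlFromRawRequest(raw) {
  const lines = raw.split(/\r?\n/);
  const requestLine = lines[0].match(/^(?:GET|POST)\s+(\S+)/);
  const hostLine = lines.find((line) => /^Host:/i.test(line));
  if (!requestLine || !hostLine) return null;
  return 'http://' + hostLine.replace(/^Host:\s*/i, '').trim() + requestLine[1];
}
```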
Answered by wonderchook
That depends on the AJAX page. The first part of screen scraping is determining how the page works. Is there some sort of variable you can iterate through to request all the data from the page? Personally I've used Web Scraper Plus for a lot of screen-scraping-related tasks because it is cheap, it is not difficult to get started with, and non-programmers can get it working relatively quickly.
Side note: the site's Terms of Use are probably something you want to check before doing this. Depending on the site, iterating through everything may raise some flags.
Answered by Alex
As a low-cost solution you can also try SWExplorerAutomation (SWEA). The program creates an automation API for any web application developed with HTML, DHTML or AJAX.
Answered by hekimgil
Selenium WebDriver is a good solution: you program a browser and automate what needs to be done in it. Browsers (Chrome, Firefox, etc.) provide their own drivers that work with Selenium. Since it works as an automated REAL browser, the pages (including JavaScript and AJAX) get loaded just as they would for a human using that browser.
The downside is that it is slow (since you would most probably like to wait for all images and scripts to load before you do your scraping on that single page).
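The waiting itself usually comes down to polling a condition until it holds; Selenium's bindings offer explicit waits for this (e.g. WebDriverWait in Python/Java, driver.wait in JavaScript). A generic stand-alone version of the same idea might look like:

```javascript
// A generic explicit-wait helper: poll a predicate until it returns a truthy
// value or the timeout expires. Similar in spirit to Selenium's explicit waits.
function waitFor(predicate, { timeout = 5000, interval = 100 } = {}) {
  const deadline = Date.now() + timeout;
  return new Promise((resolve, reject) => {
    (function poll() {
      let result;
      try {
        result = predicate();
      } catch (e) {
        result = false; // treat predicate errors as "not ready yet"
      }
      if (result) {
        resolve(result);
      } else if (Date.now() > deadline) {
        reject(new Error('waitFor: condition not met within ' + timeout + ' ms'));
      } else {
        setTimeout(poll, interval);
      }
    })();
  });
}
```

With the selenium-webdriver package you would use the built-in driver.wait(condition, timeoutMs) instead; the sketch above just makes the polling explicit.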
Answered by Deepan Prabhu Babu
I previously linked to MIT's Solvent and EnvJS in my answers for scraping AJAX pages. Those projects no longer seem to be accessible.
Out of sheer necessity, I have invented another way to actually scrape AJAX pages, and it has worked for tough sites like findthecompany, which have ways of detecting headless JavaScript engines and showing no data.
The technique is to use Chrome extensions to do the scraping. Chrome extensions are the best place to scrape AJAX pages because they actually give us access to the JavaScript-modified DOM. The technique is as follows; I will certainly open-source the code at some point. Create a Chrome extension (assuming you know how to create one, and its architecture and capabilities; this is easy to learn and practice as there are lots of samples):
- Use content scripts to access the DOM via XPath. Get pretty much the entire list, table, or dynamically rendered content into a variable as string HTML nodes using XPath. (Only content scripts can access the DOM, but they can't contact a URL using XMLHttpRequest.)
- From the content script, using message passing, send the entire stripped DOM as a string to a background script. (Background scripts can talk to URLs but can't touch the DOM.) We use message passing to get these to talk.
- You can use various events to loop through web pages and pass each stripped HTML node's content to the background script.
- Now use the background script to talk to an external server (on localhost), a simple one created using Node.js or Python. Just send the entire HTML nodes as strings to the server, which persists the content posted to it into files, with appropriate variables to identify page numbers or URLs.
- Now you have scraped the AJAX content (HTML nodes as strings), but these are partial HTML nodes. You can now use your favorite XPath library to load them into memory and use XPath to scrape the information into tables or text.
Please comment if you can't understand it and I can write it better (first attempt). Also, I am trying to release sample code as soon as possible.

