Web Crawling (Ajax/JavaScript enabled pages) using java
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/24365154/
Asked by Amar
I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem is that I was unable to crawl the content of the following site: http://www.sciencedirect.com/science/article/pii/S1568494612005741. I want to crawl the following information from that site (please take a look at the attached screenshot).
If you look at the attached screenshot, it shows three names (highlighted in red boxes). If you click one of the links, you will see a popup containing the whole information about that author. I want to crawl the information in that popup.
I am using the following code to crawl the content.
// Imports assume the crawler4j 3.x package layout used at the time of the question.
import org.apache.http.HttpStatus;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.fetcher.PageFetchResult;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.parser.ParseData;
import edu.uci.ics.crawler4j.parser.Parser;
import edu.uci.ics.crawler4j.url.WebURL;

public class WebContentDownloader {

    private Parser parser;
    private PageFetcher pageFetcher;

    public WebContentDownloader() {
        CrawlConfig config = new CrawlConfig();
        parser = new Parser(config);
        pageFetcher = new PageFetcher(config);
    }

    private Page download(String url) {
        WebURL curURL = new WebURL();
        curURL.setURL(url);
        PageFetchResult fetchResult = null;
        try {
            // Fetch the headers first; only download the body if the server answered 200 OK
            fetchResult = pageFetcher.fetchHeader(curURL);
            if (fetchResult.getStatusCode() == HttpStatus.SC_OK) {
                try {
                    Page page = new Page(curURL);
                    fetchResult.fetchContent(page);
                    if (parser.parse(page, curURL.getURL())) {
                        return page;
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        } finally {
            if (fetchResult != null) {
                fetchResult.discardContentIfNotConsumed();
            }
        }
        return null;
    }

    private String processUrl(String url) {
        System.out.println("Processing: " + url);
        Page page = download(url);
        if (page != null) {
            ParseData parseData = page.getParseData();
            if (parseData != null) {
                if (parseData instanceof HtmlParseData) {
                    // Return the raw HTML of the fetched (static) page
                    HtmlParseData htmlParseData = (HtmlParseData) parseData;
                    return htmlParseData.getHtml();
                }
            } else {
                System.out.println("Couldn't parse the content of the page.");
            }
        } else {
            System.out.println("Couldn't fetch the content of the page.");
        }
        return null;
    }

    public String getHtmlContent(String argUrl) {
        return this.processUrl(argUrl);
    }
}
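For completeness, a minimal driver for the class above might look like this (the Main class is just an illustrative sketch, not part of the original post; the URL is simply the article page from the question):

public class Main {
    public static void main(String[] args) {
        // Fetches the static HTML of the article page; dynamic popup content will not be included
        WebContentDownloader downloader = new WebContentDownloader();
        String html = downloader.getHtmlContent("http://www.sciencedirect.com/science/article/pii/S1568494612005741");
        System.out.println(html);
    }
}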
I was able to crawl the content from the aforementioned link/site, but it doesn't contain the information I marked in the red boxes. I think those are dynamic links.
- My question is: how can I crawl the content from the aforementioned link/website?
- How do I crawl the content of Ajax/JavaScript based websites?
Can anyone please help me with this?
Thanks & Regards, Amar
Accepted answer by Amar
Hi, I found a workaround with another library. I used the Selenium WebDriver (org.openqa.selenium.WebDriver) library to extract the dynamic content. Here is the sample code.
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class CollectUrls {
    private WebDriver driver;

    public CollectUrls() {
        this.driver = new FirefoxDriver();
        this.driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
    }

    protected void next(String url, List<String> argUrlsList) {
        // Load the page in a real browser so its JavaScript runs, then grab the rendered source
        this.driver.get(url);
        String htmlContent = this.driver.getPageSource();
    }
}
Here the "htmlContent" is the required one. Please let me know if you face any issues...???
这里“ htmlContent”是必需的。如果您遇到任何问题,请告诉我......?
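Building on that idea, once the page is rendered you can also drive the browser to open the author popup and read its text. The sketch below is only illustrative: the CSS selectors ("a.authorName", "div.authorPopup") are hypothetical placeholders, since the real ScienceDirect markup is not shown in the question.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class AuthorPopupReader {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        driver.get("http://www.sciencedirect.com/science/article/pii/S1568494612005741");

        // Click the first author link to trigger the JavaScript popup.
        // The selectors are made up for illustration; inspect the real page to find the right ones.
        WebElement authorLink = driver.findElement(By.cssSelector("a.authorName"));
        authorLink.click();

        // Read the text of the popup once it has been rendered
        WebElement popup = driver.findElement(By.cssSelector("div.authorPopup"));
        System.out.println(popup.getText());

        driver.quit();
    }
}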
Thanks, Amar
谢谢,阿马尔
Answered by Erwin
Simply put, Crawler4j is a static crawler, meaning it cannot execute the JavaScript on a page. So there is no way to get the content you want just by crawling the specific page you mentioned. Of course, there are some workarounds to get it working.
If it is just this page you want to crawl, you could use a connection debugger. Check out this question for some tools. Find out which URL the AJAX request calls, and crawl that URL directly.
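For example, once the connection debugger (or the browser's network tab) reveals the URL behind the popup request, you can fetch it with a plain HTTP call and skip the browser entirely. The endpoint below is invented purely for illustration; the real URL has to be read from the debugger.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class AjaxEndpointFetcher {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: replace it with whatever URL the connection debugger shows
        URL url = new URL("http://www.sciencedirect.com/some/ajax/endpoint?authorId=123");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Many AJAX endpoints check this header before answering
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // usually JSON or an HTML fragment
            }
        }
    }
}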
If you have various websites with dynamic content (JavaScript/AJAX), you should consider using a dynamic-content-enabled crawler, like Crawljax (also written in Java).
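A minimal Crawljax setup, assuming the Crawljax 3.x API (the limits below are arbitrary values chosen only for this sketch), might look roughly like this:

import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;

public class CrawljaxExample {
    public static void main(String[] args) throws Exception {
        // Point Crawljax at the site; it drives a real browser and follows JavaScript-generated state changes
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://www.sciencedirect.com/science/article/pii/S1568494612005741");
        builder.setMaximumStates(10);   // arbitrary limit for this sketch
        builder.setMaximumDepth(2);     // arbitrary limit for this sketch

        new CrawljaxRunner(builder.build()).call();
    }
}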
Answered by BasK
I have worked out a solution for crawling dynamic web pages using Aperture and Selenium WebDriver.
Aperture is a crawling toolkit, and Selenium is a browser-automation/testing tool that can render the page the way a real browser does.
1. Extract the aperture-core jar file with a decompiler tool and create a simple web crawling Java program. (https://svn.code.sf.net/p/aperture/code/aperture/trunk/)
2. Download the Selenium WebDriver jar files and add them to your program.
3. Go to the CreatedDataObjec() method in org.semanticdesktop.aperture.accessor.http.HttpAccessor (in the decompiled Aperture source) and add the code below:
// Inside HttpAccessor: fetch the page with a real browser so the JavaScript is executed,
// then hand the rendered HTML back to Aperture as the content stream.
// (Requires the org.openqa.selenium.* and java.io.ByteArrayInputStream imports in HttpAccessor.)
WebDriver driver = new FirefoxDriver();
String baseurl = uri.toString();   // 'uri' and 'stream' come from the surrounding Aperture code
driver.get(uri.toString());
String str = driver.getPageSource();
driver.close();
stream = new ByteArrayInputStream(str.getBytes());