Original question: http://stackoverflow.com/questions/17399055/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Android Web Scraping with a Headless Browser
Asked by Pierre
I have spent a day researching a library that can accomplish the following:
- Retrieve the full contents of a webpage in the background, without rendering the result to a view.
- Support pages that fire off ajax requests to load additional result data after the initial HTML has loaded.
- Let me grab elements from the resulting HTML by XPath or CSS selector.
- In the future, possibly navigate to a next page (firing events, clicking buttons/links, etc.).
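For the XPath requirement in the list above, a minimal sketch using only the standard javax.xml APIs might look like the following. It assumes the retrieved markup is well-formed XHTML (real-world HTML often is not and would first need a lenient parser); the class name XPathGrab and the method select are illustrative names, not part of any library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath expression against well-formed XHTML.
public class XPathGrab {

    /** Returns the text content of every node matched by the expression. */
    public static List<String> select(String xhtml, String expression) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                    (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Malformed markup or an invalid expression ends up here.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }
}
```

A call such as XPathGrab.select(html, "//div[@class='result']") would then return the text of each matching div, provided the markup parses as XML.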
Here is what I have tried without success:
- Jsoup: Works great but has no support for JavaScript/ajax, so it does not load the full page.
- Android's built-in HttpEntity: same JavaScript/ajax problem as Jsoup.
- HtmlUnit: Looks like exactly what I need, but after hours I cannot get it to work on Android. (Other users failed when trying to load the 12MB+ of jar files. I loaded the full source code and referenced it as a project library, only to find that things such as Applets and java.awt, which HtmlUnit uses, do not exist on Android.)
- Rhino: I find this very confusing; I don't know how to get it working on Android, or even whether it is what I am looking for.
- Selenium driver: Looks like it could work, but there is no straightforward way to run it headless, so that the actual HTML is not displayed in a view.
I really want HtmlUnit to work, as it seems best suited to my needs. Is there any way to do that, or at least another suitable library that I have missed?
I am currently using Android Studio 0.1.7 and can move to Eclipse if needed.
Thanks in advance!
Accepted answer by Pierre
OK, after two weeks I admit defeat and am using a workaround that works great for me at the moment.
The problem:
Porting HtmlUnit to Android is too difficult (at least at my level of expertise). I am sure it's a worthwhile project (and not that time-consuming for an experienced Java programmer). I emailed the HtmlUnit developers; they replied that they are not looking into a port and have not assessed the effort involved, but suggested that anyone who wants to start such a project should send a message to their mailing list to get more developers involved (http://htmlunit.sourceforge.net/mail-lists.html).
The workaround:
I used Android's built-in WebView and overrode the onPageFinished method of WebViewClient to inject JavaScript that grabs all the HTML after the page has fully loaded. The WebView can also be used to run further JavaScript actions: clicking buttons, filling in forms, etc.
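As a hedged illustration of the follow-up actions mentioned above (clicking buttons, filling in forms), the javascript: URLs passed to loadUrl could be built by a small helper like the one below. The class JsActions and its methods are hypothetical names introduced for this sketch; it only builds strings and assumes the target page supports document.querySelector.

```java
// Hypothetical helper that builds javascript: URLs for WebView.loadUrl().
public class JsActions {

    /** Wraps a value in single quotes, escaping it for a JavaScript string literal. */
    public static String quote(String value) {
        StringBuilder sb = new StringBuilder("'");
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\'': sb.append("\\'"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default: sb.append(c);
            }
        }
        return sb.append("'").toString();
    }

    /** Builds a URL that clicks the first element matching the CSS selector. */
    public static String click(String cssSelector) {
        return "javascript:(function(){var e=document.querySelector("
                + quote(cssSelector) + ");if(e){e.click();}})()";
    }

    /** Builds a URL that sets the value of the first matching form field. */
    public static String fill(String cssSelector, String value) {
        return "javascript:(function(){var e=document.querySelector("
                + quote(cssSelector) + ");if(e){e.value=" + quote(value) + ";}})()";
    }
}
```

Inside onPageFinished, something like view.loadUrl(JsActions.click("#next")) would then trigger the click; the selector "#next" is only an example.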
Code:
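import android.content.Context;
import android.webkit.JavascriptInterface;
import android.webkit.WebView;
import android.webkit.WebViewClient;

// webView, context, StartURL and ParseHtml are assumed to be defined elsewhere.
webView.getSettings().setJavaScriptEnabled(true);
final MyJavaScriptInterface jInterface = new MyJavaScriptInterface(context);
webView.addJavascriptInterface(jInterface, "HtmlViewer");
webView.setWebViewClient(new WebViewClient() {
    @Override
    public void onPageFinished(WebView view, String url) {
        // Once the page has loaded, inject JavaScript that hands its HTML to showHTML()
        view.loadUrl("javascript:window.HtmlViewer.showHTML("
                + "'<head>'+document.getElementsByTagName('html')[0].innerHTML+'</head>');");
    }
});
webView.loadUrl(StartURL);
// Note: loadUrl() is asynchronous, so jInterface.html is still unset here unless
// showHTML() has already run; in practice ParseHtml should be triggered after it.
ParseHtml(jInterface.html);

public class MyJavaScriptInterface {
    private Context ctx;
    public String html;

    MyJavaScriptInterface(Context ctx) {
        this.ctx = ctx;
    }

    // Called from the injected JavaScript with the page's HTML
    @JavascriptInterface
    public void showHTML(String _html) {
        html = _html;
    }
}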
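Because the page loads asynchronously, the HTML only becomes available once the injected JavaScript has called showHTML. One way to handle that, sketched here as an assumption rather than as the answer's own method, is to give the interface object a callback and parse from there. HtmlReceiver and Listener are hypothetical names; on Android the showHTML method would additionally carry the @JavascriptInterface annotation, which is left out so that the sketch compiles as plain Java.

```java
// Hypothetical variant of the JavaScript interface that notifies a listener.
public class HtmlReceiver {

    /** Callback invoked once the page's HTML has been delivered. */
    public interface Listener {
        void onHtml(String html);
    }

    private final Listener listener;
    private volatile String html; // written by the bridge thread, read elsewhere

    public HtmlReceiver(Listener listener) {
        this.listener = listener;
    }

    // On Android, annotate with @JavascriptInterface so that the injected
    // window.HtmlViewer.showHTML(...) call can reach this method.
    public void showHTML(String pageHtml) {
        html = pageHtml;
        listener.onHtml(pageHtml); // parse here instead of right after loadUrl()
    }

    /** Returns the last HTML received, or null if nothing has arrived yet. */
    public String getHtml() {
        return html;
    }
}
```

Methods exposed through addJavascriptInterface are invoked on a background thread rather than the UI thread, so a listener that touches views would need to post back to the UI thread first.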
Answer by bluiska
I have taken the implementation mentioned above (injecting JavaScript) and that works for me. All I do is set the visibility of the WebView so that it is hidden under other UI elements. I was also thinking of doing the same with Selenium. I have used Selenium with Chrome in Python and it's great but, like you mentioned, it is not easy to hide the browser window. But I think it might be possible to simply not show the component in Android. I'll have to try.