Web scraping with Java
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/3202305/
Asked by NoneType
I'm not able to find any good Java-based web scraping API. The site I need to scrape does not provide any API either; I want to iterate over all its web pages using some pageID and extract the HTML titles and other data from their DOM trees.
Are there ways other than web scraping?
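For context, the loop the question describes can be sketched with nothing but the JDK (Java 11+ `HttpClient`). The URL pattern and pageID range below are hypothetical placeholders, and the regex-based title extraction is deliberately crude; the answers below cover real HTML parsers, which are far more robust.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageTitleScraper {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Return the text of the first <title> element, or null if none is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical URL pattern and ID range; substitute the real site's scheme.
        for (int pageID = 1; pageID <= 3; pageID++) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/page?pageID=" + pageID))
                    .build();
            String html = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(pageID + ": " + extractTitle(html));
        }
    }
}
```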
Accepted answer by Wajdy Essam
jsoup
Extracting the title is not difficult, and you have many options; search here on Stack Overflow for "Java HTML parsers". One of them is Jsoup.
You can navigate the page using the DOM if you know the page structure; see http://jsoup.org/cookbook/extracting-data/dom-navigation
It's a good library and I've used it in my last projects.
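For reference, a minimal Jsoup sketch of the title/DOM extraction the question asks about. This assumes the `org.jsoup:jsoup` dependency is on the classpath, and the URL and the `h1` selector are placeholders for the real page's structure:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTitleExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; substitute the page you actually need to scrape.
        Document doc = Jsoup.connect("https://example.com/page?pageID=1").get();

        // The parsed <title>, plus any other DOM navigation you need.
        System.out.println("Title: " + doc.title());
        for (Element h1 : doc.select("h1")) {
            System.out.println("h1: " + h1.text());
        }
    }
}
```

`Jsoup.connect(...).get()` fetches and parses in one step; `doc.select(...)` then takes CSS-style selectors for the rest of the DOM.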
Answered by Mikos
Look at an HTML parser such as TagSoup, HTMLCleaner or NekoHTML.
Answered by KJW
Your best bet is to use Selenium WebDriver, since it:
- Provides visual feedback to the coder (see your scraping in action, see where it stops)
- Is accurate and consistent, as it directly controls the browser you use.
- Is slow. It doesn't hit web pages as fast as HtmlUnit does, but sometimes you don't want to hit them too fast.

HtmlUnit is fast but is horrible at handling JavaScript and AJAX.
Answered by Beschi
HtmlUnit can be used to do web scraping; it supports invoking pages and filling and submitting forms. I have used it in my project. It is a good Java library for web scraping. Read here for more.
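As a hedged sketch of the page-invocation and form-submission workflow this answer mentions, assuming the HtmlUnit dependency is on the classpath. The URL, form index, and field names (`q`, `go`) are hypothetical placeholders for the real page's markup:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class HtmlUnitFormExample {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Placeholder URL; substitute the page you want to drive.
            HtmlPage page = webClient.getPage("https://example.com/search");
            System.out.println("Title: " + page.getTitleText());

            // Hypothetical form layout: first form, a text field named "q",
            // a submit button named "go".
            HtmlForm form = page.getForms().get(0);
            HtmlTextInput field = form.getInputByName("q");
            field.setValueAttribute("web scraping");
            HtmlSubmitInput submit = form.getInputByName("go");

            // Clicking the submit button returns the resulting page.
            HtmlPage results = submit.click();
            System.out.println("Result title: " + results.getTitleText());
        }
    }
}
```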
Answered by user1374041
mechanize for Java would be a good fit for this, and as Wajdy Essam mentioned, it uses JSoup for the HTML. mechanize is a stateful HTTP/HTML client that supports navigation, form submissions, and page scraping.
http://gistlabs.com/software/mechanize-for-java/ (and the GitHub repo here: https://github.com/GistLabs/mechanize)
Answered by Slavus
There is also Jaunt Java Web Scraping & JSON Querying - http://jaunt-api.com
Answered by Maithilish
If you wish to automate scraping of a large number of pages or a lot of data, then you could try Gotz ETL.
It is completely model driven, like a real ETL tool. Data structures, task workflow, and the pages to scrape are defined with a set of XML definition files, and no coding is required. Queries can be written either using selectors with JSoup or XPath with HtmlUnit.
Answered by Louis-wht
You might look into jwht-scrapper!
This is a complete scraping framework that has all the features a developer could expect from a web scraper:
- Proxy support
- Warning sign support to detect captchas and more
- Complex link-following features
- Multithreading
- Various scraping delays when required
- Rotating User-Agent
- Automatic request retries and HTTP redirect support
- HTTP headers, cookies, and more
- GET and POST support
- Annotation configuration
- Detailed scraping metrics
- Async handling of the scraper client
- jwht-htmltopojo, a fully featured framework to map HTML to POJOs
- Custom input format handling and built-in JSON -> POJO mapping
- Full exception handling control
- Detailed logging with log4j
- POJO injection
- Custom processing hooks
- Easy-to-use and well-documented API
It works with the [jwht-htmltopojo](https://github.com/whimtrip/jwht-htmltopojo) lib, which itself uses the Jsoup mentioned by several other people here.
Together they will help you build awesome scrapers, mapping HTML directly to POJOs and bypassing any classical scraping problems in only a matter of minutes!
Hope this might help some people here!
Disclaimer: I am the one who developed it; feel free to let me know your remarks!

