Web scraping with Java
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/3202305/
Asked by NoneType
I'm not able to find any good Java-based web scraping API. The site I need to scrape does not provide any API either; I want to iterate over all its web pages using some pageID and extract the HTML titles and other data from their DOM trees.
Are there ways other than web scraping?
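For context, the loop the question describes can be sketched with nothing but the JDK (Java 11+ `HttpClient`). The URL pattern and pageID range below are hypothetical placeholders, and the regex-based title extraction is deliberately crude; the answers below cover real HTML parsers, which are far more robust.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageTitleScraper {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Return the text of the first <title> element, or null if none is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical URL pattern and ID range; substitute the real site's scheme.
        for (int pageID = 1; pageID <= 3; pageID++) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/page?pageID=" + pageID))
                    .build();
            String html = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(pageID + ": " + extractTitle(html));
        }
    }
}
```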
Accepted answer by Wajdy Essam
jsoup
Extracting the title is not difficult, and you have many options; search here on Stack Overflow for "Java HTML parsers". One of them is Jsoup.
You can navigate the page using the DOM if you know the page structure; see http://jsoup.org/cookbook/extracting-data/dom-navigation
It's a good library and I've used it in my last projects.
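For reference, a minimal Jsoup sketch of the title/DOM extraction the question asks about. This assumes the `org.jsoup:jsoup` dependency is on the classpath, and the URL and the `h1` selector are placeholders for the real page's structure:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTitleExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; substitute the page you actually need to scrape.
        Document doc = Jsoup.connect("https://example.com/page?pageID=1").get();

        // The parsed <title>, plus any other DOM navigation you need.
        System.out.println("Title: " + doc.title());
        for (Element h1 : doc.select("h1")) {
            System.out.println("h1: " + h1.text());
        }
    }
}
```

`Jsoup.connect(...).get()` fetches and parses in one step; `doc.select(...)` then takes CSS-style selectors for the rest of the DOM.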
Answered by Mikos
Look at an HTML parser such as TagSoup, HTMLCleaner or NekoHTML.
Answered by KJW
Your best bet is to use Selenium WebDriver, since it:
- Provides visual feedback to the coder (see your scraping in action, see where it stops)
- Is accurate and consistent, as it directly controls the browser you use.
- Is slow. It doesn't hit web pages as fast as HtmlUnit does, but sometimes you don't want to hit them too fast.

HtmlUnit is fast but is horrible at handling JavaScript and AJAX.
Answered by Beschi
HtmlUnit can be used to do web scraping; it supports invoking pages and filling and submitting forms. I have used it in my project. It is a good Java library for web scraping. Read here for more.
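As a hedged sketch of the page-invocation and form-submission workflow this answer mentions, assuming the HtmlUnit dependency is on the classpath. The URL, form index, and field names (`q`, `go`) are hypothetical placeholders for the real page's markup:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class HtmlUnitFormExample {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Placeholder URL; substitute the page you want to drive.
            HtmlPage page = webClient.getPage("https://example.com/search");
            System.out.println("Title: " + page.getTitleText());

            // Hypothetical form layout: first form, a text field named "q",
            // a submit button named "go".
            HtmlForm form = page.getForms().get(0);
            HtmlTextInput field = form.getInputByName("q");
            field.setValueAttribute("web scraping");
            HtmlSubmitInput submit = form.getInputByName("go");

            // Clicking the submit button returns the resulting page.
            HtmlPage results = submit.click();
            System.out.println("Result title: " + results.getTitleText());
        }
    }
}
```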
Answered by user1374041
mechanize for Java would be a good fit for this, and as Wajdy Essam mentioned, it uses JSoup for the HTML. mechanize is a stateful HTTP/HTML client that supports navigation, form submissions, and page scraping.
http://gistlabs.com/software/mechanize-for-java/ (and the GitHub repo here: https://github.com/GistLabs/mechanize)
Answered by Slavus
There is also Jaunt Java Web Scraping & JSON Querying - http://jaunt-api.com
Answered by Maithilish
If you wish to automate scraping of a large number of pages or a lot of data, then you could try Gotz ETL.
It is completely model driven, like a real ETL tool. Data structures, task workflow, and the pages to scrape are defined with a set of XML definition files, and no coding is required. Queries can be written either using selectors with JSoup or XPath with HtmlUnit.
Answered by Louis-wht
You might look into jwht-scrapper!
This is a complete scraping framework that has all the features a developer could expect from a web scraper:
- Proxy support
- Warning sign support to detect captchas and more
- Complex link-following features
- Multithreading
- Various scraping delays when required
- Rotating User-Agent
- Automatic request retries and HTTP redirect support
- HTTP headers, cookies, and more
- GET and POST support
- Annotation configuration
- Detailed scraping metrics
- Async handling of the scraper client
- jwht-htmltopojo, a fully featured framework to map HTML to POJOs
- Custom input format handling and built-in JSON -> POJO mapping
- Full exception handling control
- Detailed logging with log4j
- POJO injection
- Custom processing hooks
- Easy-to-use and well-documented API
It works with the [jwht-htmltopojo](https://github.com/whimtrip/jwht-htmltopojo) lib, which itself uses the Jsoup mentioned by several other people here.
Together they will help you build awesome scrapers, mapping HTML directly to POJOs and bypassing any classical scraping problems in only a matter of minutes!
Hope this might help some people here!
Disclaimer: I am the one who developed it; feel free to let me know your remarks!

