Java web scraping, screen scraping, and data mining tips?

Question

Asked by JPC

I'm working on a project where I need to do a lot of screen scraping to collect a large amount of data as fast as possible. Does anyone know of any good APIs or resources that could help?

I'm using Java, by the way.

Here's what my workflow has been so far:


1. Connect to a website (using HttpComponents from Apache).
2. The website contains a section with a bunch of links that I need to visit. I use the built-in Java HTML parsers to work out which links those are, which makes for annoying and messy code.
3. Visit all the links that I found.
4. For each link that I visit, there is more data to extract, spread out over multiple pages, so I may need to visit more links.
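
For reference, the link-finding step of the workflow above might look roughly like this with the HTML parser that ships with the JDK (the `javax.swing.text.html` classes). This is only a sketch: the class name `LinkExtractor`, the method `extractLinks`, and the example.com URLs are made up for illustration, and the original post does not say which built-in classes were actually used.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not code from the original post.
class LinkExtractor {

    // Returns the absolute target of every <a href="..."> found in the given HTML.
    static List<String> extractLinks(String html, URI base) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.A) {
                    return; // only anchors are of interest here
                }
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href == null) {
                    return; // e.g. <a name="..."> without a target
                }
                try {
                    // Resolve relative links against the URL of the page.
                    links.add(base.resolve(href.toString().trim()).toString());
                } catch (IllegalArgumentException e) {
                    // Skip hrefs that are not valid URIs.
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up page content and base URL.
        String html = "<html><body><a href=\"/a\">A</a> <a href=\"b.html\">B</a></body></html>";
        System.out.println(extractLinks(html, URI.create("http://example.com/dir/index.html")));
    }
}
```

The callback style is a large part of why this approach tends to feel messy: every tag goes through the same handler, and picking out a specific section of the page means tracking parser state by hand.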

Thoughts:


Does anyone know of any higher-level, more intelligent HTML parsers than the one built into Java?
Basically, it's a depth-first search. I imagine I'll want to make this multithreaded at some point so I can visit some of these links in parallel.
Maybe what I'm really looking for is a multithreaded web crawling library.
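
To make the "visit links in parallel" idea concrete, here is a minimal sketch using only `java.util.concurrent`. Note that it swaps the depth-first search for a level-by-level (breadth-first) traversal, because that is easier to parallelize with `ExecutorService.invokeAll`. The class `ParallelCrawler` and the `fetchLinks` parameter are hypothetical: `fetchLinks` stands in for "download the page and return the links on it", which in a real crawler would involve an HTTP client and an HTML parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch; names are illustrative and not from any library.
class ParallelCrawler {

    // Visits everything reachable from start, fetching each level's pages in parallel.
    // fetchLinks is an assumed callback: given a URL, return the links found on that page.
    static Set<String> crawl(String start, Function<String, List<String>> fetchLinks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Set<String> visited = new LinkedHashSet<>();
            visited.add(start);
            List<String> frontier = List.of(start);
            while (!frontier.isEmpty()) {
                // One task per page in the current level.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> fetchLinks.apply(url));
                }
                // invokeAll blocks until every task in this level has finished.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        if (visited.add(link)) { // skip pages already seen
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
            return visited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("crawl interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Only the calling thread touches the `visited` set, so no extra synchronization is needed. A real crawler would also want politeness delays, a depth or page limit, and per-page error handling instead of failing the whole crawl on one bad fetch.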

If you haven't figured it out already, this is my first time messing around with this, so I'm having a hard time articulating exactly what I need. I would greatly appreciate any input from those of you who have done this before.

Answer 1

Answered by dogbane

I've found JSoup really good for HTML parsing.

For more pointers, check out this article: How to write a multi-threaded webcrawler

Answer 2

Answered by harshit

I used Bixo to extract hyperlinks and images while doing a depth search. It is built on top of Hadoop and Cascading, so there is a learning curve, but the example provided is good enough to configure for your own needs ...

Answer 3

Answered by Boris Pavlović

Try using the Web-Harvest project.

Answer 4

Answered by aldrinleal

Check out JSR-237 for Work Management, which is a cool idea when going multithreaded.

As for scraping, there are several alternatives. If ease of use is most important, I'd advise you to use HTMLUnit. Beyond that, you must roll your own.
