Java: web scraping, screen scraping, data mining tips?
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/4079784/
Asked by JPC
I'm working on a project and need to do a lot of screen scraping to collect a large amount of data as fast as possible. Does anyone know of any good APIs or resources that would help?
I'm using Java, by the way.
Here's what my workflow has been so far:
- Connect to a website (using HttpComponents from Apache)
- The website contains a section with a bunch of links that I need to visit (I use Java's built-in HTML parser to work out which links those are, which makes for annoying and messy code)
- Visit all the links that I found
- For each link that I visit, there's more data to extract, spread out over multiple pages, so I may need to visit more links
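The workflow above amounts to a depth-first traversal over links. Here is a minimal sketch of just that traversal, assuming a hypothetical `linksOf` function that stands in for "download the page and parse out its links". In this sketch it is backed by an in-memory map rather than real HTTP requests, and the class and method names are illustrative, not part of any library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper class for illustration; not from any library.
class DepthFirstCrawl {

    // Returns every page reachable from start, in depth-first visiting order.
    // linksOf is an assumption: it stands in for "fetch the page and extract its links".
    static List<String> crawl(String start, Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>(); // remembers insertion order
        Deque<String> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String url = stack.pop();
            if (!visited.add(url)) {
                continue; // already visited, skip to avoid loops
            }
            List<String> links = linksOf.apply(url);
            // Push in reverse so the first link on the page is visited first.
            for (int i = links.size() - 1; i >= 0; i--) {
                if (!visited.contains(links.get(i))) {
                    stack.push(links.get(i));
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // A tiny made-up site: page name -> links found on that page.
        Map<String, List<String>> site = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d", "a"),
                "c", List.of("d"),
                "d", List.of());
        System.out.println(crawl("a", url -> site.getOrDefault(url, List.of())));
        // prints [a, b, d, c]
    }
}
```

In a real scraper, `linksOf` would wrap the HTTP client and the HTML parser. The visited set is what keeps pages that link back to each other from causing an endless loop.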
Thoughts:
- Does anyone know of a higher-level, smarter HTML parser than the one built into Java?
- This is basically a depth-first search. I imagine I'll want to make it multithreaded at some point so I can visit some of these links in parallel.
- Maybe what I'm really looking for is a multithreaded web crawling library.
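One possible shape for the multithreaded version is sketched below using only `java.util.concurrent`. Note that it swaps depth-first order for a level-by-level (breadth-first) crawl: all pages found at one level are fetched in parallel, then the newly discovered links form the next level. As an assumption for illustration, `linksOf` again stands in for "fetch the page and extract its links", and the class name and thread count are made up.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper class for illustration; not from any library.
class ParallelCrawl {

    // Returns the set of all pages reachable from start.
    // linksOf (an assumption) is called from worker threads, so it must be thread-safe.
    static Set<String> crawl(String start, Function<String, List<String>> linksOf, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Set<String> seen = new HashSet<>(); // only touched by the calling thread
        seen.add(start);
        List<String> frontier = List.of(start);
        try {
            while (!frontier.isEmpty()) {
                // One task per page in the current level.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> linksOf.apply(url));
                }
                // invokeAll blocks until every page in this level has been processed.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        if (seen.add(link)) {
                            next.add(link); // first time we see this link
                        }
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("crawl interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("page task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return seen;
    }

    public static void main(String[] args) {
        // A tiny made-up site: page name -> links found on that page.
        Map<String, List<String>> site = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d", "a"),
                "c", List.of("d"),
                "d", List.of());
        System.out.println(crawl("a", url -> site.getOrDefault(url, List.of()), 4));
    }
}
```

Keeping the seen set on the calling thread avoids the need for a concurrent collection. The cost is that each level waits for its slowest page before the next level starts.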
If you haven't figured it out, this is my first time messing around with this, so I'm having a hard time articulating exactly what I need. I would greatly appreciate any input from those of you who have done this before.
Answered by dogbane
I've found JSoup really good for HTML parsing.
For more pointers, check out this article: How to write a multi-threaded webcrawler
Answered by harshit
Answered by Boris Pavlović
Try using the Web-Harvest project.
Answered by aldrinleal
Check out JSR-237 for Work Management, which is a cool idea when going multithreaded.
As for scraping, there are several alternatives. If ease of use is most important, I'd advise you to use HTMLUnit. Beyond that, you must roll your own.