A very simple C++ web crawler/spider?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/4278024/

Date: 2020-08-28 15:00:59  Source: igfitidea

A very simple C++ web crawler/spider?

c++, web-crawler

Asked by popurity09

I am trying to write a very simple web crawler/spider app in C++. I have searched Google for a simple example to understand the concept, and I found this:


http://www.example-code.com/vcpp/spider.asp


But it's a bit too complicated/hard for me to digest.


What I am trying to do is just, for example:


Enter the URL: www.example.com (I will use bash->wget to get the contents/source code)


Then it will look for, maybe, "a href" links and store them in some data file.

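For the download step, a rough sketch of that wget approach from C++ (assuming wget is installed and on PATH; the output file name page.html is just a placeholder) might look like:

    #include <cstdlib>
    #include <fstream>
    #include <sstream>
    #include <string>

    // Download a URL by shelling out to wget and return the page source.
    // Assumes wget is on PATH; error handling is omitted for brevity.
    std::string download_with_wget(const std::string& url) {
        std::string cmd = "wget -q -O page.html \"" + url + "\"";
        std::system(cmd.c_str());              // writes the page to page.html

        std::ifstream in("page.html");
        std::ostringstream contents;
        contents << in.rdbuf();                // slurp the whole file into memory
        return contents.str();
    }

    int main() {
        std::string html = download_with_wget("http://www.example.com");
        // ... search html for "a href" links here and write them to a data file ...
    }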

Any simple tutorial, or guidelines for me?


I have just started learning C++ (1 month).


Answered by Charles Salvia

All right, I'll try to point you in the right direction. Conceptually, a webcrawler is pretty simple. It revolves around a FIFO queue data structure which stores pending URLs. C++ has a built-in queue structure in the standard library, std::queue, which you can use to store URLs as strings.


The basic algorithm is pretty straightforward:


  1. Begin with a base URL that you select, and place it on the top of your queue
  2. Pop the URL at the top of the queue and download it
  3. Parse the downloaded HTML file and extract all links
  4. Insert each extracted link into the queue
  5. Go to step 2, or stop once you reach some specified limit
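A minimal sketch of that loop in C++, assuming hypothetical fetch_page() and extract_links() helpers (stubbed out here; the actual HTTP and HTML-parsing work is what the rest of this answer is about), might look like:

    #include <iostream>
    #include <queue>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical helpers: in a real crawler these would wrap an HTTP
    // library (or a call out to wget) and an HTML parser.
    std::string fetch_page(const std::string& url) { return ""; }
    std::vector<std::string> extract_links(const std::string& html) { return {}; }

    int main() {
        std::queue<std::string> pending;   // FIFO of URLs waiting to be crawled
        std::set<std::string> seen;        // so we never enqueue the same URL twice
        const std::size_t limit = 100;     // step 5: stop after this many pages

        pending.push("http://www.example.com");   // step 1: seed URL
        seen.insert("http://www.example.com");

        std::size_t downloaded = 0;
        while (!pending.empty() && downloaded < limit) {
            std::string url = pending.front();    // step 2: take the next URL
            pending.pop();

            std::string html = fetch_page(url);   // step 2: download it
            ++downloaded;

            for (const std::string& link : extract_links(html)) {   // step 3
                if (seen.insert(link).second) {   // step 4: enqueue unseen links
                    pending.push(link);
                }
            }
            std::cout << "crawled: " << url << "\n";
        }
    }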

Now, I said that a webcrawler is conceptually simple, but implementing it is not so simple. As you can see from the above algorithm, you'll need: an HTTP networking library to allow you to download URLs, and a good HTML parser that will let you extract links. You mentioned you could use wget to download pages. That simplifies things somewhat, but you still need to actually parse the downloaded HTML docs. Parsing HTML correctly is a non-trivial task. A simple string search for <a href= will only work sometimes. However, if this is just a toy program that you're using to familiarize yourself with C++, a simple string search may suffice for your purposes. Otherwise, you need to use a serious HTML parsing library.

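For the toy-program case, a naive link extraction by string search might look like the sketch below. It only handles double-quoted href attributes, does not resolve relative URLs, and will happily match links inside comments or scripts, which is exactly why a proper parsing library is recommended for anything serious:

    #include <string>
    #include <vector>

    // Naive link extraction: scan the page for href="..." substrings.
    // This is a toy; real HTML needs a real parser.
    std::vector<std::string> extract_links(const std::string& html) {
        std::vector<std::string> links;
        const std::string marker = "href=\"";
        std::size_t pos = 0;
        while ((pos = html.find(marker, pos)) != std::string::npos) {
            pos += marker.size();
            std::size_t end = html.find('"', pos);   // closing quote of the attribute
            if (end == std::string::npos) break;
            links.push_back(html.substr(pos, end - pos));
            pos = end + 1;
        }
        return links;
    }

A function like this could slot straight into the extract_links() placeholder in the loop sketch above.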

There are also other considerations you need to take into account when writing a webcrawler, such as politeness. People will be pissed and possibly ban your IP if you attempt to download too many pages, too quickly, from the same host. So you may need to implement some sort of policy where your webcrawler waits for a short period before downloading each site. You also need some mechanism to avoid downloading the same URL again, obey the robots exclusion protocol, avoid crawler traps, etc... All these details add up to make actually implementing a robust webcrawler not such a simple thing.

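A crude way to get that waiting behaviour, for example, is simply to sleep between downloads; a real crawler would track the time of the last request per host instead of pausing globally:

    #include <chrono>
    #include <thread>

    // Crude politeness: pause before each download so we don't hammer
    // any single server. Call this right before fetching the next URL.
    void polite_pause() {
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }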

That said, I agree with larsmans in the comments. A webcrawler isn't the greatest way to learn C++. Also, C++ isn't the greatest language to write a webcrawler in. The raw performance and low-level access you get in C++ is useless when writing a program like a webcrawler, which spends most of its time waiting for URLs to resolve and download. A higher-level scripting language like Python or something is better suited for this task, in my opinion.


Answered by user2195463

Check out this web crawler and indexer written in C++: Mitza web crawler. The code can be used as a reference. It is clean and provides a good starting point for webcrawler coding. Sequence diagrams can be found on the pages linked above.
