Simple web crawler in C#
Original question: http://stackoverflow.com/questions/10452749/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Khaled Mohamed
I have created a simple web crawler, but I want to add recursion so that for every page that is opened I can get the URLs on that page. I have no idea how to do that, and I also want to include threads to make it faster. Here is my code:
namespace Crawler
{
    public partial class Form1 : Form
    {
        String Rstring;

        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            WebRequest myWebRequest;
            WebResponse myWebResponse;
            String URL = textBox1.Text;
            myWebRequest = WebRequest.Create(URL);
            myWebResponse = myWebRequest.GetResponse();                // returns a response from an Internet resource
            Stream streamResponse = myWebResponse.GetResponseStream(); // returns the data stream and saves it in the stream
            StreamReader sreader = new StreamReader(streamResponse);   // reads the data stream
            Rstring = sreader.ReadToEnd();                             // reads it to the end
            String Links = GetContent(Rstring);                        // gets the links only
            textBox2.Text = Rstring;
            textBox3.Text = Links;
            streamResponse.Close();
            sreader.Close();
            myWebResponse.Close();
        }

        private String GetContent(String Rstring)
        {
            String sString = "";
            HTMLDocument d = new HTMLDocument();
            IHTMLDocument2 doc = (IHTMLDocument2)d;
            doc.write(Rstring);
            IHTMLElementCollection L = doc.links;
            foreach (IHTMLElement links in L)
            {
                sString += links.getAttribute("href", 0);
                sString += "\n";
            }
            return sString;
        }
    }
}
Accepted answer by Darius Kucinskas
I fixed your GetContent method as follows to get new links from the crawled page:
public ISet<string> GetNewLinks(string content)
{
    Regex regexLink = new Regex(@"(?<=<a\s*?href=(?:'|""))[^'""]*?(?=(?:'|""))");
    ISet<string> newLinks = new HashSet<string>();
    foreach (var match in regexLink.Matches(content))
    {
        if (!newLinks.Contains(match.ToString()))
            newLinks.Add(match.ToString());
    }
    return newLinks;
}
Updated
Fixed: regex should be regexLink. Thanks @shashlearner for pointing this out (my typo).
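To illustrate the recursion (and depth limiting) the question asks about, here is a minimal sketch of how GetNewLinks could drive a recursive crawl. The RecursiveCrawler class, the visited set, and the depth parameter are my own assumptions for illustration, not part of the original answer:

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

namespace Crawler
{
    public static class RecursiveCrawler
    {
        // Tracks pages we have already seen so the same URL is never crawled twice.
        private static readonly ISet<string> visited = new HashSet<string>();

        public static void Crawl(string url, int depth)
        {
            if (depth <= 0 || !visited.Add(url))
                return;

            string content;
            try
            {
                using (var client = new WebClient())
                    content = client.DownloadString(url);
            }
            catch (WebException)
            {
                return; // skip pages that fail to download
            }

            foreach (string link in GetNewLinks(content))
            {
                // Only follow absolute http(s) links in this sketch.
                if (link.StartsWith("http"))
                    Crawl(link, depth - 1);
            }
        }

        // Same extraction logic as the answer above.
        public static ISet<string> GetNewLinks(string content)
        {
            var regexLink = new Regex(@"(?<=<a\s*?href=(?:'|""))[^'""]*?(?=(?:'|""))");
            var newLinks = new HashSet<string>();
            foreach (Match match in regexLink.Matches(content))
                newLinks.Add(match.ToString());
            return newLinks;
        }
    }
}

A call such as RecursiveCrawler.Crawl(textBox1.Text, 2) would crawl the start page and every page it links to, one level deep.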
Answer by Connor
The following includes an answer/recommendation.
I believe you should use a dataGridView instead of a textBox, since when you look at it in the GUI it is easier to see the links (URLs) found.
You could change:
textBox3.Text = Links;
to
dataGridView.DataSource = Links;
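One caveat: a DataGridView cannot bind directly to a plain string, so the links would first need to be wrapped in a bindable list. A minimal sketch, assuming the links are available as a collection of strings named links (an illustrative name, not from the original answer):

// Requires using System.Linq;
// Wrap each URL in an object with a property so the DataGridView can generate a column for it.
var rows = links.Select(l => new { Url = l }).ToList();
dataGridView.DataSource = rows;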
Now for the question: you haven't included which using System.… directives were used. It would be appreciated if you could list them, as I can't figure them out.
Answer by Tom
From a design standpoint, I've written a few web crawlers. Basically you want to implement a depth-first search using a stack data structure. You can use breadth-first search also, but you'll likely run into stack memory issues. Good luck.
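A minimal sketch of that stack-based depth-first traversal, where the extractLinks delegate is assumed to be something like the GetNewLinks method from the accepted answer:

using System;
using System.Collections.Generic;
using System.Net;

public static class DepthFirstCrawler
{
    // Iterative DFS: an explicit Stack<string> replaces the call stack,
    // and a visited set prevents cycles between pages that link to each other.
    public static void Crawl(string startUrl, Func<string, IEnumerable<string>> extractLinks)
    {
        var pending = new Stack<string>();
        var visited = new HashSet<string>();
        pending.Push(startUrl);

        while (pending.Count > 0)
        {
            string url = pending.Pop();
            if (!visited.Add(url))
                continue; // already crawled

            string content;
            try
            {
                using (var client = new WebClient())
                    content = client.DownloadString(url);
            }
            catch (WebException)
            {
                continue; // skip unreachable pages
            }

            Console.WriteLine(url);
            foreach (string link in extractLinks(content))
                pending.Push(link);
        }
    }
}

Swapping the Stack<string> for a Queue<string> (Push/Pop becoming Enqueue/Dequeue) turns the same loop into breadth-first search with no other changes.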
Answer by Misterhex
I have created something similar using Reactive Extensions.
https://github.com/Misterhex/WebCrawler
I hope it can help you.
Crawler crawler = new Crawler();
var observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));
observable.Subscribe(onNext: Console.WriteLine,
                     onCompleted: () => Console.WriteLine("Crawling completed"));

