Simple web crawler in C#
Original question: http://stackoverflow.com/questions/10452749/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Khaled Mohamed
I have created a simple web crawler, but I want to add recursion so that for every page that is opened I can get the URLs on that page. I have no idea how to do that, and I also want to include threads to make it faster. Here is my code:
namespace Crawler
{
    public partial class Form1 : Form
    {
        String Rstring;

        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            WebRequest myWebRequest;
            WebResponse myWebResponse;
            String URL = textBox1.Text;
            myWebRequest = WebRequest.Create(URL);
            myWebResponse = myWebRequest.GetResponse();                // returns a response from an Internet resource
            Stream streamResponse = myWebResponse.GetResponseStream(); // returns the data stream and saves it in the stream
            StreamReader sreader = new StreamReader(streamResponse);   // reads the data stream
            Rstring = sreader.ReadToEnd();                             // reads it to the end
            String Links = GetContent(Rstring);                        // gets the links only
            textBox2.Text = Rstring;
            textBox3.Text = Links;
            streamResponse.Close();
            sreader.Close();
            myWebResponse.Close();
        }

        private String GetContent(String Rstring)
        {
            String sString = "";
            HTMLDocument d = new HTMLDocument();
            IHTMLDocument2 doc = (IHTMLDocument2)d;
            doc.write(Rstring);
            IHTMLElementCollection L = doc.links;
            foreach (IHTMLElement links in L)
            {
                sString += links.getAttribute("href", 0);
                sString += "\n";
            }
            return sString;
        }
    }
}
Accepted answer by Darius Kucinskas
I fixed your GetContent method as follows to get new links from the crawled page:
public ISet<string> GetNewLinks(string content)
{
    Regex regexLink = new Regex(@"(?<=<a\s*?href=(?:'|""))[^'""]*?(?=(?:'|""))");
    ISet<string> newLinks = new HashSet<string>();
    foreach (var match in regexLink.Matches(content))
    {
        if (!newLinks.Contains(match.ToString()))
            newLinks.Add(match.ToString());
    }
    return newLinks;
}
Updated
Fixed: regex should be regexLink. Thanks @shashlearner for pointing this out (my typo).
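To illustrate the recursion (and depth limiting) the question asks about, here is a minimal sketch of how GetNewLinks could drive a recursive crawl. The RecursiveCrawler class, the visited set, and the depth parameter are my own assumptions for illustration, not part of the original answer:

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

namespace Crawler
{
    public static class RecursiveCrawler
    {
        // Tracks pages we have already seen so the same URL is never crawled twice.
        private static readonly ISet<string> visited = new HashSet<string>();

        public static void Crawl(string url, int depth)
        {
            if (depth <= 0 || !visited.Add(url))
                return;

            string content;
            try
            {
                using (var client = new WebClient())
                    content = client.DownloadString(url);
            }
            catch (WebException)
            {
                return; // skip pages that fail to download
            }

            foreach (string link in GetNewLinks(content))
            {
                // Only follow absolute http(s) links in this sketch.
                if (link.StartsWith("http"))
                    Crawl(link, depth - 1);
            }
        }

        // Same extraction logic as the answer above.
        public static ISet<string> GetNewLinks(string content)
        {
            var regexLink = new Regex(@"(?<=<a\s*?href=(?:'|""))[^'""]*?(?=(?:'|""))");
            var newLinks = new HashSet<string>();
            foreach (Match match in regexLink.Matches(content))
                newLinks.Add(match.ToString());
            return newLinks;
        }
    }
}

A call such as RecursiveCrawler.Crawl(textBox1.Text, 2) would crawl the start page and every page it links to, one level deep.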
Answer by Connor
The following includes an answer/recommendation.
I believe you should use a dataGridView instead of a textBox, since when you look at it in the GUI it is easier to see the links (URLs) found.
You could change:
textBox3.Text = Links;
to
dataGridView.DataSource = Links;
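One caveat: a DataGridView cannot bind directly to a plain string, so the links would first need to be wrapped in a bindable list. A minimal sketch, assuming the links are available as a collection of strings named links (an illustrative name, not from the original answer):

// Requires using System.Linq;
// Wrap each URL in an object with a property so the DataGridView can generate a column for it.
var rows = links.Select(l => new { Url = l }).ToList();
dataGridView.DataSource = rows;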
Now for the question: you haven't included which using System.… directives were used. It would be appreciated if you could list them, as I can't figure them out.
Answer by Tom
From a design standpoint, I've written a few web crawlers. Basically you want to implement a depth-first search using a stack data structure. You can use breadth-first search also, but you'll likely run into stack memory issues. Good luck.
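A minimal sketch of that stack-based depth-first traversal, where the extractLinks delegate is assumed to be something like the GetNewLinks method from the accepted answer:

using System;
using System.Collections.Generic;
using System.Net;

public static class DepthFirstCrawler
{
    // Iterative DFS: an explicit Stack<string> replaces the call stack,
    // and a visited set prevents cycles between pages that link to each other.
    public static void Crawl(string startUrl, Func<string, IEnumerable<string>> extractLinks)
    {
        var pending = new Stack<string>();
        var visited = new HashSet<string>();
        pending.Push(startUrl);

        while (pending.Count > 0)
        {
            string url = pending.Pop();
            if (!visited.Add(url))
                continue; // already crawled

            string content;
            try
            {
                using (var client = new WebClient())
                    content = client.DownloadString(url);
            }
            catch (WebException)
            {
                continue; // skip unreachable pages
            }

            Console.WriteLine(url);
            foreach (string link in extractLinks(content))
                pending.Push(link);
        }
    }
}

Swapping the Stack<string> for a Queue<string> (Push/Pop becoming Enqueue/Dequeue) turns the same loop into breadth-first search with no other changes.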
Answer by Misterhex
I have created something similar using Reactive Extensions.
https://github.com/Misterhex/WebCrawler
I hope it can help you.
Crawler crawler = new Crawler();
var observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));
observable.Subscribe(onNext: Console.WriteLine,
                     onCompleted: () => Console.WriteLine("Crawling completed"));

