Get HTML code from a website using C#

Question

Asked by ggcodes

How can I get the HTML code from a website, save it, and find some text with a LINQ expression?

I'm using the following code to get the source of a web page:

public static String code(string Url)
{
    HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(Url);
    myRequest.Method = "GET";
    WebResponse myResponse = myRequest.GetResponse();
    StreamReader sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
    string result = sr.ReadToEnd();
    sr.Close();
    myResponse.Close();

    return result;
}

How do I find the text in a div in the source of the web page?

Answer 1

Accepted answer by SyntaxError

To get the HTML code from a website, you can use code like this:

using System;
using System.IO;
using System.Net;
using System.Text;

string urlAddress = "http://google.com";
string data = null;

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);

// Dispose the response and the reader even if reading throws
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
  if (response.StatusCode == HttpStatusCode.OK)
  {
    using (Stream receiveStream = response.GetResponseStream())
    // Use the charset reported by the server when there is one
    using (StreamReader readStream = String.IsNullOrWhiteSpace(response.CharacterSet)
        ? new StreamReader(receiveStream)
        : new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet)))
    {
      data = readStream.ReadToEnd();
    }
  }
}

This gives you the HTML code returned by the website. Finding text in it via LINQ is not that easy, though. A regular expression might seem like a better fit, but regular expressions do not work well with HTML.
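
One detail in the code above that can trip people up: Encoding.GetEncoding throws an ArgumentException when the charset name is not recognized on the current platform. A minimal sketch of a more forgiving lookup follows. CharsetHelper and ResolveEncoding are hypothetical names, and both the UTF-8 fallback and the quote trimming are assumptions rather than part of the original answer.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper: maps a charset name taken from a response to an Encoding.
    // Assumption: UTF-8 is an acceptable fallback when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;

        try
        {
            // Defensive assumption: strip whitespace and quotes around the value.
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // The name is not a known or supported encoding.
            return Encoding.UTF8;
        }
    }
}
```

With this in place, the reader could be created as new StreamReader(receiveStream, CharsetHelper.ResolveEncoding(response.CharacterSet)).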

Answer 2

Answered by jammykam

The best thing to use is HtmlAgilityPack. You can also look into Fizzler or CsQuery, depending on how you need to select elements from the retrieved page. Plain LINQ over the raw string or regular expressions are just too error prone, especially when the HTML is malformed, is missing closing tags, or has nested child elements.

You need to stream the page into an HtmlDocument object and then select your required element.

// Call the page and get the generated HTML
var doc = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNode.ElementsFlags["br"] = HtmlAgilityPack.HtmlElementFlag.Empty;
doc.OptionWriteEmptyNodes = true;

try
{
    var webRequest = WebRequest.Create(pageUrl);
    // Dispose the response and its stream once the document is loaded
    using (var response = webRequest.GetResponse())
    using (Stream stream = response.GetResponseStream())
    {
        doc.Load(stream);
    }
}
catch (System.UriFormatException uex)
{
    // Log stands for the application's own logger; it is not defined in this snippet
    Log.Fatal("There was an error in the format of the url: " + pageUrl, uex);
    throw;
}
catch (System.Net.WebException wex)
{
    Log.Fatal("There was an error connecting to the url: " + pageUrl, wex);
    throw;
}

// Get the div by id and then its inner text
// (SelectSingleNode returns null when nothing matches)
string testDivSelector = "//div[@id='test']";
var divNode = doc.DocumentNode.SelectSingleNode(testDivSelector);
var divString = divNode?.InnerText;

[EDIT] Actually, scrap that. The simplest method is to use FizzlerEx, an updated jQuery/CSS3-selectors implementation of the original Fizzler project.

Code sample directly from their site:

using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

//get the page
var web = new HtmlWeb();
var document = web.Load("http://example.com/page.html");
var page = document.DocumentNode;

//loop through all div tags with item css class
foreach(var item in page.QuerySelectorAll("div.item"))
{
    var title = item.QuerySelector("h3:not(.share)").InnerText;
    var date = DateTime.Parse(item.QuerySelector("span:eq(2)").InnerText);
    var description = item.QuerySelector("span:has(b)").InnerHtml;
}

I don't think it can get any simpler than that.

Answer 3

Answered by Santosh Panda

You can use the WebClient class to simplify the task:

using System.Net;

using (WebClient client = new WebClient())
{
    string htmlCode = client.DownloadString("http://somesite.com/default.html");
}
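
Recent .NET versions mark WebClient as obsolete in favor of HttpClient, so here is a rough equivalent that also covers the "save it" part of the question. This is only a sketch: PageDownloader and DownloadAndSaveAsync are hypothetical names, and the code assumes .NET Core 2.0 or later, where File.WriteAllTextAsync is available.

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageDownloader
{
    // Hypothetical helper: downloads the page at url, saves it to path, and returns the HTML.
    // The HttpClient is passed in so that one instance can be reused across calls.
    public static async Task<string> DownloadAndSaveAsync(HttpClient client, string url, string path)
    {
        // GetStringAsync throws HttpRequestException on a non-success status code.
        string html = await client.GetStringAsync(url);

        // Assumption: .NET Core 2.0+; File.WriteAllText is the synchronous equivalent.
        await File.WriteAllTextAsync(path, html);
        return html;
    }
}
```

It could be called as: string html = await PageDownloader.DownloadAndSaveAsync(new HttpClient(), "http://somesite.com/default.html", "default.html");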

Answer 4

Answered by Mohamed Sayed

Here's an example of using the HttpWebRequest class to fetch a URL:

private void button1_Click(object sender, EventArgs e)
{
    string url = TextBox_url.Text;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    // Dispose the response and the reader when done
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        richTextBox1.Text = sr.ReadToEnd();
    }
}

Answer 5

Answered by youssef

Try this solution. It works fine.

try
{
    string url = textBox1.Text;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    // Dispose the response and the reader when done
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(sr);

        // SelectNodes returns null when no <a> elements are found
        var aTags = doc.DocumentNode.SelectNodes("//a");
        if (aTags != null)
        {
            foreach (var aTag in aTags)
            {
                richTextBox1.Text += aTag.InnerHtml + "\n";
            }
        }
    }
}
catch (Exception ex)
{
    MessageBox.Show("Failed to retrieve related keywords. " + ex);
}
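
Since the question asks for LINQ specifically: an HtmlAgilityPack document can also be queried with LINQ instead of XPath. Below is a minimal sketch, assuming the HTML is already in a string and the HtmlAgilityPack package is referenced. DivTextFinder and FindDivText are hypothetical names chosen for illustration.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class DivTextFinder
{
    // Hypothetical helper: returns the trimmed inner text of every <div>
    // whose id attribute equals divId.
    public static string[] FindDivText(string html, string divId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("div")
            // GetAttributeValue returns the default ("") when the attribute is absent
            .Where(d => d.GetAttributeValue("id", "") == divId)
            .Select(d => d.InnerText.Trim())
            .ToArray();
    }
}
```

For example, DivTextFinder.FindDivText("<html><body><div id='test'> Hello </div></body></html>", "test") yields an array containing the single string "Hello".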

Answer 6

Answered by Tickseeker

I am using AngleSharp and have been very satisfied with it.

Here is a simple example of how to fetch a page:

var config = Configuration.Default.WithDefaultLoader();
var document = await BrowsingContext.New(config).OpenAsync("https://www.google.com");

Now you have the web page in the document variable, and you can easily query it with LINQ or other methods. For example, to get a string value from an HTML table:

var someStringValue = document.All.Where(m =>
        m.LocalName == "td" &&
        m.HasAttribute("class") &&
        m.GetAttribute("class").Contains("pid-1-bid")
    ).ElementAt(0).TextContent.ToString();

To use CSS selectors please see AngleSharp examples.
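
As a rough idea of what the CSS selector route looks like, here is a minimal sketch that parses an HTML string rather than a live URL. BidReader and ReadBidAsync are hypothetical names. Note one assumed difference from the LINQ version above: the selector td.pid-1-bid matches a whole class token, while Contains("pid-1-bid") matches any substring of the class attribute.

```csharp
using System.Threading.Tasks;
using AngleSharp;

public static class BidReader
{
    // Hypothetical helper: parses an HTML string and returns the text of the first
    // <td> carrying the CSS class "pid-1-bid", or null when there is none.
    public static async Task<string> ReadBidAsync(string html)
    {
        var context = BrowsingContext.New(Configuration.Default);

        // Parse the given string directly instead of loading a URL.
        var document = await context.OpenAsync(req => req.Content(html));

        return document.QuerySelector("td.pid-1-bid")?.TextContent;
    }
}
```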
