How to get a website title from C#
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/329307/
Asked by Morten Christiansen
I'm revisiting some old code of mine and have stumbled upon a method for getting the title of a website based on its URL. It's not really what you would call a stable method: it often fails to produce a result and sometimes even produces incorrect results. It also sometimes fails to show some of the characters in the title because they use a different encoding.
Does anyone have suggestions for improvements over this old version?
public static string SuggestTitle(string url, int timeout)
{
    WebResponse response = null;
    string line = string.Empty;
    try
    {
        WebRequest request = WebRequest.Create(url);
        request.Timeout = timeout;
        response = request.GetResponse();
        Stream streamReceive = response.GetResponseStream();
        Encoding encoding = System.Text.Encoding.GetEncoding("utf-8");
        StreamReader streamRead = new System.IO.StreamReader(streamReceive, encoding);
        while(streamRead.EndOfStream != true)
        {
            line = streamRead.ReadLine();
            if (line.Contains("<title>"))
            {
                line = line.Split(new char[] { '<', '>' })[2];
                break;
            }
        }
    }
    catch (Exception) { }
    finally
    {
        if (response != null)
        {
            response.Close();
        }
    }
    return line;
}
One final note: I would like the code to run faster as well, since it blocks until the page has been fetched. If I could fetch only the site header rather than the entire page, that would be great.
Accepted answer by Timothy Khouri
A simpler way to get the content:
WebClient x = new WebClient();
string source = x.DownloadString("http://www.singingeels.com/");
A simpler, more reliable way to get the title:
string title = Regex.Match(source, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>",
RegexOptions.IgnoreCase).Groups["Title"].Value;
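As a rough illustration, the regex above could be wrapped in a small helper that also handles a missing title and decodes HTML entities such as &amp;amp;. The class and method names (TitleParser, ExtractTitle) are hypothetical and not part of the original answer; the entity decoding via WebUtility.HtmlDecode is an addition on top of what the answer shows.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original answer.
public static class TitleParser
{
    // Same pattern as in the answer, without the redundant escapes.
    private static readonly Regex TitleRegex = new Regex(
        @"<title\b[^>]*>\s*(?<Title>[\s\S]*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the decoded title text, or an empty string when no title is found.
    public static string ExtractTitle(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        Match match = TitleRegex.Match(html);
        if (!match.Success)
        {
            return string.Empty;
        }

        // Turn entities such as &amp; into their characters, then drop
        // surrounding whitespace left over from the markup.
        return WebUtility.HtmlDecode(match.Groups["Title"].Value).Trim();
    }
}
```

With the source string from above, the call would be TitleParser.ExtractTitle(source).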
Answer by Nick Berardi
In order to accomplish this, you are going to need to do a couple of things.
- Make your app threaded, so that you can process multiple requests at a time and maximize the number of HTTP requests being made.
- During the async request, download only the amount of data you actually need; you could probably parse the data as it comes back, looking for the title.
- You will probably want to use a regex to pull out the title.
I have done this before with SEO bots, and I have been able to handle almost 10,000 requests at a single time. You just need to make sure that each web request can be self-contained in a thread.
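The steps above could be sketched roughly as follows. This is only one possible reading of the answer: it swaps explicit threads for async/await with HttpClient, and the names TitleFetcher, ReadTitleAsync, FetchTitleAsync and the maxChars limit are assumptions made for illustration. It also assumes UTF-8 (or a byte order mark) and ignores the charset declared in the response headers.

```csharp
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper, not from the original answer.
public static class TitleFetcher
{
    private static readonly HttpClient Client = new HttpClient();

    private static readonly Regex TitleRegex = new Regex(
        @"<title\b[^>]*>\s*(?<Title>[\s\S]*?)</title>",
        RegexOptions.IgnoreCase);

    // Reads the markup in small chunks and stops as soon as a complete
    // <title> element has arrived, or after roughly maxChars characters.
    public static async Task<string> ReadTitleAsync(TextReader reader, int maxChars = 65536)
    {
        var buffer = new char[4096];
        var html = new StringBuilder();

        while (html.Length < maxChars)
        {
            int read = await reader.ReadAsync(buffer, 0, buffer.Length);
            if (read == 0)
            {
                break; // end of stream without a title
            }

            html.Append(buffer, 0, read);

            Match match = TitleRegex.Match(html.ToString());
            if (match.Success)
            {
                return match.Groups["Title"].Value.Trim();
            }
        }

        return string.Empty;
    }

    // Starts reading once the response headers are in, so the rest of the
    // body is not downloaded after the title has been found.
    public static async Task<string> FetchTitleAsync(string url)
    {
        using (HttpResponseMessage response =
            await Client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
        {
            response.EnsureSuccessStatusCode();
            using (Stream stream = await response.Content.ReadAsStreamAsync())
            using (var reader = new StreamReader(stream, Encoding.UTF8, true))
            {
                return await ReadTitleAsync(reader);
            }
        }
    }
}
```

Several pages can then be requested concurrently with something like await Task.WhenAll(urls.Select(u => TitleFetcher.FetchTitleAsync(u))), which needs using System.Linq. A failed request throws here; whether to catch that per URL is a design choice left open.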
Answer by Roberto B
Perhaps this suggestion opens up a new world for you. I had the same question and arrived at the following.
Download "Html Agility Pack" from http://html-agility-pack.net/?z=codeplex
Or get it from NuGet: https://www.nuget.org/packages/HtmlAgilityPack/ and add it as a reference.
Add the following using directive to the code file:
using HtmlAgilityPack;
Write the following code in your method:
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var title = document.DocumentNode.SelectSingleNode("html/head/title").InnerText;
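One caveat with the snippet above: SelectSingleNode returns null when the XPath matches nothing, so a page without a title would throw a NullReferenceException. A slightly more defensive variant might look like the sketch below. The wrapper name AgilityTitle and the looser "//title" XPath are assumptions for illustration, not part of the original answer.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not from the original answer.
public static class AgilityTitle
{
    // Returns the page title, or an empty string when the document has none.
    public static string GetTitle(HtmlDocument document)
    {
        // "//title" matches the first title element anywhere in the
        // document, which tolerates markup without an explicit <head>.
        HtmlNode node = document.DocumentNode.SelectSingleNode("//title");
        if (node == null)
        {
            return string.Empty;
        }

        // InnerText keeps entities such as &amp; as written; decode them.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

Used with the code above, the call would be AgilityTitle.GetTitle(webGet.Load(url)).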
Sources:
https://codeshare.co.uk/blog/how-to-scrape-meta-data-from-a-url-using-htmlagilitypack-in-c/ HtmlAgilityPack obtain Title and meta