PHP: get website title via link
Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/4348912/
Get title of website via link
Asked by Noob
Notice how Google News has sources at the bottom of each article excerpt.
The Guardian - ABC News - Reuters - Bloomberg
I'm trying to imitate that.
For example, upon submitting the URL http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/
I want to return "The Washington Times".
How is this possible with PHP?
Answer by Jose Vega
My answer expands on @AI W's answer of using the title of the page. Below is the code to accomplish what he said.
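<?php
function get_title($url){
    $str = @file_get_contents($url); // returns false when the request fails
    if(is_string($str) && strlen($str)>0){
        $str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
        // non-greedy, case-insensitive, and tolerant of attributes on <title>
        if(preg_match('/<title\b[^>]*>(.*?)<\/title>/i', $str, $title)){
            return trim($title[1]);
        }
    }
    return null; // page could not be fetched or has no <title>
}
//Example:
echo get_title("http://www.washingtontimes.com/");
?>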
Output:
Washington Times - Politics, Breaking News, US and World News
As you can see, this is not exactly what Google is using, which leads me to believe that they take a URL's hostname and match it against their own list.
http://www.washingtontimes.com/ => The Washington Times
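The hostname-matching idea could be sketched roughly as below. The function name publisher_for_url and the $publishers table are hypothetical, made up for illustration; nothing here is known about how Google actually maintains its list.

```php
<?php
// Hypothetical lookup table: hostname (without "www.") => display name.
$publishers = [
    'washingtontimes.com' => 'The Washington Times',
    'reuters.com'         => 'Reuters',
];

// Return the display name for a URL's host, or null if the host is unknown.
function publisher_for_url(string $url, array $publishers): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // malformed URL or no host component
    }
    // Normalize: lowercase and strip a leading "www."
    $host = preg_replace('/^www\./', '', strtolower($host));
    return $publishers[$host] ?? null;
}

// prints: The Washington Times
echo publisher_for_url('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/', $publishers), "\n";
```

A table like this has to be maintained by hand, so falling back to the page title when the host is missing from the list is probably a sensible default.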
Answer by Matthew
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/');
$xpath = new DOMXPath($doc);
echo $xpath->query('//title')->item(0)->nodeValue."\n";
Output:
Debt commission falls short on test vote - Washington Times
Obviously you should also implement basic error handling.
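As one possible shape for that error handling, here is a minimal sketch that works on an HTML string that has already been fetched. The function name title_from_html is an assumption for illustration and is not part of the original answer; fetching the page, and checking that the fetch succeeded, is left to the caller.

```php
<?php
// Extract the <title> text from an HTML string.
// Returns null when the input is empty, unparseable, or has no usable title.
function title_from_html(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects an empty string
    }
    // Collect parser warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    // item(0) is null when the document has no <title> element.
    $node = $doc->getElementsByTagName('title')->item(0);
    if ($node === null) {
        return null;
    }
    $title = trim($node->nodeValue);
    return $title === '' ? null : $title;
}
```

A caller might pass in the result of file_get_contents() after confirming it is not false, and treat a null return as "no title available".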
Answer by James Sumners
You could fetch the contents of the URL and do a regular expression search for the content of the title element.
<?php
$urlContents = file_get_contents("http://example.com/");
preg_match("/<title>(.*)<\/title>/i", $urlContents, $matches);
print($matches[1] . "\n"); // "Example Web Page"
?>
Or, if you don't want to use a regular expression (to match something very near the top of the document), you could use a DOMDocument object:
<?php
$urlContents = file_get_contents("http://example.com/");
$dom = new DOMDocument();
@$dom->loadHTML($urlContents);
$title = $dom->getElementsByTagName('title');
print($title->item(0)->nodeValue . "\n"); // "Example Web Page"
?>
I leave it up to you to decide which method you like best.
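If you go the regular expression route, a slightly more tolerant pattern may help with real-world pages. The sketch below is an illustration only (the function name title_by_regex is made up here); it assumes the page is UTF-8 and handles attributes on the title tag, line breaks inside it, and HTML entities.

```php
<?php
// Regex variant that tolerates attributes on <title> and line breaks inside it.
function title_by_regex(string $html): ?string
{
    // [^>]* allows attributes; (.*?) is non-greedy; /s lets "." match newlines.
    if (!preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return null;
    }
    // Collapse runs of whitespace, then decode entities such as &amp;.
    $title = trim(preg_replace('/\s+/', ' ', $m[1]));
    return html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
```

A regular expression still cannot cope with every malformed document, which is one reason the DOMDocument approach is often preferred.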
Answer by Cups
Using get_meta_tags() on the domain's home page brings back something that might need truncating but could be useful.
$b = "http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/" ;
$url = parse_url( $b ) ;
$tags = get_meta_tags( $url['scheme'].'://'.$url['host'] );
var_dump( $tags );
The result includes the description 'The Washington Times delivers breaking news and commentary on the issues that affect the future of our nation.'
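To try get_meta_tags() without a network request, the sketch below writes some sample markup to a temporary file and reads the tags back. The markup and its shortened description are invented for illustration; a real home page will carry different tags.

```php
<?php
// Write sample markup to a temporary file so get_meta_tags() can be tried offline.
$sample = '<html><head>'
        . '<meta name="description" content="The Washington Times delivers breaking news.">'
        . '<meta name="keywords" content="news, politics">'
        . '</head><body></body></html>';
$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($path, $sample);

// get_meta_tags() returns an array keyed by the lowercased name attribute.
$tags = get_meta_tags($path);
unlink($path);

// Fall back to an empty string when the page has no description.
$description = $tags['description'] ?? '';
echo $description, "\n";
```

Note that get_meta_tags() only reads tags with a name attribute, so tags that use a property attribute instead would need a DOM-based approach.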
Answer by Novikov
<?php
$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
PHP manual on Perl regex matching
<?php
$subject = "abcdef";
$pattern = '/^def/';
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 3);
print_r($matches);
?>
And putting those two together:
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
$pattern = '/<title>([^<]*)<\/title>/i';
preg_match($pattern, $output, $matches);
print_r($matches);
// close curl resource to free up system resources
curl_close($ch);
?>
I can't promise this example will work since I don't have PHP here, but it should help you get started.
Answer by Sudhir Jonathan
If you're willing to use a third party service for this, I just built one at www.runway7.net/radar
It gives you the title, description and much more. For instance, try your example on Radar. (http://radar.runway7.net/?url=http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/)
Answer by Kise Xu
Get the title of a website via its link and convert the title to UTF-8 character encoding:
https://gist.github.com/kisexu/b64bc6ab787f302ae838
function getTitle($url)
{
    // get html via url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        return 'Untitled'; // the request failed
    }
    // get title
    preg_match('/(?<=<title>).+(?=<\/title>)/iU', $html, $match);
    $title = empty($match[0]) ? 'Untitled' : $match[0];
    $title = trim($title);
    // convert title to utf-8 character encoding
    if ($title != 'Untitled') {
        preg_match('/(?<=charset\=).+(?=\")/iU', $html, $match);
        if (!empty($match[0])) {
            $charset = str_replace('"', '', $match[0]);
            $charset = str_replace("'", '', $charset);
            $charset = strtolower( trim($charset) );
            if ($charset != 'utf-8') {
                // keep the original title if iconv cannot convert it
                $converted = @iconv($charset, 'utf-8', $title);
                if ($converted !== false) {
                    $title = $converted;
                }
            }
        }
    }
    return $title;
}
Answer by xianyu
I wrote a function to handle it:
function getURLTitle($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $content = curl_exec($ch);
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    $charset = '';
    if($contentType && preg_match('/\bcharset=([\w-]+)/i', $contentType, $matches)){
        $charset = $matches[1];
    }
    curl_close($ch);
    if(is_string($content) && preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $content, $matches)){
        $title = trim($matches[1]);
        if(!$charset && preg_match_all('/<meta\b[^>]*>/i', $content, $matches)){
            //order:
            //http header content-type
            //meta http-equiv content-type
            //meta charset
            //preg_match_all() stores the full matches in $matches[0]
            foreach($matches[0] as $match){
                $match = strtolower($match);
                if(strpos($match, 'content-type') !== false && preg_match('/\bcharset=([\w-]+)/', $match, $ms)){
                    $charset = $ms[1];
                    break;
                }
            }
            if(!$charset){
                //meta charset=utf-8
                //meta charset='utf-8'
                foreach($matches[0] as $match){
                    $match = strtolower($match);
                    if(preg_match('/\bcharset=[\'"]?([\w-]+)/', $match, $ms)){
                        $charset = $ms[1];
                        break;
                    }
                }
            }
        }
        return $charset ? iconv($charset, 'utf-8', $title) : $title;
    }
    return $url;
}
It fetches the webpage content and tries to get the document's charset encoding from (highest priority to lowest):
- An HTTP "charset" parameter in a "Content-Type" field.
- A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
- The charset attribute set on an element that designates an external resource.
(see http://www.w3.org/TR/html4/charset.html)
It then uses iconv to convert the title to UTF-8 encoding.
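The detection order can also be tried in isolation, without a network request. The sketch below is a simplified illustration, not the code from the answer: the function names detect_charset and title_to_utf8 are made up here, and it takes the first meta tag that carries a charset rather than distinguishing the http-equiv form from the short meta charset form.

```php
<?php
// Pick a charset name: HTTP header first, then <meta> tags; null if none is found.
function detect_charset(?string $contentTypeHeader, string $html): ?string
{
    $pattern = '/\bcharset\s*=\s*["\']?([\w.:-]+)/i';
    if ($contentTypeHeader !== null && preg_match($pattern, $contentTypeHeader, $m)) {
        return strtolower($m[1]);
    }
    // Covers <meta http-equiv="Content-Type" content="...; charset=x"> and <meta charset="x">.
    if (preg_match_all('/<meta\b[^>]*>/i', $html, $metas)) {
        foreach ($metas[0] as $meta) {
            if (preg_match($pattern, $meta, $m)) {
                return strtolower($m[1]);
            }
        }
    }
    return null;
}

// Convert a title to UTF-8; leave it untouched when the charset is unknown,
// already UTF-8, or not convertible by iconv.
function title_to_utf8(string $title, ?string $charset): string
{
    if ($charset === null || $charset === 'utf-8') {
        return $title;
    }
    $converted = @iconv($charset, 'UTF-8', $title);
    return $converted === false ? $title : $converted;
}
```

Keeping detection separate from conversion makes each step easy to check against sample headers and markup before wiring it to curl.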
Answer by István Ujj-Mészáros
Alternatively, you can use Simple Html Dom Parser:
<?php
require_once('simple_html_dom.php');
$html = file_get_html('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/');
echo $html->find('title', 0)->innertext . "<br>\n";
echo $html->find('div[class=entry-content]', 0)->innertext;
Answer by Jake
I try to avoid regular expressions when they aren't necessary, so I made the function below, which gets the website title with curl and DOMDocument.
function website_title($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // some websites like Facebook need a user agent to be set.
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
    $html = curl_exec($ch);
    curl_close($ch);
    if (!is_string($html) || $html === '') {
        return null; // request failed or empty response
    }
    $dom = new DOMDocument;
    @$dom->loadHTML($html);
    // item() takes an integer index and returns null when there is no <title>
    $node = $dom->getElementsByTagName('title')->item(0);
    return $node ? $node->nodeValue : null;
}
echo website_title('https://www.facebook.com/');
The above returns the following: Welcome to Facebook - Log In, Sign Up or Learn More