Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original address: http://stackoverflow.com/questions/4348912/

Time: 2020-08-25 12:45:47  Source: igfitidea

Get title of website via link

Tags: php, function, url, hyperlink, page-title

Asked by Noob

Notice how Google News has sources at the bottom of each article excerpt.

The Guardian - ABC News - Reuters - Bloomberg

I'm trying to imitate that.

For example, upon submitting the URL http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/ I want to return The Washington Times.

How is this possible with PHP?

Answered by Jose Vega

My answer expands on @AI W's answer of using the title of the page. Below is the code to accomplish what he said.

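<?php

function get_title($url){
  $str = @file_get_contents($url); // returns false when the fetch fails
  if($str !== false && strlen($str)>0){
    $str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
    // ignore case; non-greedy so only the first <title> element is captured
    if(preg_match('/<title[^>]*>(.*?)<\/title>/i',$str,$title)){
      return trim($title[1]);
    }
  }
  return null; // fetch failed or no <title> found
}
//Example:
echo get_title("http://www.washingtontimes.com/");

?>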

OUTPUT

Washington Times - Politics, Breaking News, US and World News

As you can see, it is not exactly what Google is using, so this leads me to believe that they get a URL's hostname and match it to their own list.

http://www.washingtontimes.com/ => The Washington Times
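
The hostname-to-name matching guessed at above could be sketched as follows. This is only an illustration: the SITE_NAMES table, its entries, and the site_name() helper are hypothetical, and nothing here reflects how Google actually does it.

```php
<?php
// Hypothetical lookup table: hostname => display name (entries are illustrative).
const SITE_NAMES = [
    'washingtontimes.com' => 'The Washington Times',
    'reuters.com'         => 'Reuters',
];

// Returns the display name for a URL's host, or null when the host is unknown.
function site_name(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // not a usable absolute URL
    }
    // Normalise: lower-case and drop a leading "www."
    $host = preg_replace('/^www\./', '', strtolower($host));
    return SITE_NAMES[$host] ?? null;
}

echo site_name('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/'), "\n";
```

A table like this has to be maintained by hand, so falling back to the page title for unknown hosts is probably still needed.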

Answered by Matthew

$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/');
$xpath = new DOMXPath($doc);
echo $xpath->query('//title')->item(0)->nodeValue."\n";

Output:

Debt commission falls short on test vote - Washington Times

Obviously you should also implement basic error handling.
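
One way the basic error handling might look is sketched below. The title_from_html() name is made up for this example, and it works on an HTML string so that it runs without network access; fetching the page (and checking that the fetch succeeded) is assumed to happen beforehand.

```php
<?php
// Returns the <title> text of an HTML string, or null when it cannot be found.
function title_from_html(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $nodes = (new DOMXPath($doc))->query('//title');
    if ($nodes === false || $nodes->length === 0) {
        return null; // page has no <title>
    }
    return trim($nodes->item(0)->nodeValue);
}

$html = '<html><head><title> Debt commission falls short </title></head><body></body></html>';
var_dump(title_from_html($html));
```

Using libxml_use_internal_errors() instead of the @ operator keeps real PHP errors visible while still silencing the warnings that malformed real-world HTML tends to trigger.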

Answered by James Sumners

You could fetch the contents of the URL and do a regular expression search for the content of the title element.

<?php
$urlContents = file_get_contents("http://example.com/");
preg_match("/<title>(.*)<\/title>/i", $urlContents, $matches);

print($matches[1] . "\n"); // "Example Web Page"
?>

Or, if you don't want to use a regular expression (to match something very near the top of the document), you could use a DOMDocument object:

<?php
$urlContents = file_get_contents("http://example.com/");

$dom = new DOMDocument();
@$dom->loadHTML($urlContents);

$title = $dom->getElementsByTagName('title');

print($title->item(0)->nodeValue . "\n"); // "Example Web Page"
?>

I leave it up to you to decide which method you like best.
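
To help with that choice, here is a small offline comparison. The sample markup is invented for this example; it shows one case where a strict pattern like the one above misses the title while DOMDocument still finds it.

```php
<?php
$html = '<html><head><title id="t">Hello</title></head><body></body></html>';

// A strict pattern expects a bare <title> tag, so the attribute defeats it.
$regexFound = preg_match("/<title>(.*)<\/title>/i", $html, $matches);

// DOMDocument parses the markup, so attributes on the tag do not matter.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$domTitle = $dom->getElementsByTagName('title')->item(0)->nodeValue;

var_dump($regexFound); // int(0)
var_dump($domTitle);   // string(5) "Hello"
```

The regular expression can of course be loosened to accept attributes, but the parser handles such variations without extra work.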

Answered by Cups

Calling get_meta_tags() on the domain's home page brings back something that might need truncating but could be useful.

$b = "http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/" ;

$url = parse_url( $b ) ;

$tags = get_meta_tags( $url['scheme'].'://'.$url['host'] );
var_dump( $tags );

The result includes the description 'The Washington Times delivers breaking news and commentary on the issues that affect the future of our nation.'
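
The truncating mentioned above might be done along these lines. The shorten() helper and the 40-character limit are made up for this example, and the mbstring extension is assumed to be available; the description is hard-coded here rather than read from $tags['description'] so the sketch runs offline.

```php
<?php
// Cut $text to at most $max characters without splitting a word.
function shorten(string $text, int $max = 60): string
{
    $text = trim($text);
    if (mb_strlen($text) <= $max) {
        return $text; // already short enough
    }
    $cut = mb_substr($text, 0, $max);
    $lastSpace = mb_strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = mb_substr($cut, 0, $lastSpace); // back up to the last whole word
    }
    return rtrim($cut, " ,.;:") . '...';
}

$description = 'The Washington Times delivers breaking news and commentary on the issues that affect the future of our nation.';
echo shorten($description, 40), "\n";
```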

Answered by Novikov

PHP manual on cURL

<?php

$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");

curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);

curl_exec($ch);
curl_close($ch);
fclose($fp);
?>

PHP manual on Perl regex matching

<?php
$subject = "abcdef";
$pattern = '/^def/';
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 3);
print_r($matches);
?>

And putting those two together:

<?php 
// create curl resource 
$ch = curl_init(); 

// set url 
curl_setopt($ch, CURLOPT_URL, "example.com"); 

//return the transfer as a string 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 

// $output contains the output string 
$output = curl_exec($ch); 

$pattern = '/<title[^>]*>([^<]*)<\/title>/i';

preg_match($pattern, $output, $matches);

print_r($matches);

// close curl resource to free up system resources 
curl_close($ch);      
?>

I can't promise this example will work since I don't have PHP here, but it should help you get started.

Answered by Sudhir Jonathan

If you're willing to use a third-party service for this, I just built one at www.runway7.net/radar.

Gives you title, description and much more. For instance, try your example on Radar. (http://radar.runway7.net/?url=http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/)

Answered by Kise Xu

Get the title of a website via its link and convert the title to UTF-8 character encoding:

https://gist.github.com/kisexu/b64bc6ab787f302ae838

function getTitle($url)
{
    // get html via url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // get title
    preg_match('/(?<=<title>).+(?=<\/title>)/iU', $html, $match);
    $title = empty($match[0]) ? 'Untitled' : $match[0];
    $title = trim($title);

    // convert title to utf-8 character encoding
    if ($title != 'Untitled') {
        preg_match('/(?<=charset\=).+(?=\")/iU', $html, $match);
        if (!empty($match[0])) {
            $charset = str_replace('"', '', $match[0]);
            $charset = str_replace("'", '', $charset);
            $charset = strtolower( trim($charset) );
            if ($charset != 'utf-8') {
                $title = iconv($charset, 'utf-8', $title);
            }
        }
    }

    return $title;
}

Answered by xianyu

I wrote a function to handle it:

function getURLTitle($url){

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $content = curl_exec($ch);

    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    $charset = '';

    // 1. charset parameter of the HTTP Content-Type header
    if($contentType && preg_match('/\bcharset=["\']?([\w-]+)/i', $contentType, $matches)){
        $charset = $matches[1];
    }

    curl_close($ch);

    // non-greedy with the "s" flag: the title may span several lines
    if(is_string($content) && preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $content, $matches)){
        $title = trim($matches[1]);

        // preg_match_all() puts the full matches in $metas[0]
        if(!$charset && preg_match_all('/<meta\b[^>]*>/i', $content, $metas)){
            // 2. <meta http-equiv="Content-Type" content="...; charset=...">
            foreach($metas[0] as $meta){
                if(stripos($meta, 'content-type') !== false && preg_match('/\bcharset=["\']?([\w-]+)/i', $meta, $ms)){
                    $charset = $ms[1];
                    break;
                }
            }

            if(!$charset){
                // 3. <meta charset=utf-8> or <meta charset='utf-8'>
                foreach($metas[0] as $meta){
                    if(preg_match('/\bcharset=["\']?([\w-]+)/i', $meta, $ms)){
                        $charset = $ms[1];
                        break;
                    }
                }
            }
        }

        // convert only when a non-UTF-8 charset was detected
        if($charset && strcasecmp($charset, 'utf-8') !== 0){
            $converted = @iconv($charset, 'utf-8', $title);
            return $converted !== false ? $converted : $title;
        }
        return $title;
    }

    return $url;
}

It fetches the webpage content and tries to determine the document's character encoding from the following sources (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.

(see http://www.w3.org/TR/html4/charset.html)

It then uses iconv to convert the title to UTF-8 encoding.
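
The conversion step on its own can be sketched like this. The title_to_utf8() helper and the sample string are hypothetical; the charset name is assumed to have been detected already by one of the methods listed above.

```php
<?php
// Convert $title to UTF-8 given the charset name detected for the page.
function title_to_utf8(string $title, string $charset): string
{
    $charset = strtolower(trim($charset));
    if ($charset === '' || $charset === 'utf-8' || $charset === 'utf8') {
        return $title; // unknown or already UTF-8: leave untouched
    }
    $converted = @iconv($charset, 'UTF-8', $title);
    return $converted === false ? $title : $converted; // fall back on unsupported charsets
}

$latin1 = "Caf\xE9 news"; // "Café news" encoded as ISO-8859-1
echo title_to_utf8($latin1, 'ISO-8859-1'), "\n";
```

Returning the original string when iconv() fails is a design choice for this sketch; a stricter caller might prefer to report the failure instead.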

Answered by István Ujj-Mészáros

Alternatively you can use Simple Html Dom Parser:

<?php
require_once('simple_html_dom.php');

$html = file_get_html('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/');

echo $html->find('title', 0)->innertext . "<br>\n";

echo $html->find('div[class=entry-content]', 0)->innertext;

Answered by Jake

I try to avoid regular expressions when they aren't necessary, so I have made a function below that gets the website title with cURL and DOMDocument.

function website_title($url) {
   $ch = curl_init();
   curl_setopt($ch, CURLOPT_URL, $url);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   // some websites like Facebook need a user agent to be set.
   curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
   $html = curl_exec($ch);
   curl_close($ch);

   $dom  = new DOMDocument;
   @$dom->loadHTML($html);

   $node = $dom->getElementsByTagName('title')->item(0); // integer index, not the string '0'
   return $node ? trim($node->nodeValue) : null; // null when the page has no <title>
}

echo website_title('https://www.facebook.com/');

The above returns the following: Welcome to Facebook - Log In, Sign Up or Learn More