php 从外部网站获取标题和元标记
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/3711357/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Getting title and meta tags from external website
提问by MacMac
I want to try figure out how to get the
我想弄清楚如何获得
<title>A common title</title>
<meta name="keywords" content="Keywords blabla" />
<meta name="description" content="This is the description" />
Even though if it's arranged in any order, I've heard of the PHP Simple HTML DOM Parser but I don't really want to use it. Is it possible for a solution except using the PHP Simple HTML DOM Parser.
即使它以任何顺序排列,我也听说过 PHP Simple HTML DOM Parser 但我并不真的想使用它。除了使用 PHP 简单 HTML DOM 解析器之外,是否有可能的解决方案。
preg_match
will not be able to do it if it's invalid HTML?
preg_match
如果它是无效的 HTML,将无法做到这一点?
Can cURL do something like this with preg_match?
cURL 可以用 preg_match 做这样的事情吗?
Facebook does something like this but it's properly used by using:
Facebook 做了类似的事情,但它可以通过以下方式正确使用:
<meta property="og:description" content="Description blabla" />
I want something like this so that it is possible when someone posts a link, it should retrieve the title and the meta tags. If there are no meta tags, then it it ignored or the user can set it themselves (but I'll do that later on myself).
我想要这样的东西,以便当有人发布链接时,它应该检索标题和元标记。如果没有元标记,那么它会被忽略,或者用户可以自己设置它(但我稍后会自己做)。
回答by shamittomar
This is the way it should be:
这是应该的方式:
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl("http://example.com/");
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $nodes->item(0)->nodeValue;
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description')
$description = $meta->getAttribute('content');
if($meta->getAttribute('name') == 'keywords')
$keywords = $meta->getAttribute('content');
}
echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";
回答by Bob Jeey
<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');
// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author']; // name
echo $tags['keywords']; // php documentation
echo $tags['description']; // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>
回答by Lloyd Moore
get_meta_tags
will help you with all but the title. To get the title just use a regex.
get_meta_tags
将帮助您解决除标题之外的所有问题。要获得标题只需使用正则表达式。
$url = 'http://some.url.com';
preg_match("/<title>(.+)<\/title>/siU", file_get_contents($url), $matches);
$title = $matches[1];
Hope that helps.
希望有帮助。
回答by Addo Solutions
Php's native function: get_meta_tags()
PHP 的原生函数:get_meta_tags()
回答by Joshua
Your best bet is to bite the bullet use the DOM Parser - it's the 'right way' to do it. In the long run it'll save you more time than it takes to learn how. Parsing HTML with Regex is known to be unreliable and intolerant of special cases.
最好的办法是硬着头皮使用 DOM 解析器 - 这是做到这一点的“正确方法”。从长远来看,它会为您节省更多的时间,而不是学习如何做。众所周知,使用正则表达式解析 HTML 是不可靠的,并且不能容忍特殊情况。
回答by Harald
get_meta_tags
did not work with title.
get_meta_tags
没有与标题一起工作。
Only meta tags with name attributes like
只有具有名称属性的元标记,例如
<meta name="description" content="the description">
will be parsed.
将被解析。
回答by oknate
Unfortunately, the built in php function get_meta_tags() requires the name parameter, and certain sites, such as twitter leave that off in favor of the property attribute. This function, using a mix of regex and dom document, will return a keyed array of metatags from a webpage. It checks for the name parameter, then the property parameter. This has been tested on instragram, pinterest and twitter.
不幸的是,内置的 php 函数 get_meta_tags() 需要 name 参数,而某些站点(例如 twitter)为了支持 property 属性而将其保留。此函数混合使用正则表达式和 dom 文档,将从网页中返回元标记的键控数组。它检查名称参数,然后检查属性参数。这已经在 instragram、pinterest 和 twitter 上进行了测试。
/**
* Extract metatags from a webpage
*/
function extract_tags_from_url($url) {
$tags = array();
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$contents = curl_exec($ch);
curl_close($ch);
if (empty($contents)) {
return $tags;
}
if (preg_match_all('/<meta([^>]+)content="([^>]+)>/', $contents, $matches)) {
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8" ?>' . implode($matches[0]));
$tags = array();
foreach($doc->getElementsByTagName('meta') as $metaTag) {
if($metaTag->getAttribute('name') != "") {
$tags[$metaTag->getAttribute('name')] = $metaTag->getAttribute('content');
}
elseif ($metaTag->getAttribute('property') != "") {
$tags[$metaTag->getAttribute('property')] = $metaTag->getAttribute('content');
}
}
}
return $tags;
}
回答by cronoklee
Shouldnt we be using OG?
我们不应该使用OG吗?
The chosen answer is good but doesn't work when a site is redirected(very common!), and doesn't return OG tags, which are the new industry standard. Here's a little function which is a bit more usable in 2018. It tries to get OG tags and falls back to meta tags if it cant them:
选择的答案很好,但在网站重定向时不起作用(非常常见!),并且不返回OG 标签,这是新的行业标准。这是一个在 2018 年更有用的小功能。它尝试获取 OG 标签,如果无法获取,则回退到元标签:
function getSiteOG( $url, $specificTags=0 ){
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($url));
$res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;
foreach ($doc->getElementsByTagName('meta') as $m){
$tag = $m->getAttribute('name') ?: $m->getAttribute('property');
if(in_array($tag,['description','keywords']) || strpos($tag,'og:')===0) $res[str_replace('og:','',$tag)] = $m->getAttribute('content');
}
return $specificTags? array_intersect_key( $res, array_flip($specificTags) ) : $res;
}
How to use it:
如何使用它:
/////////////
//SAMPLE USAGE:
print_r(getSiteOG("http://www.stackoverflow.com")); //note the incorrect url
/////////////
//OUTPUT:
Array
(
[title] => Stack Overflow - Where Developers Learn, Share, & Build Careers
[description] => Stack Overflow is the largest, most trusted online community for developers to learn, sharea atheir programming aknowledge, and build their careers.
[type] => website
[url] => https://stackoverflow.com/
[site_name] => Stack Overflow
[image] => https://cdn.sstatic.net/Sites/stackoverflow/img/[email protected]?v=73d79a89bded
)
回答by Jay Dave
Easy and php's in-built function.
Easy 和 php 的内置功能。
回答by sebilasse
We use Apache Tikavia php (command line utility) with -j for json :
我们通过 php(命令行实用程序)使用Apache Tika和 -j 用于 json :
<?php
shell_exec( 'java -jar tika-app-1.4.jar -j http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying' );
?>
This is a sample outputfrom a random guardian article :
这是随机监护人文章的示例输出:
{
"Content-Encoding":"UTF-8",
"Content-Length":205599,
"Content-Type":"text/html; charset\u003dUTF-8",
"DC.date.issued":"2013-07-21",
"X-UA-Compatible":"IE\u003dEdge,chrome\u003d1",
"application-name":"The Guardian",
"article:author":"http://www.guardian.co.uk/profile/nicholaswatt",
"article:modified_time":"2013-07-21T22:42:21+01:00",
"article:published_time":"2013-07-21T22:00:03+01:00",
"article:section":"Politics",
"article:tag":[
"Lynton Crosby",
"Health policy",
"NHS",
"Health",
"Healthcare industry",
"Society",
"Public services policy",
"Lobbying",
"Conservatives",
"David Cameron",
"Politics",
"UK news",
"Business"
],
"content-id":"/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
"dc:title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
"description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
"fb:app_id":180444840287,
"keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
"msapplication-TileColor":"#004983",
"msapplication-TileImage":"http://static.guim.co.uk/static/a314d63c616d4a06f5ec28ab4fa878a11a692a2a/common/images/favicons/windows_tile_144_b.png",
"news_keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
"og:description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
"og:image":"https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pixies/2013/7/21/1374433351329/Lynton-Crosby-008.jpg",
"og:site_name":"the Guardian",
"og:title":"Tory strategist Lynton Crosby in new lobbying row",
"og:type":"article",
"og:url":"http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
"resourceName":"tory-strategist-lynton-crosby-lobbying",
"title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
"twitter:app:id:googleplay":"com.guardian",
"twitter:app:id:iphone":409128287,
"twitter:app:name:googleplay":"The Guardian",
"twitter:app:name:iphone":"The Guardian",
"twitter:app:url:googleplay":"guardian://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
"twitter:card":"summary_large_image",
"twitter:site":"@guardian"
}