Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/138313/
How to extract img src, title and alt from html using php?
Asked by Sam
I would like to create a page where all images that reside on my website are listed with their title and alternative text.
I already wrote a little program to find and load all HTML files, but now I am stuck on how to extract src, title and alt from this HTML:
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
I guess this should be done with some regex, but since the order of the attributes may vary, and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard char-by-char way, but that's painful).
Answered by karim
$url = "http://example.com";
$html = file_get_contents($url);

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about malformed markup

// Print the src attribute of every img element
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
    echo $tag->getAttribute('src');
}
Answered by e-satis
EDIT: now that I know better
Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better use an HTML parser.
Solution with regexp
In that case it's better to split the process into two parts:
- get all the img tags
- extract their metadata
I will assume your doc is not strict XHTML, so you can't use an XML parser. For example, with this web page's source code:
/* preg_match_all matches the regexp against the whole $html string and outputs
   every match as an array in $result. The "i" option makes it case insensitive. */
preg_match_all('/<img[^>]+>/i', $html, $result);
print_r($result);
Array
(
[0] => Array
(
[0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
[1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
[3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
[4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[...]
)
)
Then we get all the img tag attributes with a loop:
$img = array();
foreach ($result[0] as $img_tag) // $result[0] holds the full img tag matches
{
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}
print_r($img);
Array
(
[<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
(
[0] => Array
(
[0] => src="/Content/Img/stackoverflow-logo-250.png"
[1] => alt="logo link to homepage"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "/Content/Img/stackoverflow-logo-250.png"
[1] => "logo link to homepage"
)
)
[<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-up.png"
[1] => alt="vote up"
[2] => title="This was helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-up.png"
[1] => "vote up"
[2] => "This was helpful (click again to undo)"
)
)
[<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-down.png"
[1] => alt="vote down"
[2] => title="This was not helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-down.png"
[1] => "vote down"
[2] => "This was not helpful (click again to undo)"
)
)
[<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
(
[0] => Array
(
[0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => alt="gravatar image"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => "gravatar image"
)
)
[..]
)
)
Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can build your own by using ob_start and loading / saving from a text file.
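The caching idea can be sketched roughly as follows. This is a minimal sketch and not part of the original answer: the helper name render_cached, the cache file location and the one-hour lifetime are all assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: return the cached page if it is still fresh,
// otherwise build it, capture the output with ob_start() and save it.
function render_cached(string $cacheFile, int $maxAge, callable $buildPage): string
{
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
        return file_get_contents($cacheFile);
    }
    ob_start();                 // start capturing everything that is echoed
    $buildPage();               // the expensive part (regexps, file scanning...)
    $output = ob_get_clean();   // stop capturing and fetch the buffer
    file_put_contents($cacheFile, $output, LOCK_EX);
    return $output;
}

// Assumed cache location and lifetime (3600 seconds).
$cacheFile = sys_get_temp_dir() . '/image-list.cache.html';
echo render_cached($cacheFile, 3600, function () {
    echo "<ul><li>/image/fluffybunny.jpg</li></ul>";
});
```

The closure stands in for whatever code prints the image list; on a cache hit it is not called at all.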
How does this stuff work?
First, we use preg_match_all, a function that gets every string matching the pattern and outputs it in its third parameter.
The regexps:
<img[^>]+>
We apply it to the HTML of every web page. It can be read as: every string that starts with "<img", contains one or more non-">" characters and ends with a ">".
(alt|title|src)=("[^"]*")
We apply it successively to each img tag. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a ' " ', a bunch of characters that are not ' " ', and ending with a ' " '. The parentheses isolate the sub-strings.
Finally, whenever you deal with regexps, it is handy to have good tools to test them quickly, such as an online regexp tester.
EDIT: answer to the first comment.
It's true that I did not think about the (hopefully few) people using single quotes.
Well, if you use only ', just replace all the " by '.
If you mix both, first you should slap yourself :-), then try to use ("|') instead of " and [^?] to replace [^"].
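One way to handle mixed quoting is sketched below. It is an assumption on my part rather than the answer's own pattern: it captures the opening quote and uses a backreference so the value must end with the same quote character. The sample tags and the $images array are illustrative only.

```php
<?php
// (["']) captures the opening quote; \2 requires the same quote to close the value.
// (?<![\w-]) keeps names such as "data-src" from being read as "src".
$pattern = '/(?<![\w-])(alt|title|src)\s*=\s*(["\'])(.*?)\2/is';

$tags = [
    '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />',
    "<img alt='a big green prickly cactus' src='/image/pricklycactus.jpg' />",
];

$images = [];
foreach ($tags as $tag) {
    preg_match_all($pattern, $tag, $matches, PREG_SET_ORDER);
    $attrs = [];
    foreach ($matches as $match) {
        // $match[1] is the attribute name, $match[3] the value without quotes
        $attrs[strtolower($match[1])] = $match[3];
    }
    $images[] = $attrs;
}
print_r($images);
```

Unquoted values (src=foo.jpg) are still not matched by this pattern, which is one more reason to prefer an HTML parser.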
Answered by Stefan Gehrig
Just to give a small example of using PHP's XML functionality for the task:
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml = simplexml_import_dom($doc); // just to make xpath more simple
$images = $xml->xpath('//img');
foreach ($images as $img) {
    echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}
I used the DOMDocument::loadHTML() method because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary; it just makes using xpath and the xpath results simpler.
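For comparison, here is a sketch of the same query without the SimpleXML conversion, using DOMXPath directly on the DOMDocument. The $found array is only there to collect the results for illustration.

```php
<?php
libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><img src="myimage.jpg" title="title" alt="alt"></body></html>');

// Run the XPath query against the DOM itself
$xpath = new DOMXPath($doc);
$found = [];
foreach ($xpath->query('//img') as $img) {
    $found[] = [
        'src'   => $img->getAttribute('src'),
        'title' => $img->getAttribute('title'),
        'alt'   => $img->getAttribute('alt'),
    ];
}
print_r($found);
```

The trade-off is a slightly more verbose loop (getAttribute() calls instead of array access) in exchange for skipping one conversion step.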
Answered by DreamWerx
If it's XHTML, which your example is, you only need SimpleXML.
<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>
Output:
object(SimpleXMLElement)#1 (1) {
["@attributes"]=>
array(3) {
["src"]=>
string(22) "/image/fluffybunny.jpg"
["title"]=>
string(16) "Harvey the bunny"
["alt"]=>
string(26) "a cute little fluffy bunny"
}
}
Answered by Bakudan
The loop in the regexp script must iterate over $result[0], like this:
foreach( $result[0] as $img_tag)
because preg_match_all returns an array of arrays.
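A small sketch of why the index is needed, using a made-up two-image HTML string: with the default flags, preg_match_all() puts the full matches in $result[0], so that is the array to loop over.

```php
<?php
// Illustrative input; any HTML string would do.
$html = '<p><img src="a.png" alt="A"><img src="b.png" alt="B" title="Bee"></p>';

// $result is an array of arrays: $result[0] contains the complete <img ...> tags.
preg_match_all('/<img[^>]+>/i', $html, $result);

$img = array();
foreach ($result[0] as $img_tag) {
    // Each tag string becomes a key holding the attribute matches for that tag
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}
print_r($img);
```

Looping over $result itself would hand the inner array, not a tag string, to the second preg_match_all() call.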
Answered by Nauphal
You may use simplehtmldom. It supports most of the jQuery selectors. An example is given below:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';
Answered by WNRosenberg
I used preg_match to do it.
In my case, I had a string containing exactly one <img> tag (and no other markup) that I got from WordPress, and I was trying to get the src attribute so I could run it through timthumb.
// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);
// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);
To grab the title or the alt instead, you could simply use $pattern = '/title="([^"]*)"/'; to grab the title or $pattern = '/alt="([^"]*)"/'; to grab the alt. Sadly, my regex isn't good enough to grab all three (alt/title/src) in one pass though.
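Grabbing all three in one pass might look like the sketch below. It assumes double-quoted attribute values, as in the patterns above, and the $image string is a made-up stand-in for the tag returned by WordPress.

```php
<?php
// Assumed sample of a single <img> tag string.
$image = '<img width="150" height="150" src="/uploads/photo.jpg" class="attachment-post-thumbnail" alt="A photo" title="My photo" />';

// Capture the attribute name and its value together, in whatever order they appear.
preg_match_all('/(?<![\w-])(src|title|alt)="([^"]*)"/i', $image, $matches);

// $matches[1] holds the names and $matches[2] the values; pair them up.
$attrs = array_combine(array_map('strtolower', $matches[1]), $matches[2]);

$src   = $attrs['src']   ?? '';
$title = $attrs['title'] ?? '';
$alt   = $attrs['alt']   ?? '';

echo $src, "\n", $title, "\n", $alt, "\n";
```

Missing attributes simply fall back to an empty string through the ?? operator.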
Answered by mickmackusa
I have read the many comments on this page that complain that using a dom parser is unnecessary overhead. Well, it may be more expensive than a mere regex call, but the OP has stated that there is no control over the order of the attributes in the img tags. This fact leads to unnecessary regex pattern convolution. Beyond that, using a dom parser provides the additional benefits of readability, maintainability, and dom-awareness (regex is not dom-aware).
I love regex and I answer lots of regex questions, but when dealing with valid HTML there is seldom a good reason to use regex over a parser.
In the demonstration below, see how easily and cleanly DOMDocument handles img tag attributes in any order with a mixture of quoting (and no quoting at all). Also notice that tags without a targeted attribute are not disruptive at all: an empty string is provided as a value.
Code:
$test = <<<HTML
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
<img src='/image/pricklycactus.jpg' title='Roger the cactus' alt='a big green prickly cactus' />
<p>This is irrelevant text.</p>
<img alt="an annoying white cockatoo" title="Polly the cockatoo" src="/image/noisycockatoo.jpg">
<img title=something src=somethingelse>
HTML;
libxml_use_internal_errors(true); // silences/forgives complaints from the parser (remove to see what is generated)
$dom = new DOMDocument();
$dom->loadHTML($test);
foreach ($dom->getElementsByTagName('img') as $i => $img) {
    echo "IMG#{$i}:\n";
    echo "\tsrc = " , $img->getAttribute('src') , "\n";
    echo "\ttitle = " , $img->getAttribute('title') , "\n";
    echo "\talt = " , $img->getAttribute('alt') , "\n";
    echo "---\n";
}
Output:
IMG#0:
src = /image/fluffybunny.jpg
title = Harvey the bunny
alt = a cute little fluffy bunny
---
IMG#1:
src = /image/pricklycactus.jpg
title = Roger the cactus
alt = a big green prickly cactus
---
IMG#2:
src = /image/noisycockatoo.jpg
title = Polly the cockatoo
alt = an annoying white cockatoo
---
IMG#3:
src = somethingelse
title = something
alt =
---
Using this technique in professional code will leave you with a clean script, fewer hiccups to contend with, and fewer colleagues that wish you worked somewhere else.
Answered by John Daliani
Here's a PHP function I cobbled together from all of the above info for a similar purpose, namely adjusting image tag width and height properties on the fly... a bit clunky, perhaps, but it seems to work dependably:
function ReSizeImagesInHTML($HTMLContent, $MaximumWidth, $MaximumHeight) {
    // find image tags
    preg_match_all('/<img[^>]+>/i', $HTMLContent, $rawimagearray, PREG_SET_ORDER);
    // put image tags in a simpler array
    $imagearray = array();
    for ($i = 0; $i < count($rawimagearray); $i++) {
        array_push($imagearray, $rawimagearray[$i][0]);
    }
    // put image attributes in another array
    $imageinfo = array();
    foreach ($imagearray as $img_tag) {
        preg_match_all('/(src|width|height)=("[^"]*")/i', $img_tag, $imageinfo[$img_tag]);
    }
    // combine everything into one array
    // (assumes each tag has double-quoted src, width and height attributes, in that order)
    $AllImageInfo = array();
    foreach ($imagearray as $img_tag) {
        $ImageSource = str_replace('"', '', $imageinfo[$img_tag][2][0]);
        $OriginalWidth = str_replace('"', '', $imageinfo[$img_tag][2][1]);
        $OriginalHeight = str_replace('"', '', $imageinfo[$img_tag][2][2]);
        $NewWidth = $OriginalWidth;
        $NewHeight = $OriginalHeight;
        $AdjustDimensions = "F";
        // too wide: shrink both dimensions by the same percentage
        if ($OriginalWidth > $MaximumWidth) {
            $diff = $OriginalWidth - $MaximumWidth;
            $percnt_reduced = (($diff / $OriginalWidth) * 100);
            $NewHeight = floor($OriginalHeight - (($percnt_reduced * $OriginalHeight) / 100));
            $NewWidth = floor($OriginalWidth - $diff);
            $AdjustDimensions = "T";
        }
        // still too tall: shrink again, starting from the values computed so far
        if ($NewHeight > $MaximumHeight) {
            $diff = $NewHeight - $MaximumHeight;
            $percnt_reduced = (($diff / $NewHeight) * 100);
            $NewWidth = floor($NewWidth - (($percnt_reduced * $NewWidth) / 100));
            $NewHeight = floor($NewHeight - $diff);
            $AdjustDimensions = "T";
        }
        $thisImageInfo = array('OriginalImageTag' => $img_tag, 'ImageSource' => $ImageSource, 'OriginalWidth' => $OriginalWidth, 'OriginalHeight' => $OriginalHeight, 'NewWidth' => $NewWidth, 'NewHeight' => $NewHeight, 'AdjustDimensions' => $AdjustDimensions);
        array_push($AllImageInfo, $thisImageInfo);
    }
    // build array of before and after tags
    $ImageBeforeAndAfter = array();
    for ($i = 0; $i < count($AllImageInfo); $i++) {
        if ($AllImageInfo[$i]['AdjustDimensions'] == "T") {
            $NewImageTag = str_ireplace('width="' . $AllImageInfo[$i]['OriginalWidth'] . '"', 'width="' . $AllImageInfo[$i]['NewWidth'] . '"', $AllImageInfo[$i]['OriginalImageTag']);
            $NewImageTag = str_ireplace('height="' . $AllImageInfo[$i]['OriginalHeight'] . '"', 'height="' . $AllImageInfo[$i]['NewHeight'] . '"', $NewImageTag);
            $thisImageBeforeAndAfter = array('OriginalImageTag' => $AllImageInfo[$i]['OriginalImageTag'], 'NewImageTag' => $NewImageTag);
            array_push($ImageBeforeAndAfter, $thisImageBeforeAndAfter);
        }
    }
    // execute search and replace
    for ($i = 0; $i < count($ImageBeforeAndAfter); $i++) {
        $HTMLContent = str_ireplace($ImageBeforeAndAfter[$i]['OriginalImageTag'], $ImageBeforeAndAfter[$i]['NewImageTag'], $HTMLContent);
    }
    return $HTMLContent;
}
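The resizing arithmetic on its own can be sketched more compactly. The helper below is hypothetical (the name fit_within is not from the answer above) and assumes positive integer dimensions; it scales a width/height pair down by a single factor so that it fits inside the maximum box while keeping the aspect ratio.

```php
<?php
// Hypothetical helper: never scales up, only down, and applies the same
// factor to both dimensions so the aspect ratio is preserved.
function fit_within(int $width, int $height, int $maxWidth, int $maxHeight): array
{
    // The smallest of the three ratios is the one that satisfies both limits.
    $scale = min(1, $maxWidth / $width, $maxHeight / $height);
    return [(int) floor($width * $scale), (int) floor($height * $scale)];
}

print_r(fit_within(800, 600, 400, 400)); // [400, 300]
print_r(fit_within(200, 100, 400, 400)); // [200, 100], already small enough
print_r(fit_within(300, 900, 400, 450)); // [150, 450]
```

Taking the minimum ratio handles the "too wide" and "too tall" cases in one step instead of two separate branches.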
Answered by Xavier
Here is THE solution, in PHP:
Just download QueryPath, and then do as follows:
$doc = qp($myHtmlDoc);
foreach ($doc->xpath('//img') as $img) {
    $src = $img->attr('src');
    $title = $img->attr('title');
    $alt = $img->attr('alt');
}
That's it, you're done!

