PHP: using regular expressions to extract the first image source from HTML code?

Question

Asked by Ahmad Fouad

I would like to know how this can be achieved.

Assume: there is a lot of HTML code containing tables, divs, images, etc.

Problem: How can I get matches for all occurrences? More specifically, how can I get the img tag source (src = ?)?

Example:

<img src="http://example.com/g.jpg" alt="" />

How can I print out http://example.com/g.jpg in this case? I want to assume that there are also other tags in the HTML code, as I mentioned, and possibly more than one image. Would it be possible to get an array of all image sources in the HTML code?

I know this can be achieved one way or another with regular expressions, but I can't get the hang of it.

Any help is greatly appreciated.

Answer 1

Answered by Andrew Moore

While regular expressions can be good for a large variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of your document is so variable that it is hard to accurately (and by accurately I mean a 100% success rate with no false positives) extract a tag.

What I recommend you do is use a DOM parser such as SimpleHTML and use it as such:

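function get_first_image($html) {
    // Load the SimpleHTML parser, which provides str_get_html().
    require_once('SimpleHTML.class.php');

    $post_html = str_get_html($html);

    // find('img', 0) returns the first <img> element, or null if there is none.
    $first_img = $post_html->find('img', 0);

    if($first_img !== null) {
        return $first_img->src;
    }

    return null;
}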

Some may think this is overkill, but in the end it will be easier to maintain and allows for more extensibility. For example, using the DOM parser, I can also get the alt attribute.
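
As a rough sketch of that extensibility point, the same idea can be tried with PHP's bundled DOM extension instead of SimpleHTML. The helper name first_image_attributes() and the sample markup are assumptions made for illustration; they are not part of the answer above.

```php
<?php
// Hypothetical helper: returns the src and alt of the first <img>, or null.
// Uses PHP's built-in DOMDocument rather than the SimpleHTML library.
function first_image_attributes(string $html): ?array
{
    // Collect parser warnings internally so sloppy markup stays quiet.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $img = $dom->getElementsByTagName('img')->item(0);
    if ($img === null) {
        return null;
    }

    // Attribute order in the markup does not matter here.
    return [
        'src' => $img->getAttribute('src'),
        'alt' => $img->getAttribute('alt'),
    ];
}

$sample = '<div><p>Intro</p><img alt="Logo" src="http://example.com/g.jpg" /></div>';
$first = first_image_attributes($sample);
echo $first['src'], "\n";
echo $first['alt'], "\n";
```

Because the parser exposes attributes by name, reading alt next to src needs no change to the matching logic.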

A regular expression could be devised to achieve the same goal, but it would be limited in that it would force the alt attribute to come after the src attribute (or the opposite), and overcoming this limitation would add more complexity to the regular expression.

Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:

<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>

And then again, the above can fail if:

The attribute or tag name is in capitals and the i modifier is not used.
Quotes are not used around the src attribute.
An attribute other than src uses the > character somewhere in its value.
Some other reason I have not foreseen.
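
The first three failure modes can be reproduced with a small experiment. This is only a sketch: the pattern below is a simplified quoted-src expression chosen for illustration (not the exact expression above), the helpers regex_src() and dom_src() are hypothetical names, and the DOM side uses PHP's built-in DOMDocument rather than SimpleHTML.

```php
<?php
// Simplified quoted-src pattern, for illustration only (an assumption).
// The "i" modifier is left out on purpose.
const QUOTED_SRC = '/<img\s[^>]*?src\s*=\s*(["\'])(.*?)\1/';

// Hypothetical helper: src of the first <img> found by the regex, or null.
function regex_src(string $html): ?string
{
    return preg_match(QUOTED_SRC, $html, $m) === 1 ? $m[2] : null;
}

// Hypothetical helper: src of the first <img> found by DOMDocument, or null.
function dom_src(string $html): ?string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $img = $dom->getElementsByTagName('img')->item(0);
    return $img !== null ? $img->getAttribute('src') : null;
}

$cases = [
    'quoted'      => '<img src="a.jpg" alt="">',
    'uppercase'   => '<IMG SRC="b.jpg">',
    'unquoted'    => '<img src=c.jpg>',
    'gt in value' => '<img alt="1 > 0" src="d.jpg">',
];

foreach ($cases as $name => $html) {
    printf(
        "%-12s regex=%-8s dom=%s\n",
        $name,
        var_export(regex_src($html), true),
        dom_src($html)
    );
}
```

Only the plainly quoted, lowercase case is matched by this pattern, while the parser reads the src in all four.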

So again, simply don't use regular expressions to parse a DOM document.

EDIT: If you want all the images:

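function get_images($html){
    // Load the SimpleHTML parser, which provides str_get_html().
    require_once('SimpleHTML.class.php');

    $post_dom = str_get_html($html);

    // Without an index, find() returns every matching element.
    $img_tags = $post_dom->find('img');

    $images = array();

    foreach($img_tags as $image) {
        $images[] = $image->src;
    }

    return $images;
}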

Answer 2

Answered by inakiabt

Use this; it is more effective:

preg_match_all('/<img [^>]*src=["\']([^"\']+)/i', $html, $matches);
foreach ($matches[1] as $key=>$value) {
    echo $value."<br>";
}

Example:

$html = '
<ul>     
  <li><a target="_new" href="http://www.manfromuranus.com">Man from Uranus</a></li>       
  <li><a target="_new" href="http://www.thevichygovernment.com/">The Vichy Government</a></li>      
  <li><a target="_new" href="http://www.cambridgepoetry.org/">Cambridge Poetry</a></li>      
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value1.jpg" />
  <li><a href="http://www.verot.net/pretty/">Electronaut Records</a></li>      
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value2.jpg" />
  <li><a target="_new" href="http://www.catseye-crew.com">Catseye Productions</a></li>     
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value3.jpg" />
</ul>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="res/upload.jpg" />
  <li><a target="_new" href="http://www.manfromuranus.com">Man from Uranus</a></li>       
  <li><a target="_new" href="http://www.thevichygovernment.com/">The Vichy Government</a></li>      
  <li><a target="_new" href="http://www.cambridgepoetry.org/">Cambridge Poetry</a></li>      
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value4.jpg" />
  <li><a href="http://www.verot.net/pretty/">Electronaut Records</a></li>      
  <img src="value5.jpg" />
  <li><a target="_new" href="http://www.catseye-crew.com">Catseye Productions</a></li>     
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value6.jpg" />
';   
preg_match_all('/<img [^>]*src=["\']([^"\']+)/i', $html, $matches);
foreach ($matches[1] as $key=>$value) {
    echo $value."<br>";
}

Output:

value1.jpg
value2.jpg
value3.jpg
res/upload.jpg
value4.jpg
value5.jpg
value6.jpg

Answer 3

Answered by ceejayoz

This works for me:

preg_match('@<img.+src="(.*)".*>@Uims', $html, $matches);
$src = $matches[1];

Answer 4

Answered by Nir Levy

I assume all your src attributes have double quotes (") around the URL:

<img[^>]+src=\"([^\"]+)\"

The other answers posted here make other assumptions about your code.
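
For illustration, that pattern could be wrapped in preg_match() roughly as follows. The helper name first_src_quoted(), the added i modifier, and the sample markup are assumptions, not part of the answer.

```php
<?php
// Hypothetical helper around the pattern above. It only recognises src
// values wrapped in double quotes, which is the stated assumption.
function first_src_quoted(string $html): ?string
{
    if (preg_match('/<img[^>]+src="([^"]+)"/i', $html, $m) === 1) {
        return $m[1]; // group 1 is the URL between the quotes
    }

    return null; // no <img> with a double-quoted src was found
}

$html = '<p>text</p><img src="http://example.com/g.jpg" alt="" />';
echo first_src_quoted($html), "\n";
```

A src written with single quotes or no quotes makes the helper return null, which is exactly the trade-off this answer accepts.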

Answer 5

Answered by Anjisan

I agree with Andrew Moore. Using the DOM is much, much better. The HTML DOM images collection gives you a reference to all image objects.

Let's say in your header you have:

<script type="text/javascript">
    function getFirstImageSource()
    {
        var img = document.images[0].src;
        return img;
    }
</script>

and then in your body you have:

<script type="text/javascript">
  alert(getFirstImageSource());
</script>

This will return the first image source. You can also loop through them along these lines (in the head section):

function getAllImageSources()
{
    var returnString = "";
    for (var i = 0; i < document.images.length; i++)
    {
        returnString += document.images[i].src + "\n";
    }
    return returnString;
}

(in body)

<script type="text/javascript">
  alert(getAllImageSources());
</script>

If you're using JavaScript to do this, remember that a function looping through the images collection can't usefully be run from your header. In other words, you can't do something like this:

<script type="text/javascript">
    function getFirstImageSource()
    {
        var img = document.images[0].src;
        return img;
    }
    window.onload = getFirstImageSource;  //bad function

</script>

because this won't work as intended. The images haven't been loaded when the header is executed, so calling the function there finds nothing, and a handler assigned to window.onload has its return value discarded.

Hopefully this can help in some way. If possible, I'd make use of the DOM. You'll find that a good deal of your work is already done for you.

Answer 6

Answered by Anthony

I don't know if you MUST use regex to get your results. If not, you could try SimpleXML and XPath, which would be much more reliable for your goal:

First, import the HTML into a DOM Document Object. If you get errors, turn errors off for this part and be sure to turn them back on afterward:

 $dom = new DOMDocument();
 $dom -> loadHTMLFile("filename.html");
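
One way to turn errors off for just this step, then back on afterward, is libxml_use_internal_errors(). The sketch below is an assumption about how that might look; it loads a small inline string with loadHTML() instead of a file so that it is self-contained, and the markup is made up for the example.

```php
<?php
// Sample markup, assumed for illustration; real pages are usually messier.
$html = '<div><img src="a.png"><section>HTML5 tag</section></div>';

// Switch libxml to internal error collection and remember the old setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Any parser complaints are available here instead of being emitted as warnings.
$problems = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error handling.
libxml_use_internal_errors($previous);

echo count($problems), " parser message(s) collected\n";
echo $dom->getElementsByTagName('img')->length, " image(s) found\n";
```

Restoring the saved setting keeps the change local to the import, so later libxml calls report errors as they did before.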

Next, import the DOM into a SimpleXML object, like so:

 $xml = simplexml_import_dom($dom);

Now you can use a few methods to get all of your image elements (and their attributes) into an array. XPath is the one I prefer, because I've had better luck with traversing the DOM with it:

 $images = $xml -> xpath('//img/@src');

This variable can now be treated like an array of your image URLs:

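 foreach($images as $image) {
    // Concatenate: variables are not interpolated inside single quotes.
    echo '<img src="' . htmlspecialchars((string) $image) . '" /><br />' . "\n";
  }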

Presto, all of your images, none of the fat.

Here's the non-annotated version of the above:

 $dom = new DOMDocument();
 $dom -> loadHTMLFile("filename.html");

 $xml = simplexml_import_dom($dom);

 $images = $xml -> xpath('//img/@src');

 foreach($images as $image) {
    echo '<img src="' . htmlspecialchars((string) $image) . '" /><br />' . "\n";
  }

Answer 7

Answered by arnaud-k

I really think you cannot predict all the cases with one regular expression.

The best way is to use the DOM with the PHP5 class DOMDocument and XPath. It's the cleanest way to do what you want.

$dom = new DOMDocument();
$dom->loadHTML( $htmlContent );
$xml = simplexml_import_dom($dom);
$images = $xml -> xpath('//img/@src');
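
The same query can also be run without the SimpleXML step, using DOMXPath directly. This is a minimal sketch under that assumption; image_sources() is a hypothetical helper name and the sample markup is invented for the example.

```php
<?php
// Hypothetical helper: returns every <img> src in document order.
function image_sources(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $sources = [];

    // //img/@src selects the src attribute node of each <img> element.
    foreach ($xpath->query('//img/@src') as $attribute) {
        $sources[] = $attribute->nodeValue;
    }

    return $sources;
}

$htmlContent = '<p><img src="value1.jpg"><img alt="x" src="value2.jpg"></p>';
$images = image_sources($htmlContent);

// The first image source, or null when the page has no images.
$first = $images[0] ?? null;
echo $first, "\n";
```

Returning plain strings rather than attribute objects makes the result easy to pass to code that expects an ordinary array.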

Answer 8

Answered by dnagirl

Since you're not worried about validating the HTML, you might try using strip_tags() on the text first to clear out most of the cruft (pass '<img>' as the allowed tag, otherwise the image tags are stripped too).

Then you can search for an expression like:

"/\<img .+ \/\>/i"

The backslashes escape special characters like <, >, and /. The .+ insists that there be one or more of any character inside the img tag. You can capture part of the expression by putting parentheses around it; e.g. (.+) captures the middle part of the img tag.

Once you decide which part of the middle you specifically want to capture, you can change the (.+) to something more specific.
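
As a sketch of both steps together: strip_tags() is given '<img>' as its allowed tag so that the image tags survive, and the generic (.+) is narrowed to capture only the src value. The sample markup, the variable names, and the assumption that src is double-quoted are all illustrative.

```php
<?php
// Sample markup, assumed for illustration.
$html = '<div><p>Text <b>bold</b></p><img alt="x" src="value5.jpg" /><a href="#">link</a></div>';

// Remove every tag except <img>; the text content stays in place.
$onlyImages = strip_tags($html, '<img>');

// Narrowed pattern: capture just the double-quoted src value of each <img>.
preg_match_all('/<img\s[^>]*?src="([^"]*)"[^>]*>/i', $onlyImages, $matches);
$sources = $matches[1];

foreach ($sources as $src) {
    echo $src, "\n";
}
```

With the other tags gone, the pattern has far less markup to be confused by, though it still inherits the usual limits of matching HTML with a regular expression.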

Answer 9

Answered by Allen Liu

You can try this:

preg_match_all('/<img\s+src="([^"]+)"/i', $html, $matches);
// $matches[1] holds the captured src values; $matches[0] holds the full matches.
foreach ($matches[1] as $key=>$value) {
    echo $key . ", " . $value . "<br>";
}

Answer 10

Answered by Arpan Das

<?php    
/* PHP Simple HTML DOM Parser @ http://simplehtmldom.sourceforge.net */

require_once('simple_html_dom.php');

$html = file_get_html('http://example.com');
$image = $html->find('img')[0]->src;

echo "<img src='{$image}'/>"; // BOOM!

PHP Simple HTML DOM Parser will do the job in a few lines of code.
