
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/18349130/

Time: 2020-08-25 17:23:05  Source: igfitidea

How to parse HTML in PHP?

php html parsing dom

Asked by laradev

I know we can use PHP DOM to parse HTML in PHP, and I found a lot of related questions here on Stack Overflow. But I have a specific requirement. I have HTML content like the following:

<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>

I want to parse the above HTML and save the content into two separate arrays, $heading and $content:

$heading = array('Chapter 1','Chapter 2','Chapter 3');
$content = array('This is chapter 1','This is chapter 2','This is chapter 3');

I can achieve this easily with jQuery, but I am not sure that is the right way. It would be great if someone could point me in the right direction. Thanks in advance.

Answered by saji89

I used DOMDocument and DOMXPath to solve this:

<?php
$dom = new DOMDocument();
$test='<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>';

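// Parse the markup, then query it with XPath.
$dom->loadHTML($test);
$xpath = new DOMXPath($dom);
$heading = parseToArray($xpath, 'Heading1-H');
$content = parseToArray($xpath, 'Normal-H');

var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";

// Collect the text of every <span> whose class attribute equals $class.
function parseToArray($xpath, $class)
{
    $xpathquery = "//span[@class='" . $class . "']";
    $elements = $xpath->query($xpathquery);

    $resultarray = array();
    // DOMXPath::query() returns false (not null) when the expression is invalid.
    if ($elements === false) {
        return $resultarray;
    }
    foreach ($elements as $element) {
        // Each child node (here, a single text node) contributes one entry.
        foreach ($element->childNodes as $node) {
            $resultarray[] = $node->nodeValue;
        }
    }
    return $resultarray;
}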

Live result: http://saji89.codepad.org/2TyOAibZ
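The exact-match test @class='...' in the answer above stops matching as soon as an element carries more than one class. A minimal sketch of a token-based variant follows; the helper name spanTextsByClass and the extra class "extra" are made up for illustration, and the sketch assumes the class name contains no quote characters.

```php
<?php
// Hypothetical helper (not from the original answer): collect the trimmed text
// of every <span> that carries $class as one of its class tokens.
// Assumes $class contains no quote characters.
function spanTextsByClass(DOMXPath $xpath, string $class): array
{
    // Pad the class attribute with spaces so whole tokens are compared.
    $query = sprintf(
        "//span[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        return [];
    }
    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Shortened sample markup; the first span has a second, made-up class.
$markup = '<p class="Heading1-P"><span class="Heading1-H extra">Chapter 1</span></p>'
        . '<p class="Normal-P"><span class="Normal-H">This is chapter 1</span></p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$heading = spanTextsByClass($xpath, 'Heading1-H');
$content = spanTextsByClass($xpath, 'Normal-H');
print_r($heading);
print_r($content);
```

Using textContent with trim() returns one string per span even if the span contains nested markup, which may or may not be what you want.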

Answered by Paul Denisevich

Take a look at PHP Simple HTML DOM Parser.

It has a jQuery-like syntax, so you can easily select any element you want by ID or class.

// include/require the simple html dom parser file

$html_string = '
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>';
$html = str_get_html($html_string);

// Initialize both arrays so they exist even when nothing matches.
$heading = array();
$content = array();
foreach ($html->find('span') as $element) {
    if ($element->class === 'Heading1-H') {
        $heading[] = $element->innertext;
    } elseif ($element->class === 'Normal-H') {
        $content[] = $element->innertext;
    }
}

Answered by Greeso

One option is to use DOMDocument and DOMXPath. They take some time to learn, but once you do, you will be pretty happy with what you can achieve.

Read the following on php.net:

http://php.net/manual/en/class.domdocument.php

http://php.net/manual/en/class.domxpath.php

Hope this helps.
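As a starting point with those two classes, here is a minimal sketch that keeps each heading paired with the paragraph that follows it. It is only one possible approach: the variable $chapters is made up for illustration, and the sketch assumes every Heading1-P paragraph is immediately followed by its Normal-P paragraph, as in the question's markup.

```php
<?php
// Shortened copy of the question's markup.
$markup = '<p class="Heading1-P"><span class="Heading1-H">Chapter 1</span></p>
<p class="Normal-P"><span class="Normal-H">This is chapter 1</span></p>
<p class="Heading1-P"><span class="Heading1-H">Chapter 2</span></p>
<p class="Normal-P"><span class="Normal-H">This is chapter 2</span></p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$chapters = [];
foreach ($xpath->query("//p[@class='Heading1-P']") as $headingP) {
    // The second argument makes the query relative to $headingP:
    // take the first following <p> sibling, but only if it is a Normal-P.
    $next = $xpath->query("following-sibling::p[1][@class='Normal-P']", $headingP);
    $title = trim($headingP->textContent);
    $chapters[$title] = $next->length > 0 ? trim($next->item(0)->textContent) : null;
}
print_r($chapters);
```

The two flat arrays from the question can be recovered with array_keys($chapters) and array_values($chapters).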

Answered by jfraber

// Requires PHP Simple HTML DOM Parser; file_get_html() is not built into PHP.

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';
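For comparison, the same two loops can be sketched with PHP's built-in DOM extension, without a third-party library. The markup string below is made up so the example runs without network access, and the variable names $images and $links are illustrative only.

```php
<?php
// Made-up markup so the example runs without fetching a URL.
$markup = '<a href="/home">Home</a><img src="logo.png" alt="Logo"><a href="/about">About</a>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

// Find all images and collect their src attributes.
$images = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}

// Find all links and collect their href attributes.
$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($images);
print_r($links);
```

To parse a file instead of a string, DOMDocument::loadHTMLFile() takes a path in place of loadHTML().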