php 如何使用dom php解析器

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/960841/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-25 00:29:01  来源:igfitidea点击:

how to use dom php parser

phpdomhtml-parsing

提问by chris

I'm new to DOM parsing in PHP:
I have a HTML file that I'm trying to parse. It has a bunch of DIVs like this:

我是 PHP 中 DOM 解析的新手:
我有一个要解析的 HTML 文件。它有一堆像这样的 DIV:

<div id="interestingbox"> 
   <div id="interestingdetails" class="txtnormal">
        <div>Content1</div>
        <div>Content2</div>
   </div>
</div>

<div id="interestingbox"> 
......

I'm trying to get the contents of the many div boxes using php. How can I use the DOM parser to do this?

我正在尝试使用 php 获取许多 div 框的内容。我怎样才能使用 DOM 解析器来做到这一点?

Thanks!

谢谢!

回答by apelliciari

First i have to tell you that you can't use the same id on two different divs; there are classes for that point. Every element should have an unique id.

首先我必须告诉你,你不能在两个不同的 div 上使用相同的 id;有针对这一点的课程。每个元素都应该有一个唯一的 id。

Code to get the contents of the div with id="interestingbox"

获取 id="interestingbox" div 内容的代码

$html = '
<html>
<head></head>
<body>
<div id="interestingbox"> 
   <div id="interestingdetails" class="txtnormal">
        <div>Content1</div>
        <div>Content2</div>
   </div>
</div>

<div id="interestingbox2"><a href="#">a link</a></div>
</body>
</html>';


$dom_document = new DOMDocument();

$dom_document->loadHTML($html);

//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);

// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("*/div[@id='interestingbox']");

if (!is_null($elements)) {

  foreach ($elements as $element) {
    echo "\n[". $element->nodeName. "]";

    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
      echo $node->nodeValue. "\n";
    }

  }
}

//OUTPUT
[div]  {
        Content1
        Content2
}

Example with classes:

类示例:

$html = '
<html>
<head></head>
<body>
<div class="interestingbox"> 
   <div id="interestingdetails" class="txtnormal">
        <div>Content1</div>
        <div>Content2</div>
   </div>
</div>

<div class="interestingbox"><a href="#">a link</a></div>
</body>
</html>';

//the same as before.. just change the xpath

[...]

$elements = $dom_xpath->query("*/div[@class='interestingbox']");

[...]

//OUTPUT
[div]  {
        Content1
        Content2
}

[div]  {
a link
}

Refer to the DOMXPathpage for more details.

有关更多详细信息,请参阅DOMXPath页面。

回答by chris

I got this to work using simplehtmldomas a start:

我使用simplehtmldom作为开始让它工作:

$html = file_get_html('example.com');
foreach ($html->find('div[id=interestingbox]') as $result)
{
    echo $result->innertext;
}

回答by giorgio79

Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue

来自http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue 的非常好的功能

function innerXML($node) 

{ 

    $doc  = $node->ownerDocument; 

    $frag = $doc->createDocumentFragment(); 

    foreach ($node->childNodes as $child) 

    { 

        $frag->appendChild($child->cloneNode(TRUE)); 

    } 

    return $doc->saveXML($frag); 

}  


$dom = new DOMDocument(); 

$dom->loadXML(' 

<html> 

<body> 

<table> 

<tr> 

    <td id="foo">  

        The first bit of Data I want 

        <br />The second bit of Data I want 

        <br />The third bit of Data I want 

    </td> 

</tr> 

</table> 

<body> 

<html> 



'); 

$xpath = new DOMXPath($dom); 

$node = $xpath->evaluate("/html/body//td[@id='foo' ]"); 

$dataString = innerXML($node->item(0)); 
$dataArr = explode("<br />", $dataString); 

$dataUno = $dataArr[0]; 
$dataDos = $dataArr[1]; 
$dataTres = $dataArr[2]; 

echo "firstdata = $nameUno<br />seconddata = $nameDos<br />thirddata = $nameTres<br />"  

回答by Oleksandr Knyga

WebExtractor: https://github.com/knyga/webextractorIt can parse page with css, regex, xpath selectors.

WebExtractor:https: //github.com/knyga/webextractor它可以使用 css、regex、xpath 选择器解析页面。

Look package and tests for examples:

查看包和测试示例:

use WebExtractor\DataExtractor\DataExtractorFactory; use WebExtractor\DataExtractor\DataExtractorTypes; use WebExtractor\Client\Client;

$factory = DataExtractorFactory::getFactory(); $extractor = $factory->createDataExtractor(DataExtractorTypes::CSS); $client = new Client; $content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics'); $extractor->setContent($content); $h1 = $extractor->setSelector('h1')->extract();

使用 WebExtractor\DataExtractor\DataExtractorFactory; 使用 WebExtractor\DataExtractor\DataExtractorTypes; 使用 WebExtractor\Client\Client;

$factory = DataExtractorFactory::getFactory(); $extractor = $factory->createDataExtractor(DataExtractorTypes::CSS); $client = 新客户;$content = $client->get(' https://en.wikipedia.org/wiki/2014_Winter_Olympics'); $extractor->setContent($content); $h1 = $extractor->setSelector('h1')->extract();