Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/13718500/

Time: 2020-08-25 05:56:49  Source: igfitidea

Using Xpath with PHP to parse HTML

php xpath

Asked by VixenSoul

I'm currently trying to parse some data from a forum. Here is the code:

$xml = simplexml_load_file('https://forums.eveonline.com');

$names = $xml->xpath("html/body/div/div/form/div/div/div/div/div[*]/div/div/table//tr/td[@class='topicViews']");
foreach($names as $name) 
{
    echo $name . "<br/>";
}

Anyway, the problem is that I'm using a Google XPath extension to help me get the path, and I'm guessing that Google is changing the HTML enough that the path doesn't match when I run this search from my website. Is there some way I can make the host look at the site through Google Chrome so that it gets the right code? What would you suggest?

Thanks!

Answered by Sherif

My suggestion is to always use DOMDocument as opposed to SimpleXML, since it's a much nicer interface to work with and makes tasks a lot more intuitive.

The following example shows you how to load the HTML into the DOMDocument object and query the DOM using XPath. All you really need to do is find all td elements with a class name of topicViews, and this will output each of the nodeValue members found in the DOMNodeList returned by this XPath query.

/* Use internal libxml errors -- turn on in production, off for debugging */
libxml_use_internal_errors(true);
/* Create a new DOMDocument object */
$dom = new DOMDocument;
/* Load the HTML */
$dom->loadHTMLFile("https://forums.eveonline.com");
/* Create a new XPath object */
$xpath = new DOMXPath($dom);
/* Query all <td> nodes containing specified class name */
$nodes = $xpath->query("//td[@class='topicViews']");
/* Set HTTP response header to plain text for debugging output */
header("Content-type: text/plain");
/* Traverse the DOMNodeList object to output each DOMNode's nodeValue */
foreach ($nodes as $i => $node) {
    echo "Node($i): ", $node->nodeValue, "\n";
}
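
To try the same query without hitting the network, here is a minimal sketch that runs it against an inline HTML string. The markup in `$html` is made up for illustration and is not the forum's real structure. It also assumes a cell may carry more than one class, which is why it uses the common `contains(concat(...))` idiom rather than the exact `@class='topicViews'` match from the answer above.

```php
<?php
// Hypothetical markup for illustration only; not the real forum HTML.
$html = <<<'HTML'
<html><body>
<table>
  <tr><td class="topicTitle">First topic</td><td class="topicViews">120</td></tr>
  <tr><td class="topicTitle">Second topic</td><td class="topicViews hot">4503</td></tr>
</table>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);   // loadHTML() parses a string; loadHTMLFile() takes a path or URL
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Match "topicViews" as one whole class token, even when other classes are present.
$query = "//td[contains(concat(' ', normalize-space(@class), ' '), ' topicViews ')]";

// Gather the text of every matching cell.
$views = [];
foreach ($xpath->query($query) as $node) {
    $views[] = trim($node->nodeValue);
}

print_r($views);
```

The exact-match form `@class='topicViews'` would skip the second cell here, because its class attribute is `topicViews hot`.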

Answered by Damien Overeem

A double slash ('//') makes XPath search all descendants at any depth. So if you use the XPath '//table', you get all tables in the document. You can also use it deeper in your XPath structure, like 'html/body/div/div/form//table', to get all tables under the XPath 'html/body/div/div/form'.

This way you can make your code a bit more resilient against changes in the html source.
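
A rough illustration of that resilience, using two made-up versions of a page. The `$before` and `$after` strings and the `countMatches()` helper are hypothetical and are not taken from the forum. The absolute path depends on every intermediate element, while the `//` query only depends on the target cell.

```php
<?php
// Hypothetical helper: count the nodes an XPath query matches in an HTML string.
function countMatches(string $html, string $query): int
{
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    return $xpath->query($query)->length;
}

// Two made-up versions of the same page; the second adds a wrapper <div>.
$before = '<html><body><div id="content"><table><tr>'
        . '<td class="topicViews">42</td></tr></table></div></body></html>';
$after  = '<html><body><div id="wrap"><div id="content"><table><tr>'
        . '<td class="topicViews">42</td></tr></table></div></div></body></html>';

$absolute = "/html/body/div/table/tr/td[@class='topicViews']";
$relative = "//td[@class='topicViews']";

// Run both queries against both versions of the page.
foreach (['before' => $before, 'after' => $after] as $label => $html) {
    echo $label, ': absolute=', countMatches($html, $absolute),
         ' relative=', countMatches($html, $relative), "\n";
}
```

With these two strings, the absolute path should stop matching once the wrapper is added, while the `//` query should still find the cell.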

I do suggest learning a little about xpath if you want to use it. Copy paste only gets you so far.

A simple explanation about the syntax can be found at w3schools.com/xml/xpath_syntax.asp
