Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/3577641/

How do you parse and process HTML/XML in PHP?

Tags: php, xml, parsing, xml-parsing, html-parsing

Asked by RobertPitt

How can one parse HTML/XML and extract information from it?

Accepted answer by Gordon

Native XML Extensions

I prefer using one of the native XML extensions, since they come bundled with PHP, are usually faster than all the 3rd-party libs, and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.

A basic usage example can be found in "Grabbing the href attribute of an A element", and a general conceptual overview can be found at "DOMDocument in php".

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.
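As a minimal sketch of the DOM plus XPath approach described above (the markup and the variable names are made up for illustration and are not taken from the linked answers), collecting every href from slightly broken HTML could look like this:

```php
<?php
// Illustrative markup: the unclosed <p> and the bare <br> mimic real-world HTML.
$html = '<html><body><p>Links:<br><a href="/one">One</a> <a href="/two" class="x">Two</a></body></html>';

// Collect libxml warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);          // uses libxml's HTML parser, not the XML parser

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query all <a> elements that carry an href attribute.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a[@href]') as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

print_r($hrefs);
```

The same DOMXPath object can be reused for further queries against the loaded document.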

XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are that using XMLReader for parsing broken HTML is less robust than using DOM, where you can explicitly tell it to use libxml's HTML Parser Module.

A basic usage example can be found at "getting all values from h1 tags using php".
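To illustrate the cursor-style pull parsing described above, here is a small sketch. It assumes well-formed input (the sample string is invented), since XMLReader goes through the XML parser rather than the HTML one:

```php
<?php
// Invented, well-formed sample document.
$xml = '<doc><h1>First</h1><p>text</p><h1>Second</h1></doc>';

$reader = new XMLReader();
$reader->XML($xml);             // read from a string; open() would read a file or URL

$headings = [];
// read() advances the cursor to the next node until the stream is exhausted.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'h1') {
        // readString() returns the text content of the current element.
        $headings[] = $reader->readString();
    }
}
$reader->close();

print_r($headings);
```

Because only the current node is held at any time, this style tends to suit large documents better than building a full tree.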

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml, and implements a SAX-style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
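A rough sketch of the push (SAX-style) model with this extension follows. The sample XML and the choice to collect title elements are assumptions made for illustration; the point is that you register handlers and the parser calls them as it encounters events:

```php
<?php
// Invented sample input.
$xml = '<feed><title>A</title><item><title>B</title></item></feed>';

$titles  = [];
$inTitle = false;   // true while the parser is inside a <title> element
$buffer  = '';      // accumulates character data for the current <title>

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, $attrs) use (&$inTitle, &$buffer) {
        if ($name === 'title') {
            $inTitle = true;
            $buffer  = '';
        }
    },
    // Called for every closing tag.
    function ($parser, $name) use (&$inTitle, &$buffer, &$titles) {
        if ($name === 'title') {
            $titles[] = $buffer;
            $inTitle  = false;
        }
    }
);

// Character data may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler($parser, function ($parser, $data) use (&$inTitle, &$buffer) {
    if ($inTitle) {
        $buffer .= $data;
    }
});

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($titles);
```

The state tracking ($inTitle, $buffer) is what makes this style more laborious than XMLReader: the handlers see isolated events and have to remember context themselves.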

SimpleXML

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML because it will choke.

A basic usage example can be found at "A simple program to CRUD node and node values of xml file", and there are lots of additional examples in the PHP Manual.
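For a quick feel of the property and array style access mentioned above, here is a minimal sketch on an invented, well-formed document (element and attribute names are assumptions):

```php
<?php
// Invented, well-formed sample document.
$xml = '<catalog>'
     . '<book id="b1"><title>PHP Basics</title></book>'
     . '<book id="b2"><title>XML in Depth</title></book>'
     . '</catalog>';

$catalog = simplexml_load_string($xml);
if ($catalog === false) {
    // simplexml_load_string() returns false when the input is not well-formed.
    throw new RuntimeException('Input is not well-formed XML');
}

$titles = [];
foreach ($catalog->book as $book) {
    // Child elements are read as properties, attributes with array syntax.
    // Cast to string, because both return SimpleXMLElement objects.
    $titles[(string) $book['id']] = (string) $book->title;
}

print_r($titles);
```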



3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDOM - Repo

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery (not updated for years)

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides additional Command Line Interface (CLI).

Also see: https://github.com/electrolinux/phpquery

Zend_Dom

Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.



3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

PHP Simple HTML DOM Parser

  • An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.

PHP Html Parser

PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assiste in the development of tools which require a quick, easy way to scrap html, whether it's valid or not! This project was original supported by sunra/php-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.

Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.

Ganon

  • A universal tokenizer and HTML/XML/RSS DOM Parser
    • Ability to manipulate elements and their attributes
    • Supports invalid HTML and UTF8
  • Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)
  • A HTML beautifier (like HTML Tidy)
    • Minify CSS and Javascript
    • Sort attributes, change character case, correct indentation, etc.
  • Extensible
    • Parsing documents using callbacks based on current character/token
    • Operations separated in smaller functions for easy overriding
  • Fast and Easy

Never used it. Can't tell if it's any good.



HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser, like

html5lib

Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled "How-To for html 5 parsing" that is worth checking out.



WebServices

If you don't feel like programming PHP, you can also use Web services. In general, I found very little utility for these, but that's just me and my use cases.

ScraperWiki

ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.



Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.
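To make the brittleness concrete, here is a small sketch. The pattern and both markup variants are invented for illustration: a naive regex matches the original anchor but misses an equivalent one with an extra attribute and single quotes, while the DOM extension reads both the same way.

```php
<?php
// A typical "works on my page" pattern: expects href to be the first
// attribute and to use double quotes.
$pattern = '/<a href="([^"]+)">/';

$original = '<a href="/page">Page</a>';
// Same link, but with an added class attribute, extra whitespace and single quotes.
$changed  = '<a class="nav"  href=\'/page\'>Page</a>';

$matchesOriginal = preg_match($pattern, $original);
$matchesChanged  = preg_match($pattern, $changed);

// The DOM parser does not care about attribute order or quoting style.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($changed);
libxml_clear_errors();
$href = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');

var_dump($matchesOriginal, $matchesChanged, $href);
```

A more defensive pattern could handle this particular variation, but each new variation needs another round of regex changes.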

HTML parsers already know the syntactical rules of HTML. With regular expressions, you have to teach those rules to each new RegEx you write. RegEx are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.

Also see Parsing Html The Cthulhu Way



Books

If you want to spend some money, have a look at

I am not affiliated with PHP Architect or the authors.

Answer by Naveed

Try Simple HTML DOM Parser

  • An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.
  • Download



Examples:

How to get HTML elements:

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';



How to modify HTML elements:

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html;



Extract content from HTML:

// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;



Scraping Slashdot:

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Collect one entry per article block
$articles = array();
foreach($html->find('div.article') as $article) {
    $item = array();
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

Answer by Edward Z. Yang

Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast, and contrary to popular belief, does not choke on malformed HTML.
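A small sketch of what that looks like with deliberately malformed input (the markup is invented; the tags are left unclosed on purpose). The parser reports problems, which are collected here rather than printed, and still builds a usable tree:

```php
<?php
// Deliberately malformed: neither <p> is closed, and <b> is still open at </div>.
$broken = '<div><p>First paragraph<p>Second <b>bold</div>';

libxml_use_internal_errors(true);   // keep parser complaints out of the output
$dom    = new DOMDocument();
$loaded = $dom->loadHTML($broken);
$errors = libxml_get_errors();      // inspect these if you care what was repaired
libxml_clear_errors();

$paragraphs = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

var_dump($loaded);
print_r($paragraphs);
```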

Answer by mario

Why you shouldn't, and when you should, use regular expressions

First off, a common misnomer: Regexps are not for "parsing" HTML. Regexes can however "extract" data. Extracting is what they're made for. The major drawback of regex HTML extraction compared with proper SGML toolkits or baseline XML parsers is its syntactic effort and varying reliability.

Consider that making a somewhat dependable HTML extraction regex:

<a\s+class="?playbutton\d?[^>]+id="(\d+)".+?    <a\s+class="[\w\s]*title
[\w\s]*"[^>]+href="(http://[^">]+)"[^>]*>([^<>]+)</a>.+?

is way less readable than a simple phpQuery or QueryPath equivalent:

$div->find(".stationcool a")->attr("title");

There are however specific use cases where they can help.

  • Many DOM traversal frontends don't reveal HTML comments <!--, which however are sometimes the more useful anchors for extraction. In particular pseudo-HTML variations <$var> or SGML residues are easy to tame with regexps.
  • Oftentimes regular expressions can save post-processing. However HTML entities often require manual caretaking.
  • And lastly, for extremely simple tasks like extracting <img src= urls, they are in fact a suitable tool. The speed advantage over SGML/XML parsers mostly just comes to play for these very basic extraction procedures.

It's sometimes even advisable to pre-extract a snippet of HTML using a regular expression such as /<!--CONTENT-->(.+?)<!--END-->/ and process the remainder using the simpler HTML parser frontends.
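A sketch of that two-step approach, assuming the page carries such comment markers (the sample page is invented, and the /s modifier is my addition so the dot also matches newlines). The parsing step uses the plain DOM extension instead of a jQuery-style frontend to keep the example dependency-free:

```php
<?php
// Invented page with navigation, a marked content block, and a footer.
$page = '<html><body><div id="nav"><a href="/home">Home</a></div>'
      . '<!--CONTENT--><p>Hello <a href="/story">story</a></p><!--END-->'
      . '<div id="footer"><a href="/imprint">Imprint</a></div></body></html>';

$links = [];
// Step 1: cut out the interesting fragment with a regex.
if (preg_match('/<!--CONTENT-->(.+?)<!--END-->/s', $page, $m)) {
    // Step 2: hand only that fragment to a real parser.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($m[1]);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
}

print_r($links);
```

Links in the navigation and footer are never seen by the parser, which is the point of pre-extracting.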

Note: I actually have this app, where I employ XML parsing and regular expressions alternatively. Just last week the PyQuery parsing broke, and the regex still worked. Yes weird, and I can't explain it myself. But so it happened.
So please don't vote real-world considerations down just because they don't match the regex=evil meme. But let's also not vote this up too much. It's just a sidenote for this topic.

Answer by mario

phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.

Examples for QueryPath

Basically you first create a queryable DOM tree from an HTML string:

 $qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL

The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:

 $qp->find("div.classname")->children()->...;

 foreach ($qp->find("p img") as $img) {
     print qp($img)->attr("src");
 }

Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which sometimes are faster. Also, typical jQuery methods like ->children() and ->text() and particularly ->attr() simplify extracting the right HTML snippets. (And already have their SGML entities decoded.)

 $qp->xpath("//div/p[1]");  // get first paragraph in a div

QueryPath also allows injecting new tags into the stream (->append), and later output and prettify an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).

 $qp->find("a[target=_blank]")->toggleClass("usability-blunder");

phpQuery or QueryPath?

Generally QueryPath is better suited for manipulation of documents, while phpQuery also implements some pseudo AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because of fewer overall features).

For further information on the differences see this comparison on the wayback machine from tagbyte.org. (Original source went missing, so here's an internet archive link. Yes, you can still locate missing pages, people.)

And here's a comprehensive QueryPath introduction.

Advantages

  • Simplicity and Reliability
  • Simple to use alternatives ->find("a img, a object, div a")
  • Proper data unescaping (in comparison to regular expression grepping)

Answer by Robert Elwell

Simple HTML DOM is a great open-source parser:

simplehtmldom.sourceforge

It treats DOM elements in an object-oriented way, and the new iteration has a lot of coverage for non-compliant code. There are also some great functions like you'd see in JavaScript, such as the "find" function, which will return all instances of elements of that tag name.

I've used this in a number of tools, testing it on many different types of web pages, and I think it works great.

Answer by Eli

One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it.

But to your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ (a modified version of the Readability algorithm, which is designed to extract just the textual content, not headers and footers, from a page).

Answer by Timo Haberkern

For 1a and 2: I would vote for the new Symfony Component class DOMCrawler (DomCrawler). This class allows queries similar to CSS selectors. Take a look at this presentation for real-world examples: news-of-the-symfony2-world.

The component is designed to work standalone and can be used without Symfony.

The only drawback is that it will only work with PHP 5.3 or newer.

Answer by Joel Verhagen

This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.

Answer by jancha

We have created quite a few crawlers for our needs before. At the end of the day, it is usually simple regular expressions that do the job best. While the libraries listed above are good for the purposes they were created for, if you know what you are looking for, regular expressions are a safer way to go, as you can also handle non-valid HTML/XHTML structures, which would fail if loaded via most of the parsers.