HTML Scraping in PHP

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/34120/

Date: 2020-08-24 21:14:53  Source: igfitidea

HTML Scraping in PHP

php html screen-scraping

Asked by tsellon

I've been doing some HTML scraping in PHP using regular expressions. This works, but the result is finicky and fragile. Has anyone used any packages that provide a more robust solution? A config-driven solution would be ideal, but I'm not picky.

Accepted answer by Espo

I would recommend PHP Simple HTML DOM Parser after you have scraped the HTML from the page. It supports invalid HTML, and provides a very easy way to handle HTML elements.

Answer by John Douthat

If the page you're scraping is valid X(HT)ML, then any of PHP's built-in XML parsers will do.
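
As a rough illustration of the built-in route, here is a minimal sketch using DOMDocument and DOMXPath from PHP's DOM extension. The sample markup and the extractLinks() helper are invented for this example, and the strict loadXML() call assumes the input really is well-formed XHTML.

```php
<?php
// Sample input: a tiny well-formed XHTML document (invented for illustration).
$xhtml = <<<'XML'
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample page</title></head>
  <body>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
  </body>
</html>
XML;

/**
 * Return [href => link text] for every <a href> in a well-formed XHTML string.
 * extractLinks() is a hypothetical helper, not part of PHP.
 */
function extractLinks(string $xhtml): array
{
    $doc = new DOMDocument();
    // loadXML() is strict: it returns false on markup that is not well-formed.
    if (!$doc->loadXML($xhtml)) {
        throw new RuntimeException('Input is not well-formed XML');
    }

    $xpath = new DOMXPath($doc);
    // XHTML elements live in a namespace, so XPath needs a prefix bound to it.
    $xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

    $links = [];
    foreach ($xpath->query('//x:a[@href]') as $a) {
        $links[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $links;
}

print_r(extractLinks($xhtml));
```

On real-world pages that are not well-formed, loadXML() will reject the input, which is where the more tolerant HTML parsers mentioned in the other answers come in.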

I haven't had much success with PHP libraries for scraping. If you're adventurous though, you can try simplehtmldom. I'd recommend Hpricot for Ruby or Beautiful Soup for Python, which are both excellent parsers for HTML.

Answer by BlaM

I had some fun working with htmlSQL, which is not so much a high-end solution, but is really simple to work with.

Answer by BlaM

I would also recommend 'Simple HTML DOM Parser.' It is a good option, particularly if you're familiar with jQuery or JavaScript selectors; you will find yourself at home.

I have even blogged about it in the past.

Answer by datasn.io

Using PHP for HTML scraping, I'd recommend cURL + regexp or cURL + a DOM parser, though I personally use cURL + regexp. If you have a good command of regexp, it's actually more accurate sometimes.
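
To make the two options concrete, here is a minimal sketch that puts them side by side. The helper names (fetchHtml, titlesByRegex, titlesByDom), the sample markup, and the "title" class are all assumptions made up for this example. The DOM variant uses PHP's built-in DOMDocument::loadHTML rather than any particular third-party parser. The demo runs on a local string so it works without network access.

```php
<?php
/**
 * Fetch a page body with cURL. fetchHtml() is a hypothetical helper name.
 */
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

/** Regexp variant: tied to the exact markup written in the pattern. */
function titlesByRegex(string $html): array
{
    preg_match_all('~<h2 class="title">(.*?)</h2>~s', $html, $m);
    return array_map('trim', $m[1]);
}

/** DOM variant: parses the markup, then selects h2 elements carrying the class. */
function titlesByDom(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//h2[contains(concat(" ", normalize-space(@class), " "), " title ")]';
    $titles = [];
    foreach ($xpath->query($query) as $h2) {
        $titles[] = trim($h2->textContent);
    }
    return $titles;
}

// Stand-in for fetchHtml('http://example.com/'), with deliberately sloppy markup.
$html = '<html><body>
<h2 class="title">First post</h2>
<h2 class="title featured">Second <em>post</em></h2>
<p>Unclosed paragraph
</body></html>';

print_r(titlesByRegex($html)); // only the h2 whose class attribute is exactly "title"
print_r(titlesByDom($html));   // both h2 elements, with inner tags stripped
```

Which variant is "more accurate" depends on the page: the regexp returns exactly what the pattern describes and nothing else, while the DOM query keeps working when attributes or nested tags change.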

Answer by Steve

I had to use curl on my host 1and1.

http://www.quickscrape.com/ is what I came up with using the Simple DOM class!

Answer by Jan Gorman

I've had very good results with the Simple HTML DOM Parser mentioned above as well. And then there's the tidy extension for PHP, which works really well too.
