Web scraping in PHP
Note: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/9813273/
Asked by federico-t
I'm looking for a way to make a small preview of another page from a URL given by the user in PHP.
I'd like to retrieve only the title of the page, an image (like the logo of the website) and a bit of text or a description if it's available. Is there any simple way to do this without any external libraries/classes? Thanks
So far I've tried using the DOMDocument class, loading the HTML and displaying it on the screen, but I don't think that's the proper way to do it.
Answered by Jordan Mack
I recommend you consider simple_html_dom for this. It will make it very easy.
Here is a working example of how to pull the title and the first image.
<?php
require 'simple_html_dom.php';
$html = file_get_html('http://www.google.com/');
$title = $html->find('title', 0);
$image = $html->find('img', 0);
echo $title->plaintext."<br>\n";
echo $image->src;
?>
Here is a second example that will do the same without an external library. I should note that using regex on HTML is NOT a good idea.
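<?php
// Fetch the raw HTML; file_get_contents() returns false on failure.
$data = file_get_contents('http://www.google.com/');
if ($data === false) {
    exit("Could not fetch the page\n");
}
// Only read $matches when the pattern actually matched; fall back to an empty string.
$title = preg_match('/<title[^>]*>([^<]+)<\/title>/i', $data, $matches) ? $matches[1] : '';
$img = preg_match('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches) ? $matches[1] : '';
echo $title."<br>\n";
echo $img;
?>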
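Since the question asks for a solution without external libraries and regex on HTML is fragile, the same idea can also be sketched with PHP's built-in DOMDocument and DOMXPath classes. This is only a minimal sketch, not part of the original answer: extractPreview() is a hypothetical helper name, it assumes the dom extension is enabled, and it works on an HTML string so that fetching the page stays separate.

```php
<?php
/**
 * Extract a small preview (title, meta description, first image) from HTML.
 * extractPreview() is a hypothetical helper, not part of any library.
 */
function extractPreview(string $html): array
{
    $preview = ['title' => null, 'description' => null, 'image' => null];

    // loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return $preview;
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // First <title> element, if any.
    $titleNode = $doc->getElementsByTagName('title')->item(0);
    if ($titleNode !== null) {
        $preview['title'] = trim($titleNode->textContent);
    }

    // content attribute of <meta name="description">, if present.
    $xpath = new DOMXPath($doc);
    $metaAttr = $xpath->query('//meta[@name="description"]/@content')->item(0);
    if ($metaAttr !== null) {
        $preview['description'] = trim($metaAttr->nodeValue);
    }

    // src attribute of the first <img> element, exactly as written in the page.
    $imgNode = $doc->getElementsByTagName('img')->item(0);
    if ($imgNode instanceof DOMElement) {
        $preview['image'] = $imgNode->getAttribute('src');
    }

    return $preview;
}
```

To use it on a live page, pass it the string returned by file_get_contents() after checking that the fetch did not return false. The image path is returned as it appears in the markup, so a relative src would still need to be resolved against the page URL, and pages with non-ASCII text may need extra encoding handling.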
Answered by Behrad Khodayar
You may use any of these libraries. Each one has its pros and cons, so you may read up on each of them or take the time to try them yourself:
- Guzzle: A standalone HTTP client, so you don't have to work with cURL directly.
- Goutte: Built on Guzzle and several Symfony components by a Symfony developer.
- hQuery: A fast scraper with caching capabilities; performs well on large documents.
- Requests: Known for its user-friendly API.
- Buzz: A lightweight client, well suited to beginners.
- ReactPHP: An asynchronous library that can be used for scraping, with comprehensive tutorials and examples.
It's worth checking them all and using each one where it fits best.
Answered by Vijay Sharma
You can use SimpleHtmlDom for this, and then look for the title and img tags or whatever else you need.
Answered by forsberg
I like the DomCrawler library. It is very easy to use and has lots of options, like:
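use Symfony\Component\DomCrawler\Crawler;

// $html is assumed to hold the page markup; filter() also needs the symfony/css-selector package.
$crawler = new Crawler($html);
$crawler = $crawler
    ->filter('body > p')
    ->reduce(function (Crawler $node, $i) {
        // keeps only the even-indexed nodes, i.e. filters out every other node
        return ($i % 2) == 0;
    });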