Scraping web pages with PHP

Question

Asked by federico-t

I'm looking for a way to make a small preview of another page from a URL given by the user in PHP.

I'd like to retrieve only the title of the page, an image (like the logo of the website) and a bit of text or a description if it's available. Is there any simple way to do this without any external libraries/classes? Thanks

So far I've tried using the DOMDocument class, loading the HTML and displaying it on the screen, but I don't think that's the proper way to do it.

Answer 1

Answered by Jordan Mack

I recommend you consider simple_html_dom for this. It makes the job very easy.

Here is a working example that pulls the title and the first image.

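<?php
require 'simple_html_dom.php';

// Stop early if the page could not be loaded.
$html = file_get_html('http://www.google.com/');
if (!$html) {
    exit("Could not load the page.\n");
}

// find($selector, 0) returns the first match, or null when there is none.
$title = $html->find('title', 0);
$image = $html->find('img', 0);

echo ($title ? $title->plaintext : '(no title)')."<br>\n";
echo $image ? $image->src : '(no image)';
?>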

Here is a second example that will do the same without an external library. I should note that using regex on HTML is NOT a good idea.

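<?php
$data = file_get_contents('http://www.google.com/');
if ($data === false) {
    exit("Could not load the page.\n");
}

// Allow attributes on <title>, and read $matches only when the pattern matched.
$title = preg_match('/<title[^>]*>([^<]+)<\/title>/i', $data, $matches) ? $matches[1] : '';

// First <img> tag with a quoted src attribute.
$img = preg_match('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches) ? $matches[1] : '';

echo $title."<br>\n";
echo $img;
?>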
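
Since the question asks for a way without external libraries, the same extraction can also be sketched with PHP's built-in DOMDocument and DOMXPath classes instead of regex. This is only a minimal sketch: the function name extractPreview() and the array keys are hypothetical, and preferring an Open Graph og:image tag over the first img tag is an assumption about what makes a good preview image, not something stated in the question.

```php
<?php
// Hypothetical helper: returns title, description and image for a preview,
// with null for anything that could not be found.
function extractPreview(string $html): array
{
    $preview = ['title' => null, 'description' => null, 'image' => null];
    if (trim($html) === '') {
        return $preview; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Each key lists XPath expressions in order of preference.
    // string(...) evaluates to '' when nothing matches.
    $queries = [
        'title' => ['string(//title)'],
        'description' => ['string(//meta[@name="description"]/@content)'],
        'image' => [
            'string(//meta[@property="og:image"]/@content)', // assumed preference
            'string(//img/@src)',                            // first <img> otherwise
        ],
    ];

    foreach ($queries as $key => $expressions) {
        foreach ($expressions as $expression) {
            $value = trim($xpath->evaluate($expression));
            if ($value !== '') {
                $preview[$key] = $value;
                break; // keep the first non-empty result for this key
            }
        }
    }

    return $preview;
}
```

Passing the markup returned by file_get_contents() to extractPreview() should give an array with the three keys, each set to null when the page has no matching element. Image paths come back exactly as written in the page, so relative paths still need to be resolved against the page URL.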

Answer 2

Answered by Behrad Khodayar

You can use any of these libraries. Each one has its pros and cons, so read the notes on each one or take the time to try it yourself:

Guzzle: an independent HTTP client, so there is no need to work with cURL, SOAP or REST directly.
Goutte: built on Guzzle and some Symfony components by a Symfony developer.
hQuery: a fast scraper with caching capabilities and high performance on large documents.
Requests: known for its user-friendly API.
Buzz: a lightweight client, ideal for beginners.
ReactPHP: an async scraper, with comprehensive tutorials and examples.

It is worth checking them all and using each one where it fits best.
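
For comparison with the clients listed above, here is a minimal sketch of the fetching step using only built-in PHP functions. Because the URL comes from the user, it first checks that the scheme is http or https. The function names isHttpUrl() and fetchHtml(), the 5-second timeout, the redirect limit and the PreviewBot/1.0 user agent string are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: accept only well-formed http:// and https:// URLs.
function isHttpUrl(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}

// Hypothetical helper: fetch a page body, or return null on any failure.
function fetchHtml(string $url, int $timeoutSeconds = 5): ?string
{
    if (!isHttpUrl($url)) {
        return null; // refuse file://, ftp:// and malformed input
    }

    // The 'http' context options also apply to https:// URLs.
    $context = stream_context_create([
        'http' => [
            'timeout'         => $timeoutSeconds,   // assumed value
            'user_agent'      => 'PreviewBot/1.0',  // assumed value
            'follow_location' => 1,
            'max_redirects'   => 3,                 // assumed value
        ],
    ]);

    // Suppress the warning on failure and signal it through the return value.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

The scheme check only stops a user-supplied URL from reaching local files or other stream wrappers. It does not restrict which hosts can be contacted.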

Answer 3

Answered by Vijay Sharma

You can use SimpleHtmlDom for this, and then look for the title and img tags or whatever else you need.

Answer 4

Answered by forsberg

I like the Dom Crawler library. It is very easy to use and has lots of options, for example:

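use Symfony\Component\DomCrawler\Crawler;

// $html is assumed to hold the page markup as a string.
$crawler = new Crawler($html);

$crawler = $crawler
    ->filter('body > p')
    ->reduce(function (Crawler $node, $i) {
        // keep only the even-indexed nodes, i.e. every other <p>
        return ($i % 2) == 0;
    });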
