How to implement a web scraper in PHP?

Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/26947/


How to implement a web scraper in PHP?

Tags: php, screen-scraping

Asked by Chaz Lever

What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?

Answered by tyshock

Scraping generally involves three steps:

  • first, you GET or POST your request to a specified URL
  • next, you receive the HTML that is returned as the response
  • finally, you parse the text you'd like to scrape out of that HTML

To accomplish steps 1 and 2, below is a simple PHP class that uses cURL to fetch web pages via either GET or POST. After you get the HTML back, you just use regular expressions to accomplish step 3 by parsing out the text you'd like to scrape.

For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial

My favorite program for working with regexes is RegexBuddy. I would advise you to try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for the regexes you build, in your language of choice (including PHP).

Usage:

$curl = new Curl();
$html = $curl->get("http://www.google.com");

// now, do your regex work against $html
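
For step 3, a minimal sketch of the regex work might look like the following; the pattern simply pulls every href value out of the fetched page and is only illustrative:

// extract every link URL from the fetched page (illustrative pattern)
preg_match_all('/<a\s[^>]*href="([^"]+)"/i', $html, $matches);

foreach ($matches[1] as $href) {
    echo $href . "\n";
}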

PHP Class:



<?php

class Curl
{       

    public $cookieJar = "";
    private $curl;   // cURL handle, created by get() / postForm()

    public function __construct($cookieJarFile = 'cookies.txt') {
        $this->cookieJar = $cookieJarFile;
    }

    function setup()
    {


        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] =  "Cache-Control: max-age=0";
        $header[] =  "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.


        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl,CURLOPT_COOKIEJAR, $this->cookieJar); 
        curl_setopt($this->curl,CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl,CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl,CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl,CURLOPT_RETURNTRANSFER, true);  
    }


    function get($url)
    { 
        $this->curl = curl_init($url);
        $this->setup();

        return $this->request();
    }

    function getAll($reg,$str)
    {
        preg_match_all($reg,$str,$matches);
        return $matches[1];
    }

    function postForm($url, $fields, $referer='')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    function getInfo($info)
    {
        $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
        return $info;
    }

    function request()
    {
        return curl_exec($this->curl);
    }
}

?>
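
The same class handles step 1 for POST requests via postForm(). A hedged sketch of calling it (the URL, referer, and field names below are made up for illustration):

$curl = new Curl();

// field names and URLs are hypothetical -- substitute the target form's real ones
$fields = http_build_query(array(
    'username' => 'myuser',
    'password' => 'secret',
));

$html = $curl->postForm("http://www.example.com/login", $fields, "http://www.example.com/");

// then apply your regular expressions to $html as before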

Answered by Salman von Abbas

I recommend Goutte, a simple PHP Web Scraper.

Example Usage:-

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):

use Goutte\Client;

$client = new Client();

Make requests with the request() method:

$crawler = $client->request('GET', 'http://www.symfony-project.org/');

The request() method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

Click on links:

$link = $crawler->selectLink('Plugins')->link();
$crawler = $client->click($link);

Submit forms:

$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array('signin[username]' => 'fabien', 'signin[password]' => 'xxxxxx'));

Extract data:

$nodes = $crawler->filter('.error_list');

if ($nodes->count())
{
  die(sprintf("Authentication error: %s\n", $nodes->text()));
}

printf("Nb tasks: %d\n", $crawler->filter('#nb_tasks')->text());

Answered by Joe Niland

ScraperWiki is a pretty interesting project. It helps you build scrapers online in Python, Ruby or PHP - I was able to get a simple attempt up in a few minutes.

Answered by troelskn

If you need something that is easy to maintain, rather than fast to execute, it could help to use a scriptable browser, such as SimpleTest's.
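
For reference, a minimal sketch with SimpleTest's SimpleBrowser class, assuming SimpleTest is installed and using a placeholder URL:

require_once 'simpletest/browser.php';

$browser = new SimpleBrowser();
$browser->get('http://www.example.com/');

// getContent() returns the raw HTML of the last page fetched
$html = $browser->getContent();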

Answered by PHP Addict

Scraping can be pretty complex, depending on what you want to do. Have a read of this tutorial series on The Basics of Writing a Scraper in PHP and see if you can get to grips with it.

You can use similar methods to automate form sign-ups, logins, even fake clicking on ads! The main limitation of using cURL, though, is that it doesn't support JavaScript, so if you are trying to scrape a site that uses AJAX for pagination, for example, it can become a little tricky... but again, there are ways around that!
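
One common workaround, sketched below, is to skip the JavaScript entirely and call the AJAX endpoint that the page requests behind the scenes (the endpoint URL and parameters here are hypothetical):

// fetch page 2 of the data straight from the (hypothetical) AJAX endpoint
$json  = file_get_contents("http://www.example.com/items?page=2&format=json");
$items = json_decode($json, true);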

Answered by adiian

Here is another one: a simple PHP Scraper without Regex.
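
PHP's built-in DOM extension is one regex-free way to do this; a minimal sketch, where the URL and class name are placeholders:

$html = file_get_contents('http://www.example.com/');

$doc = new DOMDocument();
// suppress warnings triggered by real-world, non-well-formed HTML
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// print the text of every element with class="price" (hypothetical class name)
foreach ($xpath->query('//*[@class="price"]') as $node) {
    echo trim($node->textContent) . "\n";
}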

Answered by Sarfraz

Scraper class from my framework:

<?php

/*
    Example:

    $site = $this->load->cls('scraper', 'http://www.anysite.com');
    $excss = $site->getExternalCSS();
    $incss = $site->getInternalCSS();
    $ids = $site->getIds();
    $classes = $site->getClasses();
    $spans = $site->getSpans(); 

    print '<pre>';
    print_r($excss);
    print_r($incss);
    print_r($ids);
    print_r($classes);
    print_r($spans);        

*/

class scraper
{
    // despite the name, this property holds the raw HTML of the fetched page
    private $url = '';

    public function __construct($url)
    {
        $this->url = file_get_contents($url);
    }

    public function getInternalCSS()
    {
        $tmp = preg_match_all('/(style=")(.*?)(")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getExternalCSS()
    {
        $tmp = preg_match_all('/(href=")(\w.*\.css)"/i', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getIds()
    {
        $tmp = preg_match_all('/(id="(\w*)")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getClasses()
    {
        $tmp = preg_match_all('/(class="(\w*)")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getSpans()
    {
        $tmp = preg_match_all('/(<span>)(.*)(<\/span>)/', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

}
?>
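
Outside of that framework's loader, the class could presumably be instantiated directly, for example:

$site = new scraper('http://www.example.com');

// each getter returns array(list of matches, match count)
list($classes, $count) = $site->getClasses();
print_r($classes);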

Answered by Brian Warshaw

file_get_contents() can take a remote URL and give you the source. You can then use regular expressions (with the Perl-compatible functions) to grab what you need.
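
A minimal sketch of that approach, where the URL and pattern are only examples:

$html = file_get_contents('http://www.example.com/');

// grab the page title with a PCRE function
if (preg_match('/<title>(.*?)<\/title>/is', $html, $m)) {
    echo $m[1];
}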

Out of curiosity, what are you trying to scrape?

Answered by dlamblin

I'd either use libcurl or Perl's LWP (libwww for Perl). Is there a libwww for PHP?

Answered by Peter Stuifzand

The curl library allows you to download web pages. You should look into regular expressions for doing the scraping.
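
A bare-bones fetch with the cURL extension, assuming it is enabled, looks roughly like this:

$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$html = curl_exec($ch);
curl_close($ch);

// scrape with preg_match() / preg_match_all() against $html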
