How do I make a simple crawler in PHP?

Disclaimer: The question and answers below are taken from StackOverflow and are provided under the CC BY-SA 4.0 license. You are free to use or share them, but you must attribute them to the original authors (not me). Original source: http://stackoverflow.com/questions/2313107/

Tags: php, web-crawler

Asked by Kshitij Saxena -KJ-

I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file.

Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.

Answer by hobodave

Meh. Don't parse HTML with regexes.

Here's a DOM version inspired by Tatu's:

<?php
function crawl_page($url, $depth = 5)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= dirname($parts['path'], 1).$path;
            }
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:",$url,PHP_EOL,"CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL;
}
crawl_page("http://hobodave.com", 2);

Edit: I fixed some bugs from Tatu's version (it works with relative URLs now).

Edit: I added a new bit of functionality that prevents it from following the same URL twice.

Edit: Echoing output to STDOUT now, so you can redirect it to whatever file you want.

Edit: Fixed a bug pointed out by George in his answer. Relative URLs are no longer appended to the end of the URL path; they overwrite it. Thanks to George for this. Note that George's answer doesn't account for any of: https, user, pass, or port. If you have the http PECL extension loaded, this is quite simply done using http_build_url. Otherwise, I have to glue it together manually using parse_url. Thanks again, George.

Answer by WonderLand

Here is my implementation, based on the above example/answer.

  1. It is class-based
  2. Uses cURL
  3. Supports HTTP Auth
  4. Skips URLs not belonging to the base domain
  5. Returns the HTTP response code for each page
  6. Returns the response time for each page

CRAWL CLASS:

class crawler
{
    protected $_url;
    protected $_depth;
    protected $_host;
    protected $_useHttpAuth = false;
    protected $_user;
    protected $_pass;
    protected $_seen = array();
    protected $_filter = array();

    public function __construct($url, $depth = 5)
    {
        $this->_url = $url;
        $this->_depth = $depth;
        $parse = parse_url($url);
        $this->_host = $parse['host'];
    }

    protected function _processAnchors($content, $url, $depth)
    {
        $dom = new DOMDocument('1.0');
        @$dom->loadHTML($content);
        $anchors = $dom->getElementsByTagName('a');

        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            // Crawl only links that belong to the start domain
            $this->crawl_page($href, $depth - 1);
        }
    }

    protected function _getContent($url)
    {
        $handle = curl_init($url);
        if ($this->_useHttpAuth) {
            curl_setopt($handle, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
            curl_setopt($handle, CURLOPT_USERPWD, $this->_user . ":" . $this->_pass);
        }
        // following 302 redirects creates problems with authentication
//        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);
        // return the content
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

        /* Get the HTML or whatever is linked in $url. */
        $response = curl_exec($handle);
        // response total time
        $time = curl_getinfo($handle, CURLINFO_TOTAL_TIME);
        /* Check for 404 (file not found). */
        $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);

        curl_close($handle);
        return array($response, $httpCode, $time);
    }

    protected function _printResult($url, $depth, $httpcode, $time)
    {
        ob_end_flush();
        $currentDepth = $this->_depth - $depth;
        $count = count($this->_seen);
        echo "N::$count,CODE::$httpcode,TIME::$time,DEPTH::$currentDepth URL::$url <br>";
        ob_start();
        flush();
    }

    protected function isValid($url, $depth)
    {
        if (strpos($url, $this->_host) === false
            || $depth === 0
            || isset($this->_seen[$url])
        ) {
            return false;
        }
        foreach ($this->_filter as $excludePath) {
            if (strpos($url, $excludePath) !== false) {
                return false;
            }
        }
        return true;
    }

    public function crawl_page($url, $depth)
    {
        if (!$this->isValid($url, $depth)) {
            return;
        }
        // add to the seen URL
        $this->_seen[$url] = true;
        // get Content and Return Code
        list($content, $httpcode, $time) = $this->_getContent($url);
        // print Result for current Page
        $this->_printResult($url, $depth, $httpcode, $time);
        // process subPages
        $this->_processAnchors($content, $url, $depth);
    }

    public function setHttpAuth($user, $pass)
    {
        $this->_useHttpAuth = true;
        $this->_user = $user;
        $this->_pass = $pass;
    }

    public function addFilterPath($path)
    {
        $this->_filter[] = $path;
    }

    public function run()
    {
        $this->crawl_page($this->_url, $this->_depth);
    }
}

USAGE:

// USAGE
$startURL = 'http://YOUR_URL/';
$depth = 6;
$username = 'YOURUSER';
$password = 'YOURPASS';
$crawler = new crawler($startURL, $depth);
$crawler->setHttpAuth($username, $password);
// Exclude paths with the following structure from being processed
$crawler->addFilterPath('customer/account/login/referer');
$crawler->run();

Answer by GeekTantra

Check out PHP Crawler

http://sourceforge.net/projects/php-crawler/

See if it helps.

Answer by Tatu Ulmanen

In its simplest form:

function crawl_page($url, $depth = 5) {
    if($depth > 0) {
        $html = file_get_contents($url);

        preg_match_all('~<a.*?href="(.*?)".*?>~', $html, $matches);

        foreach($matches[1] as $newurl) {
            crawl_page($newurl, $depth - 1);
        }

        file_put_contents('results.txt', $url."\n\n".$html."\n\n", FILE_APPEND);
    }
}

crawl_page('http://www.domain.com/index.php', 5);

That function will get the contents of a page, then crawl all found links and save the contents to 'results.txt'. The function accepts a second parameter, depth, which defines how deep the links should be followed. Pass 1 there if you want to parse only links from the given page.

Answer by Gordon

Why use PHP for this, when you can use wget, e.g.

wget -r -l 1 http://www.example.com
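
If you would rather drive wget from within a PHP script, a minimal sketch might look like the one below. It assumes wget is installed and shell_exec() is not disabled on your server; the start URL and output directory are placeholders.

<?php
// Minimal sketch: shell out to wget instead of re-implementing the crawl in PHP.
// Assumes wget is installed and shell_exec() is allowed on this server.
$start  = 'http://www.example.com/';   // placeholder start URL
$target = '/tmp/crawl-output';         // placeholder download directory

$cmd = sprintf(
    'wget -r -l 1 -P %s %s 2>&1',
    escapeshellarg($target),
    escapeshellarg($start)
);
echo shell_exec($cmd);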

For how to parse the contents, see Best Methods to parse HTML and use the search function for examples. How to parse HTML has been answered multiple times before.

Answer by Team Webgalli

With some small changes to hobodave's code, here is a code snippet you can use to crawl pages. This needs the cURL extension to be enabled on your server.

<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5){
    // remember visited URLs across recursive calls
    static $seen = array();
    if (($depth == 0) or (in_array($url, $seen))) {
        return;
    }
    $seen[] = $url;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);

    if ($result) {
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER);
        foreach ($matches as $match) {
            $href = $match[1];
            if (0 !== strpos($href, 'http')) {
                // resolve relative links against the current page URL
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }
    echo "Crawled {$url}";
}
crawl_page("http://www.sitename.com/", 3);
?>

I have explained this in more detail in this crawler script tutorial.

Answer by George

Hobodave, you were very close. The only thing I have changed is within the if statement that checks whether the href attribute of the found anchor tag begins with 'http'. Instead of simply prepending the $url variable, which would contain the page that was passed in, you must first strip it down to the host, which can be done using the parse_url PHP function.

<?php
function crawl_page($url, $depth = 5)
{
  static $seen = array();
  if (isset($seen[$url]) || $depth === 0) {
    return;
  }

  $seen[$url] = true;

  $dom = new DOMDocument('1.0');
  @$dom->loadHTMLFile($url);

  $anchors = $dom->getElementsByTagName('a');
  foreach ($anchors as $element) {
    $href = $element->getAttribute('href');
    if (0 !== strpos($href, 'http')) {
       /* this is where I changed hobodave's code */
        $host = "http://".parse_url($url,PHP_URL_HOST);
        $href = $host. '/' . ltrim($href, '/');
    }
    crawl_page($href, $depth - 1);
  }

  echo "New Page:<br /> ";
  echo "URL:",$url,PHP_EOL,"<br />","CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL,"  <br /><br />";
}

crawl_page("http://hobodave.com/", 5);
?>

Answer by Jens Roland

As mentioned, there are crawler frameworks out there, all ready for customizing, but if what you're doing is as simple as you mentioned, you could make it from scratch pretty easily.

Scraping the links: http://www.phpro.org/examples/Get-Links-With-DOM.html

Dumping results to a file: http://www.tizag.com/phpT/filewrite.php
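
Putting the two links above together, a minimal sketch could look like the following. The function name, URL, and output file are placeholders, and it only follows absolute links to keep the example short.

<?php
// Minimal sketch combining the two references above: collect anchors from a page
// with DOM and append each linked page's content to a local file.
function dump_links($startUrl, $outFile = 'dump.txt')
{
    $dom = new DOMDocument();
    @$dom->loadHTMLFile($startUrl);      // suppress warnings from sloppy HTML

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            continue;                    // keep it simple: absolute URLs only
        }
        $content = @file_get_contents($href);
        if ($content !== false) {
            file_put_contents($outFile, $href . "\n\n" . $content . "\n\n", FILE_APPEND);
        }
    }
}

dump_links('http://www.example.com/');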

Answer by pasqal

I used @hobodave's code, with this little tweak to prevent re-crawling all fragment variants of the same URL:

<?php
function crawl_page($url, $depth = 5)
{
  $parts = parse_url($url);
  if(array_key_exists('fragment', $parts)){
    unset($parts['fragment']);
    $url = http_build_url($parts);
  }

  static $seen = array();
  ...

Then you can also omit the $parts = parse_url($url); line within the foreach loop.
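
The same normalization can also be written as a small standalone helper. This is only a sketch: http_build_url() needs the http PECL extension, so a plain string fallback is included for the fragment case.

<?php
// Sketch: treat page.html and page.html#section as the same URL by dropping
// the fragment before the $seen check.
function strip_fragment($url)
{
    if (function_exists('http_build_url')) {
        $parts = parse_url($url);
        unset($parts['fragment']);
        return http_build_url($parts);   // PECL http extension
    }
    // Fallback without the extension: cut the string at the '#'.
    $pos = strpos($url, '#');
    return $pos === false ? $url : substr($url, 0, $pos);
}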

Answer by Niraj patel

You can try this; it may be of help to you.

$search_string = 'american golf News: Fowler beats stellar field in Abu Dhabi';
// replace the placeholder below with the URL of the site you want to scrape
$html = file_get_contents('http://www.example.com/');
$dom = new DOMDocument;
$titalDom = new DOMDocument;
$tmpTitalDom = new DOMDocument;
libxml_use_internal_errors(true);
@$dom->loadHTML($html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$videos = $xpath->query('//div[@class="primary-content"]');
foreach ($videos as $key => $video) {
    $newdomaindom = new DOMDocument;
    $newnode = $newdomaindom->importNode($video, true);
    $newdomaindom->appendChild($newnode);
    @$titalDom->loadHTML($newdomaindom->saveHTML());
    $xpath1 = new DOMXPath($titalDom);
    $titles = $xpath1->query('//div[@class="listingcontainer"]/div[@class="list"]');
    // strcmp() returns 0 when the strings match
    if (strcmp(preg_replace('!\s+!', ' ', $titles->item(0)->nodeValue), $search_string) === 0) {
        $tmpNode = $tmpTitalDom->importNode($video, true);
        $tmpTitalDom->appendChild($tmpNode);
        break;
    }
}
echo $tmpTitalDom->saveHTML();