PHP: get the number of pages in a PDF document
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/14644353/
Get the number of pages in a PDF document
Asked by Richard de Wit
This question is for referencing and comparing. The solution is the accepted answer below.
Many hours have I searched for a fast and easy, but mostly accurate, way to get the number of pages in a PDF document. Since I work for a graphic printing and reproduction company that works a lot with PDFs, the number of pages in a document must be precisely known before they are processed. PDF documents come from many different clients, so they aren't generated with the same application and/or don't use the same compression method.
Here are some of the answers I found insufficient or simply NOT working:
Using Imagick (a PHP extension)
Imagick requires a lot of installation work, Apache needs to be restarted, and when I finally had it working, it took amazingly long to process (2-3 minutes per document) and it always returned 1 page for every document (I haven't seen a working copy of Imagick so far), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.
Using FPDI (a PHP library)
FPDI is easy to use and install (just extract files and call a PHP script), BUT many compression techniques are not supported by FPDI. It then returns an error:
FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.
Opening a stream and searching with a regular expression:
This opens the PDF file in a stream and searches for some kind of string, containing the pagecount or something similar.
$f = "test1.pdf";
// Open in binary mode and check the handle before reading from it
$stream = fopen($f, "rb");
if(!$stream)
    return 0;
$content = fread($stream, filesize($f));
fclose($stream);
if(!$content)
    return 0;
$count = 0;
// Regular Expressions found by Googling (all linked to SO answers):
$regex  = "/\/Count\s+(\d+)/";
$regex2 = "/\/Page\W*(\d+)/";
$regex3 = "/\/N\s+(\d+)/";
// $matches[1] holds the captured numbers; $matches itself is an array of arrays
if(preg_match_all($regex, $content, $matches))
    $count = (int) max($matches[1]);
return $count;
/\/Count\s+(\d+)/ (looks for /Count <number>) doesn't work because only a few documents have the /Count parameter inside, so most of the time it doesn't return anything.
/\/Page\W*(\d+)/ (looks for /Page<number>) doesn't get the number of pages; it mostly matches other data.
/\/N\s+(\d+)/ (looks for /N <number>) doesn't work either, as a document can contain multiple /N values, most of which, if not all, do not contain the page count.
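One way to see why these patterns are unreliable is to run them against a small sample. The fragment below is a hypothetical, simplified imitation of uncompressed PDF objects (the object numbers and values are made up, and the helper name capturedNumbers is only for illustration). In the PDF format, every node of the page tree carries its own /Count, and /N appears both in a linearization dictionary and in object streams, where it means something unrelated to pages. A rough sketch:

```php
<?php
// Hypothetical, hand-written fragment that imitates uncompressed PDF objects.
// Real files differ; object numbers and values here are made up.
$fragment = <<<'PDF'
1 0 obj << /Type /Pages /Count 13 /Kids [3 0 R 8 0 R] >> endobj
3 0 obj << /Type /Pages /Count 10 /Parent 1 0 R >> endobj
8 0 obj << /Type /Pages /Count 3 /Parent 1 0 R >> endobj
20 0 obj << /Linearized 1 /N 13 >> endobj
21 0 obj << /Type /ObjStm /N 42 /First 310 >> endobj
PDF;

// Collect every number captured by the first group of $regex.
function capturedNumbers(string $regex, string $content): array
{
    preg_match_all($regex, $content, $m);
    return array_map('intval', $m[1]);
}

$counts = capturedNumbers('/\/Count\s+(\d+)/', $fragment); // one value per page-tree node
$ns     = capturedNumbers('/\/N\s+(\d+)/', $fragment);     // page count AND object-stream size

echo implode(',', $counts), "\n"; // 13,10,3
echo implode(',', $ns), "\n";     // 13,42
```

Here taking the maximum of the /Count values happens to give 13, but taking the maximum of the /N values gives 42. When the page tree is stored inside a compressed object stream, neither pattern finds anything in the raw bytes, which matches the "doesn't return anything" behaviour described above.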
So, what does work reliably and accurately?
Answer by Richard de Wit
A simple command line executable called: pdfinfo.
It is downloadable for Linux and Windows. You download a compressed file containing several little PDF-related programs. Extract it somewhere.
One of those files is pdfinfo (or pdfinfo.exe for Windows). An example of data returned by running it on a PDF document:
Title:          test1.pdf
Author:         John Smith
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 9.2.0 (Windows)
CreationDate:   01/09/13 19:46:57
ModDate:        01/09/13 19:46:57
Tagged:         yes
Form:           none
Pages:          13    <-- This is what we need
Encrypted:      no
Page size:      2384 x 3370 pts (A0)
File size:      17569259 bytes
Optimized:      yes
PDF version:    1.6
I haven't yet seen a PDF document for which it returned a wrong page count. It is also really fast: even with big documents of 200+ MB, the response time is just a few seconds or less.
There is an easy way of extracting the pagecount from the output, here in PHP:
// Make a function for convenience
function getPDFPages($document)
{
    // Pick the binary for the current OS (adjust the paths to your install).
    // Single quotes matter on Windows: in double quotes "\t" would become a tab.
    $cmd = (PHP_OS_FAMILY === 'Windows')
        ? 'C:\path\to\pdfinfo.exe'  // Windows
        : '/path/to/pdfinfo';       // Linux
    // Parse entire output
    // escapeshellarg() quotes the file name, so names with spaces are safe
    exec($cmd . ' ' . escapeshellarg($document), $output);
    // Iterate through lines
    $pagecount = 0;
    foreach($output as $op)
    {
        // Extract the number
        if(preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
        {
            $pagecount = intval($matches[1]);
            break;
        }
    }
    return $pagecount;
}
// Use the function
echo getPDFPages("test 1.pdf");  // Output: 13
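The function above returns 0 both when pdfinfo fails and when no "Pages:" line is present. For a print workflow it may be useful to tell those cases apart from a real count. The following is only a sketch of one way to do that: the names parsePageCount and getPDFPagesOrNull are hypothetical, and it assumes pdfinfo is reachable on the PATH and exits with status 0 on success.

```php
<?php
// Find the "Pages:" line in pdfinfo output and return its number, or null.
function parsePageCount(array $lines): ?int
{
    foreach ($lines as $line) {
        if (preg_match('/^Pages:\s*(\d+)/', $line, $m) === 1) {
            return (int) $m[1];
        }
    }
    return null; // no "Pages:" line found
}

// Run pdfinfo (assumed to be on the PATH) and return the page count,
// or null if the command failed or printed no page count.
function getPDFPagesOrNull(string $document, string $pdfinfo = 'pdfinfo'): ?int
{
    exec($pdfinfo . ' ' . escapeshellarg($document) . ' 2>&1', $output, $exitCode);
    return $exitCode === 0 ? parsePageCount($output) : null;
}

// The parsing step can be checked without pdfinfo installed:
echo parsePageCount(['Title:          test1.pdf', 'Pages:          13']), "\n"; // 13
```

Keeping the parsing separate from the exec() call makes it possible to check the regular expression against saved pdfinfo output.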
Of course this command line tool can be used in other languages that can parse output from an external program, but I use it in PHP.
I know it's not pure PHP, but external programs are way better at PDF handling (as seen in the question).
I hope this can help people, because I have spent a whole lot of time trying to find the solution to this and I have seen a lot of questions about PDF pagecount in which I didn't find the answer I was looking for. That's why I made this question and answered it myself.
Answer by Kuldeep Dangi
The simplest of all is using ImageMagick.
Here is some sample code:
$image = new Imagick();
$image->pingImage('myPdfFile.pdf');
echo $image->getNumberImages();
Otherwise, you can also use PDF libraries for PHP such as MPDF or TCPDF.
Answer by Muad'Dib
If you can't install any additional packages, you can use this simple one-liner:
foundPages=$(strings < "$PDF_FILE" | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)
Answer by dhildreth
This seems to work pretty well, without the need for special packages or parsing command output.
<?php
$target_pdf = "multi-page-test.pdf";
// identify (from ImageMagick) prints one line per page
$cmd = sprintf("identify %s", escapeshellarg($target_pdf));
exec($cmd, $output);
$pages = count($output);
Answer by Saran
If you have access to a shell, the simplest approach (though it does not work on 100% of PDFs) is to use grep.
This should return just the number of pages:
grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf
Example: https://regex101.com/r/BrUTKn/1
Switches description:
-m 1 is necessary, as some files can have more than one match of the regex pattern (volunteer needed to replace this with a match-only-first regex solution)
-a is necessary to treat the binary file as text
-o shows only the match
-P uses Perl regular expressions
Regex explanation:
- starting "delimiter": (?<=\/N ) lookbehind for /N followed by a space
- actual result: \d+ one or more digits
- ending "delimiter": (?=\/) lookahead for /
Nota bene: if in some cases no match is found, it's safe to assume only 1 page exists.
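The same lookbehind/lookahead pattern can be tried from PHP, which makes it easier to experiment with on a string before pointing it at real files. This is only a sketch: the function name firstNValue is hypothetical, and the sample bytes are a made-up imitation of the header of a linearized PDF, which is the kind of place where a "/N <number>/" sequence holding the page count may appear.

```php
<?php
// Return the digits found between "/N " and the next "/", or null if absent.
function firstNValue(string $bytes): ?int
{
    // Same pattern as the grep call; preg_match stops at the first match,
    // which roughly plays the role of grep's -m 1.
    if (preg_match('/(?<=\/N )\d+(?=\/)/', $bytes, $m) === 1) {
        return (int) $m[0];
    }
    return null;
}

// Hypothetical header bytes of a linearized PDF (values are made up).
$sample = "%PDF-1.6\n1 0 obj\n<</Linearized 1/L 17569259/O 3/E 80416/N 13/T 17569000/H [ 516 210]>>\nendobj\n";

var_dump(firstNValue($sample));      // int(13)
var_dump(firstNValue("%PDF-1.4\n")); // NULL
```

For a real file, the bytes could come from file_get_contents('file.pdf'). Returning null instead of silently assuming 1 page leaves the fallback decision to the caller.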
Answer by Franck Dernoncourt
Answer by SuperNova
You can use qpdf as shown below. If a file file_name.pdf has 100 pages:
$ qpdf --show-npages file_name.pdf
100
Answer by james-geldart
I created a wrapper class for pdfinfo, based on Richard's answer, in case it's useful to anyone.
/**
 * Wrapper for pdfinfo program, part of xpdf bundle
 * http://www.xpdfreader.com/about.html
 * 
 * this will put all pdfinfo output into keyed array, then make them accessible via getValue
 */
class PDFInfoWrapper {
    const PDFINFO_CMD = 'pdfinfo';
    /**
     * keyed array to hold all the info
     */
    protected $info = array();
    /**
     * raw output in case we need it
     */
    public $raw = "";
    /**
     * Constructor
     * @param string $filePath - path to file
     */
    public function __construct($filePath) {
        // escapeshellarg() quotes the path so spaces and shell metacharacters are safe
        exec(self::PDFINFO_CMD . ' ' . escapeshellarg($filePath), $output);
        //loop each line and split into key and value
        foreach($output as $line) {
            $colon = strpos($line, ':');
            if($colon) {
                $key = trim(substr($line, 0, $colon));
                $val = trim(substr($line, $colon + 1));
                //use strtolower to make case insensitive
                $this->info[strtolower($key)] = $val;
            }
        }
        //store the raw output
        $this->raw = implode("\n", $output);
    }
    /**
     * get a value
     * @param string $key - key name, case insensitive
     * @return string|null value, or null if the key is not present
     */
    public function getValue($key) {
        return $this->info[strtolower($key)] ?? null;
    }
    /**
     * list all the keys
     * @return array of key names
     */
    public function getAllKeys() {
        return array_keys($this->info);
    }
}
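The key/value splitting used by the class can be tried on its own, without pdfinfo installed, by feeding it a few sample lines. The sketch below uses a hypothetical standalone function, parsePdfinfoLines, that mirrors the loop in the constructor; the input lines are taken from the example output earlier in this thread. Splitting at the first colon only is what keeps values such as timestamps intact.

```php
<?php
// Split "Key:   value" lines at the first colon, lower-casing the keys.
function parsePdfinfoLines(array $lines): array
{
    $info = [];
    foreach ($lines as $line) {
        $colon = strpos($line, ':');
        // Skip lines without a colon or with an empty key
        if ($colon !== false && $colon > 0) {
            $key = strtolower(trim(substr($line, 0, $colon)));
            $info[$key] = trim(substr($line, $colon + 1));
        }
    }
    return $info;
}

$info = parsePdfinfoLines([
    'Title:          test1.pdf',
    'CreationDate:   01/09/13 19:46:57',
    'Pages:          13',
    'Page size:      2384 x 3370 pts (A0)',
]);

echo $info['pages'], "\n";        // 13
echo $info['creationdate'], "\n"; // 01/09/13 19:46:57
```

Note that the values are strings, so a page count would still need a cast such as (int) $info['pages'] before arithmetic.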
Answer by Feiming Chen
Here is an R function that reports the number of pages in a PDF file by using the pdfinfo command.
pdf.file.page.number <- function(fname) {
    # shQuote() protects file names that contain spaces
    a <- pipe(paste("pdfinfo", shQuote(fname), "| grep Pages | cut -d: -f2"))
    page.number <- as.numeric(readLines(a))
    close(a)
    page.number
}
if (F) {
    pdf.file.page.number("a.pdf")
}
Answer by commander
Here is a Windows command script that uses Ghostscript (gs) to report the number of pages in a PDF file:
@echo off
echo.
rem
rem this file: getlastpagenumber.cmd
rem version 0.1 from commander 2015-11-03
rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
rem Install path "C:\prg\ghostscript" for using the script without changes \ and have less problems with UAC
rem
:vars
  set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
  set __lastpagenumber__=1
  set __pdffile__="%~1"
  set __pdffilename__="%~n1"
  set __datetime__=%date%%time%
  set __datetime__=%__datetime__:.=%
  set __datetime__=%__datetime__::=%
  set __datetime__=%__datetime__:,=%
  set __datetime__=%__datetime__:/=%
  set __datetime__=%__datetime__: =%
  set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"
:check
  if %__pdffile__%=="" goto error1
  if not exist %__pdffile__% goto error2
  if not exist %__gs__% goto error3
:main
  %__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE  -sstdout=%__tmpfile__%  %__pdffile__%
  FOR /F "tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A
  set __lastpagenumber__=%__lastpagenumber__: =%
  if exist %__tmpfile__% del %__tmpfile__%
:output
  echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
  goto end
:error1
  echo no pdf file selected
  echo usage: %~n0 PDFFILE
  goto end
:error2
  echo no pdf file found
  echo usage: %~n0 PDFFILE
  goto end
:error3
  echo.can not find the ghostscript bin file
  echo.   %__gs__%
  echo.please download it from:
  echo.   http://www.ghostscript.com/download/
  echo.and install to "C:\prg\ghostscript"
  goto end
:end
  exit /b

