PHP: Get the number of pages in a PDF document

Question

Asked by Richard de Wit

This question is for referencing and comparing. The solution is the accepted answer below.

I have spent many hours searching for a fast and easy, but above all accurate, way to get the number of pages in a PDF document. Since I work for a graphic printing and reproduction company that works a lot with PDFs, the number of pages in a document must be known precisely before it is processed. PDF documents come from many different clients, so they aren't generated with the same application and/or don't use the same compression method.

Here are some of the answers I found insufficient or simply NOT working:

Using Imagick (a PHP extension)

Imagick requires a lot of installation work, Apache needs to be restarted, and when I finally had it working, it took amazingly long to process (2-3 minutes per document) and it always returned 1 page for every document (I haven't seen a working copy of Imagick so far), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.

Using FPDI (a PHP library)

FPDI is easy to use and install (just extract the files and call a PHP script), BUT many compression techniques are not supported by FPDI. In that case it returns an error:

FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.

Opening a stream and searching with a regular expression:

This opens the PDF file as a stream and searches for some kind of string containing the page count or something similar.

$f = "test1.pdf";
$stream = fopen($f, "rb");   // binary-safe read
if (!$stream)
    return 0;

$content = fread($stream, filesize($f));
fclose($stream);
if (!$content)
    return 0;

$count = 0;
// Regular expressions found by Googling (all linked to SO answers):
$regex  = "/\/Count\s+(\d+)/";
$regex2 = "/\/Page\W*(\d+)/";
$regex3 = "/\/N\s+(\d+)/";

// $matches[1] holds the captured numbers; take the largest one
if (preg_match_all($regex, $content, $matches))
    $count = max(array_map('intval', $matches[1]));

return $count;

/\/Count\s+(\d+)/ (looks for /Count <number>) doesn't work because only a few documents have the /Count parameter inside, so most of the time it doesn't return anything.
/\/Page\W*(\d+)/ (looks for /Page<number>) doesn't get the number of pages; the match mostly contains some other data.
/\/N\s+(\d+)/ (looks for /N <number>) doesn't work either, as the documents can contain multiple values of /N; most, if not all, do not contain the page count.
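
To make the first failure mode concrete, here is a minimal sketch (not part of the original question) that runs the /Count pattern over two made-up PDF fragments. The function name maxCountInRawPdf and both sample strings are hypothetical. In an uncompressed file every page-tree node carries its own /Count, so the largest value is the one belonging to the root node. When the page tree is stored inside a compressed object stream (possible since PDF 1.5), the raw bytes contain no readable /Count at all and the scan finds nothing.

```php
<?php
// Hypothetical helper: largest "/Count <n>" value found in raw PDF bytes,
// or null when the pattern never matches.
function maxCountInRawPdf(string $content): ?int
{
    if (preg_match_all('/\/Count\s+(\d+)/', $content, $matches) < 1) {
        return null;
    }
    // $matches[1] holds the captured digit strings
    return max(array_map('intval', $matches[1]));
}

// Made-up fragment: root page-tree node plus one intermediate node
$uncompressed = "1 0 obj << /Type /Pages /Kids [2 0 R 3 0 R] /Count 13 >> endobj\n"
              . "2 0 obj << /Type /Pages /Kids [4 0 R] /Count 5 >> endobj\n";

// Made-up fragment: the page tree sits inside a compressed object stream
$compressed = "7 0 obj << /Type /ObjStm /N 4 /First 20 /Filter /FlateDecode >> stream ... endstream";

var_dump(maxCountInRawPdf($uncompressed)); // int(13)
var_dump(maxCountInRawPdf($compressed));   // NULL
```

The second result is the case described above: nothing is returned even though the document has pages.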

So, what does work reliably and accurately?
See the answer below.

Answer 1

Answered by Richard de Wit

A simple command line executable called: pdfinfo.

It is downloadable for Linux and Windows. You download a compressed file containing several little PDF-related programs. Extract it somewhere.

One of those files is pdfinfo (or pdfinfo.exe for Windows). Here is an example of the data returned by running it on a PDF document:

Title:          test1.pdf
Author:         John Smith
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 9.2.0 (Windows)
CreationDate:   01/09/13 19:46:57
ModDate:        01/09/13 19:46:57
Tagged:         yes
Form:           none
Pages:          13    <-- This is what we need
Encrypted:      no
Page size:      2384 x 3370 pts (A0)
File size:      17569259 bytes
Optimized:      yes
PDF version:    1.6

I haven't seen a PDF document for which it returned a wrong page count (yet). It is also really fast: even with big documents of 200+ MB, the response time is just a few seconds or less.

There is an easy way of extracting the page count from the output, here in PHP:

// Make a function for convenience
function getPDFPages($document)
{
    $cmd = "/path/to/pdfinfo";             // Linux
    // $cmd = 'C:\path\to\pdfinfo.exe';    // Windows: use this line instead
    //                                        (single quotes keep the backslashes literal)

    // Run pdfinfo and collect its entire output, one array element per line.
    // escapeshellarg() quotes the file name, so names with spaces are safe.
    exec($cmd . " " . escapeshellarg($document), $output);

    // Iterate through lines
    $pagecount = 0;
    foreach($output as $op)
    {
        // Extract the number from the "Pages:" line
        if(preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
        {
            $pagecount = intval($matches[1]);
            break;
        }
    }

    return $pagecount;
}

// Use the function
echo getPDFPages("test 1.pdf");  // Output: 13
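
One limitation of the function above is that it returns 0 both for an unreadable file and for a missing pdfinfo binary. The following sketch, which is an assumption-laden variant rather than the author's code, separates parsing from execution and uses the exit status that exec() reports through its third argument. The names parsePdfinfoPages and getPDFPagesOrNull are hypothetical, and pdfinfo is assumed to be on the PATH.

```php
<?php
// Hypothetical helper: pull the page count out of pdfinfo's output lines.
function parsePdfinfoPages(array $lines): ?int
{
    foreach ($lines as $line) {
        if (preg_match('/^Pages:\s*(\d+)/i', $line, $m) === 1) {
            return (int) $m[1];
        }
    }
    return null; // no "Pages:" line present
}

// Hypothetical wrapper: null means "could not determine", never "0 pages".
function getPDFPagesOrNull(string $document, string $cmd = 'pdfinfo'): ?int
{
    $output = [];
    $status = 0;
    exec($cmd . ' ' . escapeshellarg($document), $output, $status);

    // A non-zero exit status means pdfinfo failed or could not be started
    if ($status !== 0) {
        return null;
    }
    return parsePdfinfoPages($output);
}

// Parsing can be checked without pdfinfo installed,
// using lines taken from the sample output shown earlier
var_dump(parsePdfinfoPages(['Form:           none', 'Pages:          13', 'Encrypted:      no'])); // int(13)
```

Keeping the parser separate also makes it easy to reuse with output captured some other way, for example from a log file.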

Of course this command line tool can be used in other languages that can parse output from an external program, but I use it in PHP.

I know it's not pure PHP, but external programs are way better at PDF handling (as seen in the question).

I hope this can help people, because I have spent a whole lot of time trying to find the solution to this and I have seen a lot of questions about PDF pagecount in which I didn't find the answer I was looking for. That's why I made this question and answered it myself.

Answer 2

Answered by Kuldeep Dangi

The simplest of all is using ImageMagick.

Here is some sample code:

$image = new Imagick();
$image->pingImage('myPdfFile.pdf');
echo $image->getNumberImages();

Otherwise, you can also use PDF libraries for PHP, such as MPDF or TCPDF.

Answer 3

Answered by Muad'Dib

If you can't install any additional packages, you can use this simple one-liner:

foundPages=$(strings < "$PDF_FILE" | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)

Answer 4

Answered by dhildreth

This seems to work pretty well, without the need for special packages or parsing command output.

<?php                                                                               

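$target_pdf = "multi-page-test.pdf";
// identify (ImageMagick) prints one line per page, so counting the lines gives the page count
$cmd = sprintf("identify %s", escapeshellarg($target_pdf));
exec($cmd, $output);
$pages = count($output);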

Answer 5

Answered by Saran

If you have shell access, the simplest approach (though not usable on 100% of PDFs) is to use grep.

This should return just the number of pages:

grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf

Example: https://regex101.com/r/BrUTKn/1

Description of the switches:

-m 1 is necessary, as some files can contain more than one match of the regex pattern (volunteer needed to replace this with a regex that matches only the first occurrence)
-a is necessary to treat the binary file as text
-o shows only the match
-P uses Perl regular expressions

Regex explanation:

starting "delimiter": (?<=\/N ) is a lookbehind for /N followed by a space character
actual result: \d+ matches one or more digits
ending "delimiter": (?=\/) is a lookahead for /

Nota bene: if no match is found, it's safe to assume that only 1 page exists.
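
For readers who want the same check from PHP rather than the shell, here is a rough equivalent. It is a sketch under assumptions: the function name pagesFromLinearizationHint and the sample dictionary are made up for illustration. The /N entry this pattern is most likely to hit first is the page count stored in the linearization dictionary of linearized ("fast web view") PDFs, which would explain why the approach finds nothing in files that were not linearized.

```php
<?php
// Hypothetical helper: first "/N <digits>/" match in the raw bytes, or null.
function pagesFromLinearizationHint(string $content): ?int
{
    // Same pattern as the grep call: lookbehind for "/N ", digits, lookahead for "/"
    if (preg_match('/(?<=\/N )\d+(?=\/)/', $content, $m) === 1) {
        return (int) $m[0];
    }
    return null;
}

// Made-up linearization dictionary, written in the compact form the pattern expects
$sample = '<</Linearized 1/L 17569259/O 15/E 120000/N 13/T 17568900/H [ 500 200]>>';

var_dump(pagesFromLinearizationHint($sample));            // int(13)
var_dump(pagesFromLinearizationHint('<< /Count 13 >>'));  // NULL
```

Like grep -m 1, preg_match() stops at the first match, so any later /N values in the file are ignored.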

Answer 6

Answered by Franck Dernoncourt

Since you're OK with using command-line utilities, you can use cpdf (Microsoft Windows/Linux/Mac OS X). To obtain the number of pages in a PDF:

cpdf.exe -pages "my file.pdf"

Answer 7

Answered by SuperNova

You can use qpdf as shown below. If a file file_name.pdf has 100 pages:

$ qpdf --show-npages file_name.pdf
100
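
Because qpdf --show-npages prints nothing but the number, calling it from PHP needs very little parsing. The sketch below assumes qpdf is on the PATH; the function names parseNpagesOutput and qpdfPageCount are hypothetical.

```php
<?php
// Hypothetical helper: accept the output only if it is a single non-negative integer.
function parseNpagesOutput(array $lines): ?int
{
    $text = trim(implode("\n", $lines));
    return ctype_digit($text) ? (int) $text : null;
}

// Hypothetical wrapper around the command shown above.
function qpdfPageCount(string $file): ?int
{
    $output = [];
    $status = 0;
    exec('qpdf --show-npages ' . escapeshellarg($file), $output, $status);

    // Treat any non-zero exit status as "unknown" (a conservative choice)
    return $status === 0 ? parseNpagesOutput($output) : null;
}

var_dump(parseNpagesOutput(['100']));             // int(100)
var_dump(parseNpagesOutput(['qpdf: not a PDF'])); // NULL
```

Returning null instead of 0 on failure keeps "tool failed" distinguishable from a real page count.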

Answer 8

Answered by james-geldart

I created a wrapper class for pdfinfo, based on Richard's answer, in case it's useful to anyone:

/**
 * Wrapper for pdfinfo program, part of xpdf bundle
 * http://www.xpdfreader.com/about.html
 * 
 * this will put all pdfinfo output into keyed array, then make them accessible via getValue
 */
class PDFInfoWrapper {

    const PDFINFO_CMD = 'pdfinfo';

    /**
     * keyed array to hold all the info
     */
    protected $info = array();

    /**
     * raw output in case we need it
     */
    public $raw = "";

    /**
     * Constructor
     * @param string $filePath - path to file
     */
    public function __construct($filePath) {
        // escapeshellarg() quotes the path safely, including spaces and quotes
        exec(self::PDFINFO_CMD . ' ' . escapeshellarg($filePath), $output);

        //loop each line and split into key and value
        foreach($output as $line) {
            $colon = strpos($line, ':');
            if($colon) {
                $key = trim(substr($line, 0, $colon));
                $val = trim(substr($line, $colon + 1));

                //use strtolower to make case insensitive
                $this->info[strtolower($key)] = $val;
            }
        }

        //store the raw output
        $this->raw = implode("\n", $output);

    }

    /**
     * get a value
     * @param string $key - key name, case insensitive
     * @return string|null value, or null if the key is not present
     */
    public function getValue($key) {
        return $this->info[strtolower($key)] ?? null;
    }

    /**
     * list all the keys
     * @returns array of key names
     */
    public function getAllKeys() {
        return array_keys($this->info);
    }

}

Answer 9

Answered by Feiming Chen

Here is an R function that reports the number of pages in a PDF file by using the pdfinfo command.

pdf.file.page.number <- function(fname) {
    # shQuote() protects file names that contain spaces or shell metacharacters
    a <- pipe(paste("pdfinfo", shQuote(fname), "| grep Pages | cut -d: -f2"))
    page.number <- as.numeric(readLines(a))
    close(a)
    page.number
}
if (F) {
    pdf.file.page.number("a.pdf")
}

Answer 10

Answered by commander

Here is a Windows command script that uses Ghostscript (gs) to report the number of pages in a PDF file:

@echo off
echo.
rem
rem this file: getlastpagenumber.cmd
rem version 0.1 from commander 2015-11-03
rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
rem Install path "C:\prg\ghostscript" for using the script without changes \ and have less problems with UAC
rem

:vars
  set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
  set __lastpagenumber__=1
  set __pdffile__="%~1"
  set __pdffilename__="%~n1"
  set __datetime__=%date%%time%
  set __datetime__=%__datetime__:.=%
  set __datetime__=%__datetime__::=%
  set __datetime__=%__datetime__:,=%
  set __datetime__=%__datetime__:/=%
  set __datetime__=%__datetime__: =%
  set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"

:check
  if %__pdffile__%=="" goto error1
  if not exist %__pdffile__% goto error2
  if not exist %__gs__% goto error3

:main
  %__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE  -sstdout=%__tmpfile__%  %__pdffile__%
  FOR /F " tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A
  set __lastpagenumber__=%__lastpagenumber__: =%
  if exist %__tmpfile__% del %__tmpfile__%

:output
  echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
  goto end

:error1
  echo no pdf file selected
  echo usage: %~n0 PDFFILE
  goto end

:error2
  echo no pdf file found
  echo usage: %~n0 PDFFILE
  goto end

:error3
  echo.can not find the ghostscript bin file
  echo.   %__gs__%
  echo.please download it from:
  echo.   http://www.ghostscript.com/download/
  echo.and install to "C:\prg\ghostscript"
  goto end

:end
  exit /b
