Reading PDF metadata with PHP

Question

Asked by

I'm trying to read metadata attached to arbitrary PDFs: title, author, subject, and keywords.

Is there a PHP library, preferably open-source, that can read PDF metadata? If so, how would one use it to extract the metadata? If not, how could the metadata be extracted without a library?

To be clear, I'm not interested in creating or modifying PDFs or their metadata, and I don't care about the PDF bodies. I've looked at a number of libraries, including FPDF (which everyone seems to recommend), but it appears only to be for PDF creation, not metadata extraction.

Answer 1

Accepted answer by

The Zend framework includes Zend_Pdf, which makes this really easy:

$pdf = Zend_Pdf::load($pdfPath);

echo $pdf->properties['Title'] . "\n";
echo $pdf->properties['Author'] . "\n";

Limitations: it works only on unencrypted files smaller than 16 MB.

Answer 2

Answered by Alessandro Cosentino

PDF Parser does exactly what you want, and it's pretty straightforward to use:

// Parse the file, then fetch the document details (metadata) as an array
$parser  = new \Smalot\PdfParser\Parser();
$pdf     = $parser->parseFile('document.pdf');
$details = $pdf->getDetails();

You can try it on the demo page.

Answer 3

Answered by cbrandolino

I don't know about libraries, but a simple way to achieve the same result might be to fopen() the file and parse everything that comes after the last "endstream".

Try opening a PDF in a text editor; a parser shouldn't take more than five lines.
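
The idea above could be sketched roughly as follows. This is only a minimal illustration, not a full parser: it assumes the metadata is stored uncompressed and unencrypted as literal strings such as /Title (My Document) somewhere after the last "endstream". The function name extractInfoEntries() and the default key list are made up for this example.

```php
<?php
// Hypothetical helper (not part of any library): pull simple literal-string
// entries such as "/Title (Some text)" out of raw PDF bytes.
// Assumptions: no encryption, no compressed object streams, no hex strings,
// and no unescaped nested parentheses inside the values.
function extractInfoEntries(string $raw, array $keys = ['Title', 'Author', 'Subject', 'Keywords']): array
{
    // Only look at what follows the last "endstream"; fall back to the
    // whole file when that marker is absent.
    $pos  = strrpos($raw, 'endstream');
    $tail = ($pos === false) ? $raw : substr($raw, $pos);

    $found = [];
    foreach ($keys as $key) {
        // "/Key (value)", where value may contain backslash-escaped characters.
        $pattern = '/\/' . preg_quote($key, '/') . '\s*\(((?:\\\\.|[^\\\\()])*)\)/s';
        if (preg_match($pattern, $tail, $m)) {
            // Drop the backslash in front of escaped characters, so "\(" becomes "(".
            // Octal and "\n"-style escapes are not decoded here.
            $found[$key] = preg_replace('/\\\\(.)/s', '$1', $m[1]);
        }
    }
    return $found;
}

// Possible usage (the file name is a placeholder):
// $info = extractInfoEntries(file_get_contents('document.pdf'));
// echo $info['Title'] ?? '(no title found)';
```

Keys that are not found are simply left out of the returned array, so callers can use isset() or ?? to handle missing metadata. For encrypted files, or files that keep their metadata in compressed streams, one of the libraries from the other answers is likely a better fit.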

Answer 4

Answered by maxpower9000

I was looking for the same thing today, and I came across a small PHP class over at http://de77.com/ that offers a quick and dirty solution. You can download the class directly. Output is UTF-8 encoded.

The creator says:

Here's a PHP class I wrote which can be used to get title & author and a number of pages of any PDF file. It does not use any external application - just pure PHP.

// basic example
include 'PDFInfo.php';
$p = new PDFInfo;
$p->load('file.pdf');
echo $p->author;
echo $p->title;
echo $p->pages;

For me, it works! All thanks go solely to the creator of the class ... well, maybe just a little bit of thanks to me too for finding the class ;)

Answer 5

Answered by ved uniyalas

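<?php

    $sourcefile = "file path";
    // Read the whole PDF into a string (true = also search the include path).
    $stringedPDF = file_get_contents($sourcefile, true);

    // Match the first value after "Title " wrapped in (...) or [...], delimiters included.
    if (preg_match('/(?<=Title )\S(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))./', $stringedPDF, $title)) {
        echo $title[0];
    }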

Answer 6

Answered by maxpower9000

You may use PDFtk to extract the page count:

// Windows
$bin = realpath('C:\pdftk\bin\pdftk.exe');
$cmd = "cmd /c {$bin} {$path} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";

// Unix
$cmd = "pdftk {$path} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";

If ImageMagick is available, you may also use:

$cmd = "identify -format %n {$path}";

Execute in PHP via shell_exec():

$res = shell_exec($cmd);
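
Since the commands above interpolate {$path} straight into a shell string, it may be worth escaping the path and parsing the number in PHP rather than with grep and sed. The following is just a sketch under those assumptions; buildPageCountCommand() and parsePageCount() are hypothetical helper names, and it assumes pdftk is on the PATH and prints a "NumberOfPages: N" line in its dump_data output.

```php
<?php
// Hypothetical helper: build the pdftk command with the path safely quoted.
function buildPageCountCommand(string $path): string
{
    return 'pdftk ' . escapeshellarg($path) . ' dump_data';
}

// Hypothetical helper: find the "NumberOfPages: N" line in the dump.
// Returns null when the input is not a string (shell_exec() can return
// null or false) or when the line is missing.
function parsePageCount($dump): ?int
{
    if (is_string($dump) && preg_match('/^NumberOfPages:\s*(\d+)/m', $dump, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Possible usage (the path is a placeholder):
// $pages = parsePageCount(shell_exec(buildPageCountCommand('/path/to/file.pdf')));
```

Doing the filtering in PHP also avoids relying on grep and sed, which may not be installed on Windows.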
