Reading PDF metadata in PHP

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/4493189/

php pdf metadata

Asked by

I'm trying to read metadata attached to arbitrary PDFs: title, author, subject, and keywords.

Is there a PHP library, preferably open-source, that can read PDF metadata? If so, or if there isn't, how would one use the library (or lack thereof) to extract the metadata?

To be clear, I'm not interested in creating or modifying PDFs or their metadata, and I don't care about the PDF bodies. I've looked at a number of libraries, including FPDF (which everyone seems to recommend), but it appears only to be for PDF creation, not metadata extraction.

Accepted answer by

The Zend framework includes Zend_Pdf, which makes this really easy:

$pdf = Zend_Pdf::load($pdfPath);

echo $pdf->properties['Title'] . "\n";
echo $pdf->properties['Author'] . "\n";

Limitations: works only on unencrypted files smaller than 16MB.

Answer by Alessandro Cosentino

PDF Parser does exactly what you want and it's pretty straightforward to use:

$parser  = new \Smalot\PdfParser\Parser();
$pdf     = $parser->parseFile('document.pdf');
$details = $pdf->getDetails(); // array of document metadata

You can try it on the demo page.

Answer by cbrandolino

Don't know about libraries, but a simple way to achieve the same result might be fopening the file and parsing everything that comes after the last "endstream".

Try opening a PDF in a text editor; a parser shouldn't take more than five lines.
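
As a rough illustration of that idea, the sketch below scans the bytes after the last "endstream" for entries such as /Title (...). It assumes an unencrypted file whose Info dictionary really does sit there and is written as plain literal strings without nested or escaped parentheses, which is not guaranteed for every PDF. The function name readPdfInfo and the inline sample are made up for illustration.

```php
<?php
// Hypothetical helper: extract simple Info entries from raw PDF bytes.
// Assumes literal strings like /Title (Some text) with no nested or
// escaped parentheses, located after the last "endstream" keyword.
function readPdfInfo(string $raw, array $keys = ['Title', 'Author', 'Subject', 'Keywords']): array
{
    // Keep only the tail of the file; fall back to the whole string
    // when no stream is present.
    $pos  = strrpos($raw, 'endstream');
    $tail = $pos === false ? $raw : substr($raw, $pos + strlen('endstream'));

    $info = [];
    foreach ($keys as $key) {
        // /Key, optional whitespace, then a (literal string).
        if (preg_match('~/' . preg_quote($key, '~') . '\s*\(([^()]*)\)~', $tail, $m)) {
            $info[$key] = $m[1];
        }
    }
    return $info;
}

// Inline sample standing in for file_get_contents('file.pdf').
$sample = "stream\n...binary...\nendstream\nendobj\n"
        . "5 0 obj\n<< /Title (Annual Report) /Author (Jane Doe) >>\nendobj\n";

print_r(readPdfInfo($sample));
```

Hex strings, UTF-16 encoded values, and compressed object streams are not handled here, so a real library is the safer choice for arbitrary files.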

Answer by maxpower9000

I was looking for the same thing today, and I came across a small PHP class over at http://de77.com/ that offers a quick and dirty solution. You can download the class directly. Output is UTF-8 encoded.

The creator says:

Here's a PHP class I wrote which can be used to get title & author and a number of pages of any PDF file. It does not use any external application - just pure PHP.

// basic example
include 'PDFInfo.php';
$p = new PDFInfo;
$p->load('file.pdf');
echo $p->author;
echo $p->title;
echo $p->pages;

For me, it works! All thanks go solely to the creator of the class ... well, maybe just a little bit of thanks to me too for finding the class ;)

Answer by ved uniyalas

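<?php

    $sourcefile = "file path";
    // Read the raw PDF bytes into a string.
    $stringedPDF = file_get_contents($sourcefile);

    // Match the value after "Title " when it is written as (...) or [...].
    if (preg_match('/(?<=Title )\S(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))./', $stringedPDF, $title)) {
        echo $title[0];
    }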
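
Note that the pattern above returns the value together with its enclosing delimiters, for example (My Document). A small sketch of stripping them, run against an inline sample rather than a real file; the helper name titleFromPdfString is made up for illustration:

```php
<?php
// Hypothetical wrapper around the regex from the answer above.
function titleFromPdfString(string $stringedPDF): ?string
{
    $pattern = '/(?<=Title )\S(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))./';
    if (!preg_match($pattern, $stringedPDF, $title)) {
        return null; // no Title entry in the expected form
    }
    // $title[0] still includes the enclosing ( ) or [ ]; drop them.
    return substr($title[0], 1, -1);
}

// Inline sample standing in for file_get_contents($sourcefile).
$sample = "<< /Title (My Document) /Author (Jane Doe) >>";
echo titleFromPdfString($sample), "\n"; // prints "My Document"
```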

Answer by maxpower9000

You may use PDFtk to extract the page count:

// Windows
$bin = realpath('C:\pdftk\bin\pdftk.exe');
$cmd = "cmd /c {$bin} {$path} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";

// Unix
$cmd = "pdftk {$path} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";

If ImageMagick is available you may also use:

$cmd = "identify -format %n {$path}";

Execute in PHP via shell_exec():

$res = shell_exec($cmd);
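
Since the question also asks for title and author, the same dump_data output could be parsed in PHP instead of being piped through grep and sed. The sketch below assumes the "InfoKey:" / "InfoValue:" / "NumberOfPages:" line format that pdftk prints; the function name parseDumpData and the sample text are illustrative only. When building the real command, escapeshellarg() keeps paths with spaces or shell metacharacters from breaking it.

```php
<?php
// Hypothetical parser for the text printed by `pdftk file.pdf dump_data`.
function parseDumpData(string $output): array
{
    $meta = [];
    $key  = null;
    foreach (preg_split('/\R/', $output) as $line) {
        if (preg_match('/^InfoKey:\s*(.+)$/', $line, $m)) {
            $key = trim($m[1]);                  // e.g. "Title"
        } elseif ($key !== null && preg_match('/^InfoValue:\s*(.*)$/', $line, $m)) {
            $meta[$key] = trim($m[1]);           // value for the preceding key
            $key = null;
        } elseif (preg_match('/^NumberOfPages:\s*(\d+)/', $line, $m)) {
            $meta['NumberOfPages'] = (int) $m[1];
        }
    }
    return $meta;
}

// Sample output standing in for:
//   shell_exec('pdftk ' . escapeshellarg($path) . ' dump_data');
$sample = "InfoBegin\nInfoKey: Title\nInfoValue: Annual Report\n"
        . "InfoBegin\nInfoKey: Author\nInfoValue: Jane Doe\n"
        . "NumberOfPages: 12\n";

print_r(parseDumpData($sample));
```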