PHP: Read PDF files with PHP

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/1004478/

Time: 2020-08-25 00:36:22  Source: igfitidea

Read pdf files with php

Tags: php, pdf

Asked by Ryan Doherty

I have a large PDF file that is a floor map for a building. It has layers for all the office furniture including text boxes of seat location.

My goal is to read this file with PHP, search the document for text layers, get their contents and coordinates in the file. This way I can map out seat locations -> x/y coordinates.

Is there any way to do this via PHP? (Or even Ruby or Python if that's what's necessary)

Accepted answer by Jay

Check out FPDF (with FPDI):

http://www.fpdf.org/

http://www.setasign.de/products/pdf-php-solutions/fpdi/

These will let you open a PDF and add content to it in PHP. I'm guessing you can also use their functionality to search through the existing content for the values you need.

Another possible library is TCPDF: http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=tcpdf

Update to add a more modern library: PDF Parser

Answer by kasper Taeymans

There is a PHP library (pdfparser) that does exactly what you want.

project website

http://www.pdfparser.org/

github

https://github.com/smalot/pdfparser

Demo page/api

http://www.pdfparser.org/demo

After including pdfparser in your project, you can get all the text from mypdf.pdf like so:

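<?php
// Assumes pdfparser was installed with Composer (package smalot/pdfparser).
require 'vendor/autoload.php';

$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('mypdf.pdf');
$text   = $pdf->getText();
echo $text; // all text from mypdf.pdf
?>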

Similarly, you can get the metadata from the PDF as well as the PDF objects (for example, images).
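
As a rough illustration of the metadata part, the sketch below collects the document details and the text of each page from an already parsed document. It assumes the `getDetails()`, `getPages()` and `getText()` methods of smalot/pdfparser; `describePdf()` is a hypothetical helper name, not part of the library, and the library itself must be loaded by the caller.

```php
<?php
// Sketch only: describePdf() is a hypothetical helper, not a pdfparser function.
// $pdf is expected to be the document object returned by Parser::parseFile().

function describePdf($pdf): array
{
    $pages = [];
    foreach ($pdf->getPages() as $index => $page) {
        // Store page text under a 1-based page number.
        $pages[$index + 1] = $page->getText();
    }

    return [
        'details' => $pdf->getDetails(), // metadata such as title, author, page count
        'pages'   => $pages,
    ];
}
```

With the parser from the example above, this would be called as `describePdf($parser->parseFile('mypdf.pdf'))`. Note that this gives text per page, not coordinates.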

Answer by Rado

Not exactly php, but you could exec a program from php to convert the pdf to a temporary html file and then parse the resulting file with php. I've done something similar for a project of mine and this is the program I used:

PdfToHtml
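
Calling such a converter from PHP might look like the sketch below. It assumes the poppler `pdftohtml` command-line tool is on the PATH (the answer does not say which build was used), and `buildPdfToHtmlCommand()` is a hypothetical helper name.

```php
<?php
// Sketch under assumptions: uses poppler's pdftohtml with its -c and -noframes
// options; buildPdfToHtmlCommand() is a hypothetical helper, not a library call.

function buildPdfToHtmlCommand(string $pdfPath, string $outputBase): string
{
    // -c requests "complex" output that keeps the page layout,
    // -noframes avoids the frameset wrapper.
    // escapeshellarg() protects against spaces and shell metacharacters in paths.
    return sprintf(
        'pdftohtml -c -noframes %s %s',
        escapeshellarg($pdfPath),
        escapeshellarg($outputBase)
    );
}

// Usage (not executed here):
// exec(buildPdfToHtmlCommand('floor.pdf', '/tmp/floor'), $output, $status);
// A non-zero $status means the conversion failed.
```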

The resulting HTML wraps text elements in <div> tags with absolute position coordinates. It seems like this is exactly what you are trying to do.
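
Reading those coordinates back in PHP could be sketched as follows. This assumes the generated file contains elements shaped like `<div style="position:absolute;top:120px;left:45px">Seat 12</div>`; the exact markup depends on the converter and its version, and `extractPositionedText()` is a hypothetical helper name.

```php
<?php
// Sketch under assumptions: text sits in <div> elements whose inline style
// carries "top" and "left" in pixels. Adjust the tag name and patterns to the
// markup your converter really produces.
// extractPositionedText() is a hypothetical helper, not part of any library.

function extractPositionedText(string $html): array
{
    $doc = new DOMDocument();
    // Converter output is rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $items = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        $style = $div->getAttribute('style');
        // The lookbehind skips properties such as margin-top or padding-left.
        if (!preg_match('/(?<![\w-])top\s*:\s*(-?\d+)px/i', $style, $top) ||
            !preg_match('/(?<![\w-])left\s*:\s*(-?\d+)px/i', $style, $left)) {
            continue; // not an absolutely positioned element
        }
        $items[] = [
            'text' => trim($div->textContent),
            'x'    => (int) $left[1],
            'y'    => (int) $top[1],
        ];
    }

    return $items;
}
```

Each returned entry pairs a text label with its x/y position in the converter's pixel units, which is the seat location to coordinate mapping the question asks for.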

Answer by jmo

Your initial request is "I have a large PDF file that is a floor map for a building."

I am afraid this might be harder than you expect.

That is because the most recent library everyone uses to parse PDFs is smalot's pdfparser, and it is known to have issues with large files.

I am also looking for a real PHP library that parses PDFs without memory peaks that force you to disable PHP's memory limit in the configuration, as a lot of "developers" do (which I think is really not advisable).

See this post for more details about smalot performance: https://github.com/smalot/pdfparser/issues/163

Answer by Mike

You might also want to try this application: http://pdfbox.apache.org/. A working example can be found at https://www.jinises.com
