Read PDF files with PHP
Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/1004478/
Asked by Ryan Doherty
I have a large PDF file that is a floor map for a building. It has layers for all the office furniture including text boxes of seat location.
My goal is to read this file with PHP, search the document for text layers, and get their contents and coordinates in the file. This way I can map seat locations to x/y coordinates.
Is there any way to do this via PHP? (Or even Ruby or Python if that's what's necessary)
Accepted answer by Jay
Check out FPDF (with FPDI):
http://www.setasign.de/products/pdf-php-solutions/fpdi/
These will let you open a PDF and add content to it in PHP. I'm guessing you can also use their functionality to search through the existing content for the values you need.
Another possible library is TCPDF: http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=tcpdf
Update to add a more modern library: PDF Parser
Answer by kasper Taeymans
There is a php library (pdfparser) that does exactly what you want.
project website
github: https://github.com/smalot/pdfparser
Demo page/api
After including pdfparser in your project, you can get all text from mypdf.pdf like so:
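<?php
// Assumes pdfparser was installed with Composer (package smalot/pdfparser).
require 'vendor/autoload.php';

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('mypdf.pdf'); // parse the whole document
$text = $pdf->getText();                // concatenated text of all pages
echo $text; // all text from mypdf.pdf
?>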
Similarly, you can get the metadata from the PDF as well as the PDF objects (for example, images).
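As a rough sketch of what reading metadata and text positions might look like: this assumes pdfparser was installed with Composer and that the installed version provides Page::getDataTm(), which returns entries shaped like [[a, b, c, d, x, y], text]. The helper name textPositions() and the file name mypdf.pdf are only illustrative, and the library calls are skipped when the library or the file is not present.

```php
<?php
// Sketch only: assumes smalot/pdfparser (Composer) and a version that
// offers Page::getDataTm(). textPositions() is a hypothetical helper.

/**
 * Turn getDataTm()-style entries ([[a, b, c, d, x, y], text]) into
 * simple records holding the text and its x/y position.
 */
function textPositions(array $dataTm): array
{
    $positions = [];
    foreach ($dataTm as $entry) {
        [$matrix, $text] = $entry; // first two elements: matrix, then text
        $positions[] = [
            'text' => trim($text),
            'x'    => (float) $matrix[4],
            'y'    => (float) $matrix[5],
        ];
    }
    return $positions;
}

// Only touch the library when it and the sample file are actually available.
if (is_file('vendor/autoload.php') && is_file('mypdf.pdf')) {
    require 'vendor/autoload.php';

    $pdf = (new \Smalot\PdfParser\Parser())->parseFile('mypdf.pdf');
    print_r($pdf->getDetails()); // document metadata as an array

    foreach ($pdf->getPages() as $number => $page) {
        foreach (textPositions($page->getDataTm()) as $p) {
            printf("page %d: %s at (%.2f, %.2f)\n",
                $number + 1, $p['text'], $p['x'], $p['y']);
        }
    }
}
```

The coordinates are in PDF units as reported by the library, so they would probably still need scaling to match whatever image or map the seats are drawn on.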
Answer by Rado
Not exactly php, but you could exec a program from php to convert the pdf to a temporary html file and then parse the resulting file with php. I've done something similar for a project of mine and this is the program I used:
The resulting HTML wraps text elements in <div> tags with absolute position coordinates. It seems like this is exactly what you are trying to do.
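A minimal sketch of the parsing step, using PHP's built-in DOM extension. The HTML shape assumed here (a style attribute with left and top values in pixels) and the helper name extractPositionedText() are assumptions for illustration; the actual output of a converter may differ and the patterns would need adjusting.

```php
<?php
// Sketch only: assumes the converter emits <div> elements whose style
// attribute carries "left:...px" and "top:...px". Real output may differ.

/**
 * Hypothetical helper: collect the text and left/top position of every
 * positioned <div> in the given HTML string.
 */
function extractPositionedText(string $html): array
{
    // Collect parser warnings internally; generated markup is rarely perfect.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        $style = $div->getAttribute('style');
        // The lookbehind avoids matching margin-left, padding-top and so on.
        if (preg_match('/(?<![\w-])left:\s*(-?[\d.]+)px/i', $style, $left)
            && preg_match('/(?<![\w-])top:\s*(-?[\d.]+)px/i', $style, $top)) {
            $results[] = [
                'text' => trim($div->textContent),
                'left' => (float) $left[1],
                'top'  => (float) $top[1],
            ];
        }
    }
    return $results;
}

// Usage with an inline sample instead of a real converted file.
$sample = '<div style="position:absolute;left:120px;top:340.5px">Seat 12</div>'
        . '<div style="position:absolute;left:200px;top:80px">Seat 13</div>';
foreach (extractPositionedText($sample) as $item) {
    printf("%s => (%.1f, %.1f)\n", $item['text'], $item['left'], $item['top']);
}
```

In a real run, the sample string would be replaced by the contents of the temporary HTML file, for example via file_get_contents().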
Answer by jmo
Your initial request is "I have a large PDF file that is a floor map for a building."
I'm afraid this might be harder than you expect.
The most recent library everyone uses to parse PDFs is smalot, and it is known to run into problems with large files.
I am also looking for a real PHP library that parses PDFs without memory peaks that require a PHP configuration change to disable the memory limit, as a lot of "developers" do (which I guess is really not advisable).
see this post for more details about smalot performance : https://github.com/smalot/pdfparser/issues/163
Answer by Mike
You might also want to try this application: http://pdfbox.apache.org/. A working example can be found at https://www.jinises.com

