Linux: How do I convert a PDF to text so I can parse that text with PHP?

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/6451626/

Date: 2020-08-05 04:41:16  Source: igfitidea

How do I convert a PDF to text so I can parse that text with PHP?

php linux pdf import

Asked by T. Brian Jones

I have PDFs that are mostly simply formatted text. I would like to parse the text with PHP. I realize that the PDF is binary so I need a utility or library to convert it to text.

Any recommendations?

Accepted answer by T. Brian Jones

I ended up using XPDF ( which includes pdftotext ). This works great and I use it in production to extract text from millions of PDFs being uploaded to our servers.
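
For anyone who wants to see roughly what a pdftotext call looks like, here is a minimal shell sketch. The function name `pdf_to_text` is made up for illustration and is not part of Xpdf. It assumes `pdftotext` is already on the PATH, and it uses the `-layout` and `-enc UTF-8` options plus `-` as the output name so the text goes to standard output.

```shell
#!/bin/sh
# Sketch only: pdf_to_text is a hypothetical helper name, not an Xpdf command.
# Prints the text of one PDF on standard output.
pdf_to_text() {
    pdf="$1"

    # Fail early if the input file is missing or unreadable.
    if [ ! -r "$pdf" ]; then
        echo "cannot read: $pdf" >&2
        return 2
    fi

    # Assumption: pdftotext has been installed somewhere on the PATH.
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found on PATH" >&2
        return 127
    fi

    # -layout tries to keep the page layout, -enc sets the output encoding,
    # and "-" as the text file name sends the result to stdout.
    pdftotext -layout -enc UTF-8 "$pdf" -
}
```

Something like `pdf_to_text report.pdf > report.txt` would then leave a plain text file that PHP can read with ordinary string functions. The same command could probably also be run straight from PHP with `shell_exec()`, passing the file name through `escapeshellarg()` first.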

Below is the install process for Linux CentOS:

  1. download version 3.03 from here: http://foolabs.com/xpdf/download.html
  2. tar -zxvf xpdfbin-linux-3.03.tar.gz ( extract tar.gz )
  3. create required directories for install ( some or all of these might exist already )
    • sudo mkdir /usr/local/man/
    • sudo mkdir /usr/local/man/man1/
    • sudo mkdir /usr/local/man/man5/
    • sudo mkdir /usr/local/etc/xpdfrc/
  4. move files from extracted folders ( cd into the folder where xpdf was just unzipped )
    • move all the executables from the bin64 directory (xpdf, pdftotext ... all the files ) to /usr/local/bin/
    • move the sample-xpdfrc file to /usr/local/etc/xpdfrc ( this can be used as is )
    • move the manual pages from the doc directory ( *.1 to /usr/local/man/man1/ & *.5 to /usr/local/man/man5/ )
  5. xpdf should be installed and ready to use
  6. you can delete the downloaded tar.gz file and the folder where it was unzipped
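
The manual steps above could be scripted roughly as follows. This is only a sketch: `install_xpdf` and its two arguments are hypothetical names, it copies files instead of moving them so the extracted folder stays intact until you delete it, and it guesses that `sample-xpdfrc` sits either at the top of the extracted folder or under `doc`. One deliberate difference from the list: it installs the sample config as a file named `xpdfrc` under `etc`, on the assumption that Xpdf looks for a file at that path rather than a directory.

```shell
#!/bin/sh
# Sketch only: install_xpdf is a hypothetical helper, not shipped with Xpdf.
# Usage: install_xpdf <extracted-folder> [install-prefix]
install_xpdf() {
    src="$1"                    # e.g. ./xpdfbin-linux-3.03
    prefix="${2:-/usr/local}"   # install root, defaults to /usr/local

    if [ ! -d "$src/bin64" ]; then
        echo "no bin64 directory under: $src" >&2
        return 1
    fi

    # mkdir -p does not complain when a directory already exists.
    mkdir -p "$prefix/bin" "$prefix/etc" \
             "$prefix/man/man1" "$prefix/man/man5" || return 1

    # Executables (xpdf, pdftotext, and the rest of bin64).
    cp "$src"/bin64/* "$prefix/bin/" || return 1

    # Manual pages: *.1 go to man1, *.5 go to man5.
    for page in "$src"/doc/*.1; do
        if [ -e "$page" ]; then cp "$page" "$prefix/man/man1/"; fi
    done
    for page in "$src"/doc/*.5; do
        if [ -e "$page" ]; then cp "$page" "$prefix/man/man5/"; fi
    done

    # Sample config, installed as a file named xpdfrc (assumption, see above).
    for rc in "$src/sample-xpdfrc" "$src/doc/sample-xpdfrc"; do
        if [ -f "$rc" ]; then
            cp "$rc" "$prefix/etc/xpdfrc"
            break
        fi
    done
    return 0
}
```

Writing to `/usr/local` normally needs root, so one option is to save this in a script that ends with a call such as `install_xpdf ./xpdfbin-linux-3.03` and run that script with sudo. Passing a second argument installs under a different prefix, which is handy for a dry run.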

Answer by DevelRoot

You can't do that with file_get_contents() because PDF files contain only binary data (no plain text). To read or modify a PDF file you can use some third-party libraries. Take a look at:

And don't forget

Answer by Benoit

Third party software can dump the text contents of a PDF file, for example:

  • xdoc2txt (Windows-only, used in WinMerge plugins)
  • pdftotext, part of Xpdf