Convert PDF to HTML in PHP?

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/14782751/

Time: 2020-08-25 07:57:46  Source: igfitidea

php

Asked by Charlie

I want to be able to convert a PDF file to an HTML file via PHP, but am running into some trouble.

I found a basic way to do this using Saaspose, which lets you convert PDFs to HTML files. There are some problems with its output, however, such as the use of SVGs, images, positioning, fonts, etc.

All I need is the ability to grab the text from the PDF file, along with any images associated with it, and then display it in a linear format rather than formatted with absolute positioning.

What I mean by this is that if the PDF looks like this:

enter image description here

I'd want to convert it to a single-column HTML file. If there were images, I'd want them returned as well.

Is this possible in PHP? I know I can simply grab the text from the PDF file, but what about grabbing images as well?

Another problem is that I want everything to be inline, as it's being served to the client in a single file. Currently, I can do this with my setup through some code:

// Collect each <embed> inside an <object> together with the SVG markup it points to.
for ($i = 0; $i < $object_number; $i++) {
    $object = $html->find("object")->find("embed")->eq($i);
    $embed = file_get_contents("Output/OutputHtml/" . $object->attr("src"));
    array_push($converted_obj, $embed);
    array_push($original_obj, $object);
}

// Replace each original element with the inline SVG markup.
for ($i = 0; $i < $object_number; $i++) {
    pq($original_obj[$i])->replaceWith($converted_obj[$i]);
}

This grabs all the SVG files and displays them inline. Images would be easier to handle, as I could use base64.
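
The base64 idea mentioned above could look roughly like the sketch below. It is only an illustration: the helper names to_data_uri and inline_images are hypothetical, it assumes the converter writes its image files into one directory, and it only handles double-quoted src attributes on img tags.

```php
<?php
// Hypothetical helper: turn raw image bytes into a data URI.
function to_data_uri(string $bytes, string $mime): string
{
    return 'data:' . $mime . ';base64,' . base64_encode($bytes);
}

// Hypothetical helper: replace <img src="file"> references with inline data URIs.
// $base_dir is assumed to be the directory holding the converter's image files.
function inline_images(string $html, string $base_dir): string
{
    // Small extension-to-MIME map; extend it as needed.
    $mimes = [
        'png'  => 'image/png',
        'jpg'  => 'image/jpeg',
        'jpeg' => 'image/jpeg',
        'gif'  => 'image/gif',
    ];

    return preg_replace_callback(
        '/(<img\b[^>]*?\bsrc=")([^"]+)(")/i',
        function (array $m) use ($base_dir, $mimes): string {
            // basename() keeps the lookup inside $base_dir.
            $path = $base_dir . '/' . basename($m[2]);
            $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
            if (!isset($mimes[$ext]) || !is_file($path)) {
                return $m[0]; // leave unknown or missing references untouched
            }
            return $m[1] . to_data_uri(file_get_contents($path), $mimes[$ext]) . $m[3];
        },
        $html
    );
}
```

Called as inline_images($page_html, "Output/OutputHtml"), this would return the same markup with each resolvable image embedded, so the page can be served as a single file.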

Accepted answer by T.Todua

1) Download the .exe file and unpack it to a folder: http://sourceforge.net/projects/pdftohtml/

2) Create a .php file and put this code in it (assuming that pdftohtml.exe and the source sample.pdf are both inside that folder):

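<?php
$source_pdf = "sample.pdf";
$output_folder = "MyFolder";

// Create the output folder if it does not exist yet.
if (!file_exists($output_folder)) {
    mkdir($output_folder, 0777, true);
}

// passthru() prints the tool's output directly; the exit status ends up in $exit_code.
$cmd = "pdftohtml " . escapeshellarg($source_pdf) . " " . escapeshellarg("$output_folder/new_file_name");
passthru($cmd, $exit_code);
var_dump($exit_code);
?>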

3) Open MyFolder and you will see the converted files (how many depends on the number of pages).

P.S. I'm not sure, but there are many commercial or trial APIs too.

Answer by hindmost

Cross-platform solution using Xpdf:

Download the appropriate package of the Xpdf tools and unpack it into a subdirectory of your script's directory. Let's assume it's called "/xpdftools".

Add code like this to your PHP script:

$pdf_file = 'sample.pdf';
$html_dir = 'htmldir';
$cmd = "xpdftools/bin32/pdftohtml $pdf_file $html_dir";

exec($cmd, $out, $ret);
echo "Exit code: $ret";

After the script runs successfully, the htmldir directory will contain the converted HTML files (each page in a separate file).
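
Since each page ends up in its own file, a single-file result needs a merge step. The sketch below is one rough way to do it: the function name merge_pages is hypothetical, and the default file pattern 'page*.html' is an assumption about how the converter names its output, so adjust it to whatever actually appears in htmldir. Pages that rely on absolute positioning will likely still need their styling reworked after merging.

```php
<?php
// Hypothetical helper: concatenate the <body> contents of per-page HTML files.
// The 'page*.html' pattern is an assumption about the converter's file naming.
function merge_pages(string $dir, string $pattern = 'page*.html'): string
{
    $files = glob($dir . '/' . $pattern) ?: [];
    natsort($files); // natural order, so page2 sorts before page10

    $parts = [];
    foreach ($files as $file) {
        $html = file_get_contents($file);
        // Keep only what sits between <body> and </body>, if present.
        if (preg_match('/<body[^>]*>(.*)<\/body>/is', $html, $m)) {
            $parts[] = trim($m[1]);
        } else {
            $parts[] = trim($html);
        }
    }
    return implode("\n", $parts);
}
```

For example, file_put_contents('combined.html', merge_pages('htmldir')) would write the merged body content to one file.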

The Xpdf tools use the following exit codes:

  • 0 - No error.
  • 1 - Error opening a PDF file.
  • 2 - Error opening an output file.
  • 3 - Error related to PDF permissions.
  • 99 - Other error.
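
The exit codes listed above can be mapped to readable messages so a failed conversion is easier to diagnose. This is a minimal sketch; the constant XPDF_EXIT_MESSAGES and the function describe_xpdf_exit are hypothetical names, and the messages are taken from the list above.

```php
<?php
// Exit codes as listed above for the Xpdf tools.
const XPDF_EXIT_MESSAGES = [
    0  => 'No error',
    1  => 'Error opening a PDF file',
    2  => 'Error opening an output file',
    3  => 'Error related to PDF permissions',
    99 => 'Other error',
];

// Hypothetical helper: translate an exit code into a message.
function describe_xpdf_exit(int $code): string
{
    return XPDF_EXIT_MESSAGES[$code] ?? "Unknown exit code $code";
}
```

After exec($cmd, $out, $ret), something like echo describe_xpdf_exit($ret); would report the outcome instead of the bare number.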

Answer by hindmost

What you are essentially looking to do is reflow the PDF file. I'm not sure a tool for this exists, and it is at best very difficult to do.

It would be possible to write some code that does what you need for your specific file, but I believe doing so for the general case would be impossible.
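
For a simple, text-heavy file, such file-specific code might start from plain extracted text. The sketch below is only an illustration under stated assumptions: the function name text_to_linear_html is hypothetical, the input is assumed to be plain text in which paragraphs are separated by blank lines, and it makes no attempt to handle columns, tables, or reading order.

```php
<?php
// Hypothetical helper: wrap blank-line-separated blocks of plain text in <p> tags.
function text_to_linear_html(string $text): string
{
    // Split on one or more blank lines.
    $blocks = preg_split('/\R\s*\R/', trim($text));

    $out = [];
    foreach ($blocks as $block) {
        // Collapse line breaks and runs of spaces inside a paragraph.
        $line = preg_replace('/\s+/', ' ', trim($block));
        if ($line !== '') {
            $out[] = '<p>' . htmlspecialchars($line) . '</p>';
        }
    }
    return implode("\n", $out);
}

// Assumed usage: the pdftotext command-line tool writes plain text to a file first.
// exec('pdftotext ' . escapeshellarg('sample.pdf') . ' ' . escapeshellarg('sample.txt'), $out, $ret);
// echo text_to_linear_html(file_get_contents('sample.txt'));
```

This yields a single-column flow of paragraphs, which is the easy part; deciding where paragraphs really begin and end in an arbitrary PDF is the hard part described above.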

I have written an article here that explains why I believe reflowing PDF is flawed: http://www.planetpdf.com/enterprise/article.asp?ContentID=PDF_Reflow_in_Microsoft_Word_2012_Is_it_any_good

Of particular interest is the paragraph beginning "Let's use a newspaper story to illustrate the problem."

You may want to look into what IDRsolutions (which, for transparency, is where I work!) has to offer.

We are currently in the process of putting our PDF to HTML5 and PDF Conversion software in the cloud: http://www.idrsolutions.com/cloud-pdf-converter/

What may be a better fit for you is the PDF text extraction and PDF image extraction functionality of JPedal. It's quite likely we will look at putting this in the cloud also, if the PDF to HTML5 goes well.

Text Extraction: http://www.idrsolutions.com/pdf-to-text-conversion/

Image Extraction: http://www.idrsolutions.com/extract-images-from-pdf/

Answer by Heather McVay

What you want to achieve with the graphic you posted is actually OCR conversion of a graphic: http://www.phpclasses.org/package/2874-PHP-Recognize-text-objects-in-graphical-images.html
