Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute the content to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/31973354/

Date: 2020-11-02 19:26:40. Source: igfitidea

Converting PDF to multipage tiff (Group 4)

Tags: java, pdf, pdfbox, tiff, icafe

Asked by Raphael Roth

I'm trying to convert PDFs, represented by the org.apache.pdfbox.pdmodel.PDDocument class, to a multipage TIFF with Group 4 compression at 300 dpi, using the icafe library (https://github.com/dragon66/icafe/). The sample code works for me at 288 dpi but, strangely, NOT at 300 dpi: the exported TIFF is just white. Does anybody have an idea what the issue is here?

The sample pdf which I use in the example is located here: http://www.bergophil.ch/a.pdf

import java.awt.image.BufferedImage;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

import cafe.image.ImageColorType;
import cafe.image.ImageParam;
import cafe.image.options.TIFFOptions;
import cafe.image.tiff.TIFFTweaker;
import cafe.image.tiff.TiffFieldEnum.Compression;
import cafe.io.FileCacheRandomAccessOutputStream;
import cafe.io.RandomAccessOutputStream;

public class Pdf2TiffConverter {
    public static void main(String[] args) {
        String pdf = "a.pdf";
        PDDocument pddoc = null;
        try {
            pddoc = PDDocument.load(pdf);
        } catch (IOException e) {
        }

        try {
            savePdfAsTiff(pddoc);
        } catch (IOException e) {
        }
    }

    private static void savePdfAsTiff(PDDocument pdf) throws IOException {
        BufferedImage[] images = new BufferedImage[pdf.getNumberOfPages()];
        for (int i = 0; i < images.length; i++) {
            PDPage page = (PDPage) pdf.getDocumentCatalog().getAllPages()
                    .get(i);
            BufferedImage image;
            try {
//              image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 288); //works
                image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 300); // does not work
                images[i] = image;
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        FileOutputStream fos = new FileOutputStream("a.tiff");
        RandomAccessOutputStream rout = new FileCacheRandomAccessOutputStream(
                fos);
        ImageParam.ImageParamBuilder builder = ImageParam.getBuilder();
        ImageParam[] param = new ImageParam[1];
        TIFFOptions tiffOptions = new TIFFOptions();
        tiffOptions.setTiffCompression(Compression.CCITTFAX4);
        builder.imageOptions(tiffOptions);
        builder.colorType(ImageColorType.BILEVEL);
        param[0] = builder.build();
        TIFFTweaker.writeMultipageTIFF(rout, param, images);
        rout.close();
        fos.close();
    }
}

Or is there another library to write multi-page TIFFs?

EDIT:

Thanks to dragon66, the bug in icafe is now fixed. In the meantime I experimented with other libraries and also with invoking ghostscript. I think ghostscript is very reliable, as it is a widely used tool; on the other hand, I have to rely on the user of my code having a ghostscript installation. Something like this:

/**
 * Converts a given pdf as specified by its path to an tiff using group 4 compression
 *
 * @param pdfFilePath The absolute path of the pdf
 * @param tiffFilePath The absolute path of the tiff to be created
 * @param dpi The resolution of the tiff
 * @throws MyException If the conversion fails
 */
private static void convertPdfToTiffGhostscript(String pdfFilePath, String tiffFilePath, int dpi) throws MyException {
    // location of gswin64c.exe
    String ghostscriptLoc = context.getGhostscriptLoc();

    // enclose src and dest. with quotes to avoid problems if the paths contain whitespaces
    pdfFilePath = "\"" + pdfFilePath + "\"";
    tiffFilePath = "\"" + tiffFilePath + "\"";

    logger.debug("invoking ghostscript to convert {} to {}", pdfFilePath, tiffFilePath);
    String cmd = ghostscriptLoc + " -dQUIET -dBATCH -o " + tiffFilePath + " -r" + dpi + " -sDEVICE=tiffg4 " + pdfFilePath;
    logger.debug("The following command will be invoked: {}", cmd);

    int exitVal = 0;
    try {
        exitVal = Runtime.getRuntime().exec(cmd).waitFor();
    } catch (Exception e) {
        logger.error("error while converting to tiff using ghostscript", e);
        throw new MyException(ErrorMessages.GHOSTSTSCRIPT_ERROR, e);
    }
    if (exitVal != 0) {
        logger.error("error while converting to tiff using ghostscript, exitval is {}", exitVal);
        throw new MyException(ErrorMessages.GHOSTSTSCRIPT_ERROR);
    }
}
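
One detail of the snippet above is worth a hedged note: Runtime.exec(String) splits the command string on whitespace and does not interpret quote characters itself, so the manual quoting only helps on platforms where the tokens are rejoined into a single command line. A minimal sketch of an alternative, assuming the same Ghostscript options as above, passes every argument as its own list element through ProcessBuilder. The class name, method names and example paths are hypothetical; only the Ghostscript flags come from the original code.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class GhostscriptCommandSketch {

    // Builds the argument list. Each element reaches the process unchanged,
    // so paths containing spaces need no extra quoting.
    static List<String> buildCommand(String ghostscriptLoc, String pdfFilePath,
                                     String tiffFilePath, int dpi) {
        return Arrays.asList(
                ghostscriptLoc,
                "-dQUIET",
                "-dBATCH",
                "-o", tiffFilePath,
                "-r" + dpi,
                "-sDEVICE=tiffg4",
                pdfFilePath);
    }

    // Runs Ghostscript and returns its exit value (0 is treated as success above).
    static int convert(String ghostscriptLoc, String pdfFilePath,
                       String tiffFilePath, int dpi)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(
                buildCommand(ghostscriptLoc, pdfFilePath, tiffFilePath, dpi))
                .inheritIO() // forward Ghostscript's output so the child cannot block on a full pipe
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Hypothetical paths, for illustration only; nothing is executed here.
        System.out.println(buildCommand("gswin64c.exe",
                "C:\\my docs\\a.pdf", "C:\\my docs\\a.tiff", 300));
    }
}
```

The exit-value check and the exception handling from the original method would wrap the call to convert in the same way as before.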

I found that the TIFF produced by ghostscript differs strongly in quality from the TIFF produced by icafe (the Group 4 TIFF from ghostscript looks greyscale-like).

Answered by dragon66

It's been a while since the question was asked, and I have finally found the time, and a wonderful ordered dither matrix, to give some details on how "icafe" can be used to get results similar to or better than calling the external ghostscript executable. Some new features were added to "icafe" recently, such as better quantization and ordered dither algorithms, which are used in the following example code.

The sample PDF I am going to use here is princeCatalogue. Most of the following code is from the OP, with some changes due to the package name change and some additional ImageParam control settings.

import java.awt.image.BufferedImage;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

import com.icafe4j.image.ImageColorType;
import com.icafe4j.image.ImageParam;
import com.icafe4j.image.options.TIFFOptions;
import com.icafe4j.image.quant.DitherMethod;
import com.icafe4j.image.quant.DitherMatrix;
import com.icafe4j.image.tiff.TIFFTweaker;
import com.icafe4j.image.tiff.TiffFieldEnum.Compression;
import com.icafe4j.io.FileCacheRandomAccessOutputStream;
import com.icafe4j.io.RandomAccessOutputStream;

public class Pdf2TiffConverter {
    public static void main(String[] args) {
        String pdf = "princecatalogue.pdf";
        PDDocument pddoc = null;
        try {
            pddoc = PDDocument.load(pdf);
        } catch (IOException e) {
        }

        try {
            savePdfAsTiff(pddoc);
        } catch (IOException e) {
        }
    }

    private static void savePdfAsTiff(PDDocument pdf) throws IOException {
        BufferedImage[] images = new BufferedImage[pdf.getNumberOfPages()];
        for (int i = 0; i < images.length; i++) {
            PDPage page = (PDPage) pdf.getDocumentCatalog().getAllPages()
                    .get(i);
            BufferedImage image;
            try {
                // render at 300 dpi (the case that produced a white image before the icafe fix)
                image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
                images[i] = image;
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        FileOutputStream fos = new FileOutputStream("a.tiff");
        RandomAccessOutputStream rout = new FileCacheRandomAccessOutputStream(
                fos);
        ImageParam.ImageParamBuilder builder = ImageParam.getBuilder();
        ImageParam[] param = new ImageParam[1];
        TIFFOptions tiffOptions = new TIFFOptions();
        tiffOptions.setTiffCompression(Compression.CCITTFAX4);
        builder.imageOptions(tiffOptions);
        builder.colorType(ImageColorType.BILEVEL)
                .ditherMatrix(DitherMatrix.getBayer8x8Diag())
                .applyDither(true)
                .ditherMethod(DitherMethod.BAYER);
        param[0] = builder.build();
        TIFFTweaker.writeMultipageTIFF(rout, param, images);
        rout.close();
        fos.close();
    }
}

For ghostscript, I used the command line directly, with the same parameters provided by the OP. Screenshots of the first page of the resulting TIFF images are shown below:

[Image: screenshots of the first page of the two TIFF outputs, side by side]

The left-hand side shows the output of "ghostscript" and the right-hand side the output of "icafe". It can be seen that, at least in this case, the output from "icafe" is better than the output from "ghostscript".

Using CCITTFAX4 compression, the file size from "ghostscript" is 2.22M and the file size from "icafe" is 2.08M. Neither is very good, given that dither is used while creating the black and white output. In fact, a different compression algorithm produces a much smaller file. For example, using LZW, the same output from "icafe" is only 634K, and with DEFLATE compression the output file size goes down to 582K.
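
A size comparison of this kind can be tried without icafe. The following is a minimal sketch, assuming Java 9 or later (which ships a TIFF plugin for javax.imageio) and assuming that plugin accepts the compression type names "CCITT T.6", "LZW" and "Deflate". It is not the icafe code path used above: it encodes a synthetic one-bit checkerboard, a dither-like pattern standing in for a real dithered page, and the class and method names are hypothetical. The sizes it prints depend on the image and are not the figures quoted above.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

public class TiffCompressionSizes {

    // Synthetic 1-bit image with a checkerboard (dither-like) pattern.
    static BufferedImage ditherLikePage(int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                img.setRGB(x, y, ((x + y) % 2 == 0) ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return img;
    }

    // Encodes one image as TIFF with the given compression type and returns
    // the encoded size in bytes, or -1 if the writer does not offer that type.
    static int encodedSize(BufferedImage image, String compressionType) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        try {
            ImageWriteParam params = writer.getDefaultWriteParam();
            params.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            if (!Arrays.asList(params.getCompressionTypes()).contains(compressionType)) {
                return -1;
            }
            params.setCompressionType(compressionType);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            // Closing the stream pushes all cached bytes into baos.
            try (ImageOutputStream ios = new MemoryCacheImageOutputStream(baos)) {
                writer.setOutput(ios);
                writer.write(null, new IIOImage(image, null, null), params);
            }
            return baos.size();
        } finally {
            writer.dispose();
        }
    }

    public static void main(String[] args) throws IOException {
        BufferedImage page = ditherLikePage(512, 512);
        for (String type : new String[] {"CCITT T.6", "LZW", "Deflate"}) {
            System.out.println(type + ": " + encodedSize(page, type) + " bytes");
        }
    }
}
```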

Answered by Tilman Hausherr

Here's some code for saving to a multipage TIFF, which I use with PDFBox. It requires the TIFFUtil class from PDFBox (it isn't public, so you have to make a copy).

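import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageTypeSpecifier;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.stream.ImageOutputStream;

// TIFFUtil is the local copy of the PDFBox class mentioned above
void saveAsMultipageTIFF(ArrayList<BufferedImage> bimTab, String filename, int dpi) throws IOException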
{
    Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
    ImageWriter imageWriter = writers.next();

    ImageOutputStream ios = ImageIO.createImageOutputStream(new File(filename));
    imageWriter.setOutput(ios);
    imageWriter.prepareWriteSequence(null);
    for (BufferedImage image : bimTab)
    {
        ImageWriteParam param = imageWriter.getDefaultWriteParam();
        IIOMetadata metadata = imageWriter.getDefaultImageMetadata(new ImageTypeSpecifier(image), param);
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        TIFFUtil.setCompressionType(param, image);
        TIFFUtil.updateMetadata(metadata, image, dpi);
        imageWriter.writeToSequence(new IIOImage(image, null, metadata), param);
    }
    imageWriter.endWriteSequence();
    imageWriter.dispose();
    ios.flush();
    ios.close();
}

I experimented with this myself some time ago, using this code: https://www.java.net/node/670205 (I used solution 2).

However...

If you create an array with lots of images, your memory consumption really goes up. So it would probably be better to render one image, add it to the TIFF file, then render the next page and drop the reference to the previous one, so that the GC can reclaim the space if needed.
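
The render-then-write loop described here can be sketched with the JDK alone. This is a minimal illustration under stated assumptions, not the PDFBox code from this answer: it assumes Java 9 or later for the built-in TIFF writer, and renderPage is a hypothetical stand-in that draws a synthetic page where a real program would call its PDF renderer. The class and method names are made up for the example.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;

public class StreamingTiffSketch {

    // Hypothetical stand-in for a real page renderer: a white page with one black bar.
    static BufferedImage renderPage(int pageIndex) {
        BufferedImage image = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 200, 100);
        g.setColor(Color.BLACK);
        g.fillRect(10 + 20 * pageIndex, 10, 15, 80);
        g.dispose();
        return image;
    }

    // Writes the pages one at a time, so only a single page image is referenced at once.
    static void writePages(File target, int pageCount) throws IOException {
        Files.deleteIfExists(target.toPath()); // avoid leftovers from an older, longer file
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        try (ImageOutputStream ios = ImageIO.createImageOutputStream(target)) {
            writer.setOutput(ios);
            writer.prepareWriteSequence(null);
            for (int i = 0; i < pageCount; i++) {
                BufferedImage page = renderPage(i);
                writer.writeToSequence(new IIOImage(page, null, null), null);
                // 'page' goes out of scope here and becomes eligible for garbage collection
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
    }

    // Reads the file back and reports how many images it contains.
    static int countPages(File tiff) throws IOException {
        try (ImageInputStream iis = ImageIO.createImageInputStream(tiff)) {
            ImageReader reader = ImageIO.getImageReaders(iis).next();
            try {
                reader.setInput(iis);
                return reader.getNumImages(true);
            } finally {
                reader.dispose();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("streaming-sketch", ".tiff");
        writePages(out, 3);
        System.out.println("pages written: " + countPages(out));
        out.delete();
    }
}
```

The compression and metadata handling from saveAsMultipageTIFF would go where the sketch passes null for the write parameters and the image metadata.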

Answered by James

Since some of the dependencies used by the solutions to this problem no longer look maintained, I worked out a solution using the latest version (2.0.16) of pdfbox:

ByteArrayOutputStream imageBaos = new ByteArrayOutputStream();
ImageOutputStream output = ImageIO.createImageOutputStream(imageBaos);
ImageWriter writer = ImageIO.getImageWritersByFormatName("TIFF").next();

try (final PDDocument document = PDDocument.load(new File("/tmp/tmp.pdf"))) {

    PDFRenderer pdfRenderer = new PDFRenderer(document);

    int pageCount = document.getNumberOfPages();

    writer.setOutput(output);

    ImageWriteParam params = writer.getDefaultWriteParam();

    params.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);

    // Compression: None, PackBits, ZLib, Deflate, LZW, JPEG and CCITT
    // variants allowed
    params.setCompressionType("Deflate");

    writer.prepareWriteSequence(null);

    for (int page = 0; page < pageCount; page++) {
        // DPI is a constant defined elsewhere in the class (not shown here).
        // Each page is written as soon as it is rendered, so no array of images is kept.
        BufferedImage image = pdfRenderer.renderImageWithDPI(page, DPI, ImageType.RGB);
        IIOMetadata metadata = writer.getDefaultImageMetadata(new ImageTypeSpecifier(image), params);
        writer.writeToSequence(new IIOImage(image, null, metadata), params);
    }

    System.out.println("imageBaos size: " + imageBaos.size());
    // Finished write to output

    writer.endWriteSequence();

    // the document is closed by try-with-resources
} catch (IOException e) {
    e.printStackTrace();
    throw new Exception(e);
} finally {
    // avoid memory leaks
    writer.dispose();
}

Then you can write imageBaos to a local file. But if, like me, you want to put the image into a ByteArrayOutputStream and return it to the calling method, some further steps are needed.

After processing is done, the image bytes are available in the ImageOutputStream output object. We need to move the offset to the beginning of the output object, then read the bytes and write them to a new ByteArrayOutputStream. A concise way looks like this:

ByteArrayOutputStream bos = new ByteArrayOutputStream();
// move the offset back to the beginning of the output object, as described above
output.seek(0);
while (true) {
    try {
        bos.write(output.readByte());
    } catch (EOFException e) {
        System.out.println("End of Image Stream");
        break;
    } catch (IOException e) {
        System.out.println("Error processing the Image Stream");
        break;
    }
}
return bos;

Alternatively, you can simply call ImageOutputStream.flush() at the end so that imageBaos holds the bytes, and then return it.
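
The in-memory route can be sketched with the JDK alone. This is a minimal illustration, assuming Java 9 or later for the built-in TIFF plugin. It closes the image stream before reading the byte array rather than only flushing it, which is my own choice for the sketch and not something the answer prescribes. The class and method names are hypothetical, and blank BufferedImage pages stand in for rendered PDF pages.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

public class InMemoryTiffSketch {

    // Encodes all pages into one multipage TIFF held in memory.
    static byte[] toTiffBytes(List<BufferedImage> pages) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ImageOutputStream ios = new MemoryCacheImageOutputStream(baos)) {
            writer.setOutput(ios);
            writer.prepareWriteSequence(null);
            for (BufferedImage page : pages) {
                writer.writeToSequence(new IIOImage(page, null, null), null);
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
        // The image stream is closed at this point, so baos holds every byte.
        return baos.toByteArray();
    }

    // Decodes the bytes again and reports how many images they contain.
    static int countPages(byte[] tiff) throws IOException {
        ImageReader reader = ImageIO.getImageReadersByFormatName("tiff").next();
        try (ImageInputStream iis =
                     new MemoryCacheImageInputStream(new ByteArrayInputStream(tiff))) {
            reader.setInput(iis);
            return reader.getNumImages(true);
        } finally {
            reader.dispose();
        }
    }

    public static void main(String[] args) throws IOException {
        List<BufferedImage> pages = Arrays.asList(
                new BufferedImage(64, 32, BufferedImage.TYPE_INT_RGB),
                new BufferedImage(64, 32, BufferedImage.TYPE_INT_RGB));
        byte[] tiff = toTiffBytes(pages);
        System.out.println(tiff.length + " bytes, " + countPages(tiff) + " pages");
    }
}
```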

Answered by Jasper Lankhorst

Inspired by Yusaku's answer, I made my own version. It can convert multiple PDF pages to a byte array.

I used pdfbox 2.0.16 in combination with imageio-tiff 3.4.2.

//PDF converter to tiff toolbox method.
private byte[] bytesToTIFF(@Nonnull byte[] in) {

        int dpi = 300;
        ImageWriter writer = ImageIO.getImageWritersByFormatName("TIFF").next();

        try (ByteArrayOutputStream imageBaos = new ByteArrayOutputStream(255);
             PDDocument document = PDDocument.load(in)) {

            try (ImageOutputStream ios = ImageIO.createImageOutputStream(imageBaos)) {
                writer.setOutput(ios);
                writer.prepareWriteSequence(null);

                PDFRenderer pdfRenderer = new PDFRenderer(document);
                ImageWriteParam params = writer.getDefaultWriteParam();

                for (int page = 0; page < document.getNumberOfPages(); page++) {
                    BufferedImage image = pdfRenderer.renderImageWithDPI(page, dpi, ImageType.RGB);
                    IIOMetadata metadata = writer.getDefaultImageMetadata(new ImageTypeSpecifier(image), params);
                    writer.writeToSequence(new IIOImage(image, null, metadata), params);
                }

                writer.endWriteSequence();
            } // closing ios pushes any cached bytes into imageBaos

            LOG.trace("size found: {}", imageBaos.size());

            return imageBaos.toByteArray();

        } catch (Exception ex) {
            LOG.warn("can't instantiate the bytesToTiff method with: PDF", ex);
            // every path must return a value; an empty array signals failure here
            return new byte[0];
        } finally {
            writer.dispose();
        }
}

Answered by Shishir Mane

Refer to my GitHub code for an implementation with PDFBox.
