How can I remove all images/drawings from a PDF file and leave only the text, in Java?

Question

Asked by Maurício Linhares

I have a PDF file that's the output of an OCR processor. The processor recognizes the image and adds the text to the PDF, but at the end it places a low-quality image in the file instead of the original one (I have no idea why anyone would do that, but they do).

So I would like to take this PDF, remove the image stream, and leave the text alone, so that I can import it (using iText's page-importing feature) into a PDF I'm creating myself with the real image.

And before someone asks: I have already tried using another tool (JPedal) to extract the text coordinates, but when I draw the text on my PDF it isn't at the same position as in the original.

I'd rather have this done in Java, but if another tool can do it better, just let me know. Removing only the images would be enough; I can live with a PDF that still has the drawings in it.

Answer 1

Answered by IceGlow

I used Apache PDFBox in a similar situation.

To be a little more specific, try something like this:

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import java.io.IOException;

public class Main {
    public static void main(String[] argv) throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
        PDDocument document = PDDocument.load("input.pdf");

        if (document.isEncrypted()) {
            document.decrypt("");
        }

        PDDocumentCatalog catalog = document.getDocumentCatalog();
        for (Object pageObj :  catalog.getAllPages()) {
            PDPage page = (PDPage) pageObj;
            PDResources resources = page.findResources();
            resources.getImages().clear();
        }

        document.save("strippedOfImages.pdf");
        // Release the file handles and buffers held by the document.
        document.close();
    }
}

It's supposed to remove all types of images (PNG, JPEG, ...).

Answer 2

Answered by paf.goncalves

You need to parse the document as follows:

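// Imports for the PDFBox 1.x API used below.
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFOperator;
import java.util.ArrayList;
import java.util.List;

public static void strip(String pdfFile, String pdfFileOut) throws Exception {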

    PDDocument doc = PDDocument.load(pdfFile);

    List pages = doc.getDocumentCatalog().getAllPages();
    for( int i=0; i<pages.size(); i++ ) {
        PDPage page = (PDPage)pages.get( i );

        // added
        COSDictionary newDictionary = new COSDictionary(page.getCOSDictionary());

        PDFStreamParser parser = new PDFStreamParser(page.getContents());
        parser.parse();
        List tokens = parser.getTokens();
        List newTokens = new ArrayList();
        for(int j=0; j<tokens.size(); j++) {
            Object token = tokens.get( j );

            if( token instanceof PDFOperator ) {
                PDFOperator op = (PDFOperator)token;
                if( op.getOperation().equals( "Do") ) {
                    //remove the one argument to this operator
                    // added
                    COSName name = (COSName)newTokens.remove( newTokens.size() -1 );
                    // added
                    deleteObject(newDictionary, name);
                    continue;
                }
            }
            newTokens.add( token );
        }
        PDStream newContents = new PDStream( doc );
        ContentStreamWriter writer = new ContentStreamWriter( newContents.createOutputStream() );
        writer.writeTokens( newTokens );
        newContents.addCompression();

        page.setContents( newContents );

        // added
        PDResources newResources = new PDResources(newDictionary);
        page.setResources(newResources);
    }

    doc.save(pdfFileOut);
    doc.close();
}


// added
public static boolean deleteObject(COSDictionary d, COSName name) {
    for(COSName key : d.keySet()) {
        if( name.equals(key) ) {
            d.removeItem(key);
            return true;
        }
        COSBase object = d.getDictionaryObject(key); 
        if(object instanceof COSDictionary) {
            if( deleteObject((COSDictionary)object, name) ) {
                return true;
            }
        }
    }
    return false;
}
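
To make the two steps in the code above easier to follow, here is a minimal sketch that models them with plain Java collections instead of PDFBox. It is only an illustration under stated assumptions: content-stream tokens are represented as strings, dictionaries as nested maps, and the names `StripSketch`, `stripDoOperators` and `deleteEntry` are made up for this example and are not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** PDFBox-free model of strip(): token filtering plus resource cleanup. */
final class StripSketch {

    /**
     * Copies the tokens, skipping every "Do" operator and the single operand
     * (the XObject name) right before it. Skipped names go into removedNames.
     */
    static List<String> stripDoOperators(List<String> tokens, List<String> removedNames) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.equals("Do")) {
                // The operand was already copied; take it back out.
                if (!kept.isEmpty()) {
                    removedNames.add(kept.remove(kept.size() - 1));
                }
                continue; // drop the operator itself
            }
            kept.add(token);
        }
        return kept;
    }

    /** Depth-first search that removes the first entry whose key equals name. */
    @SuppressWarnings("unchecked")
    static boolean deleteEntry(Map<String, Object> dict, String name) {
        // Iterate over a copy of the keys so removal cannot disturb the loop.
        for (String key : new ArrayList<>(dict.keySet())) {
            if (key.equals(name)) {
                dict.remove(key);
                return true;
            }
            Object value = dict.get(key);
            if (value instanceof Map && deleteEntry((Map<String, Object>) value, name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical page content: paint one image, then show some text.
        List<String> tokens = Arrays.asList(
                "q", "/Im0", "Do", "Q",
                "BT", "/F1", "12", "Tf", "(Hello)", "Tj", "ET");

        // Hypothetical page dictionary with nested resource dictionaries.
        Map<String, Object> xobjects = new LinkedHashMap<>();
        xobjects.put("Im0", "image stream");
        Map<String, Object> fonts = new LinkedHashMap<>();
        fonts.put("F1", "font");
        Map<String, Object> resources = new LinkedHashMap<>();
        resources.put("Font", fonts);
        resources.put("XObject", xobjects);
        Map<String, Object> page = new LinkedHashMap<>();
        page.put("Resources", resources);

        List<String> removed = new ArrayList<>();
        List<String> kept = stripDoOperators(tokens, removed);
        for (String name : removed) {
            deleteEntry(page, name.substring(1)); // drop the leading '/'
        }
        System.out.println(kept);
        System.out.println(page);
    }
}
```

In this model the text operators (`BT` ... `ET`) and the font entry survive, while the `Do` call and the `Im0` entry are gone. Note that `Do` paints any XObject, so a filter like this one also drops form XObjects, not only images.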
