How can I remove all images/drawings from a PDF file and keep only the text in Java?

Note: This page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/6831194/

Time: 2020-10-30 17:28:19  Source: igfitidea

How can I remove all images/drawings from a PDF file and leave text only in Java?

Tags: java, pdf, itext

Asked by Maurício Linhares

I have a PDF file that is the output of an OCR processor. The processor recognizes the image and adds the text to the PDF, but in the end it places a low-quality image in the file instead of the original one (I have no idea why anyone would do that, but they do).

So I would like to take this PDF, remove the image stream, and leave the text alone, so that I can import it (using iText's page-importing feature) into a PDF I'm creating myself with the real image.

And before someone asks: I have already tried another tool (JPedal) to extract the text coordinates, but when I draw the text on my PDF it isn't at the same position as in the original.

I'd rather have this done in Java, but if another tool can do it better, just let me know. Removing only the images would be enough; I can live with a PDF that still has the drawings in it.

Answered by IceGlow

I used Apache PDFBox in a similar situation.

To be a little more specific, try something like this:

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import java.io.IOException;

public class Main {
    public static void main(String[] argv) throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
        PDDocument document = PDDocument.load("input.pdf");
        try {
            // Decrypt with an empty password if the file is encrypted.
            if (document.isEncrypted()) {
                document.decrypt("");
            }

            // Clear the image map of every page's resources.
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj : catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                PDResources resources = page.findResources();
                resources.getImages().clear();
            }

            document.save("strippedOfImages.pdf");
        } finally {
            // Release the resources held by the document, even if saving fails.
            document.close();
        }
    }
}

It is supposed to remove images of all types (PNG, JPEG, ...). It should work like this:

Sample article.
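As a rough illustration of why the source format (PNG, JPEG, ...) does not matter here: inside a PDF, every raster image is stored as an XObject whose /Subtype is /Image, whatever filter compresses it. The sketch below models a page's /XObject dictionary with plain Java maps and drops the image entries. The class name, the map layout, and the sample entries are assumptions made for illustration only; this is not PDFBox API.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java model of a page's /XObject dictionary (hypothetical, not PDFBox). */
final class ImageResourceSketch {

    /** Removes every entry whose Subtype is "Image" and returns how many were removed. */
    static int removeImages(Map<String, Map<String, String>> xObjects) {
        int removed = 0;
        Iterator<Map.Entry<String, Map<String, String>>> it = xObjects.entrySet().iterator();
        while (it.hasNext()) {
            Map<String, String> xObject = it.next().getValue();
            if ("Image".equals(xObject.get("Subtype"))) {
                it.remove(); // safe removal while iterating
                removed++;
            }
        }
        return removed;
    }

    /** Sample data: two images with different filters and one form XObject. */
    static Map<String, Map<String, String>> sampleXObjects() {
        Map<String, Map<String, String>> xObjects = new LinkedHashMap<>();
        xObjects.put("Im0", entry("Image", "DCTDecode"));   // JPEG-compressed image
        xObjects.put("Im1", entry("Image", "FlateDecode")); // Flate-compressed image
        xObjects.put("Fm0", entry("Form", "FlateDecode"));  // form XObject, kept
        return xObjects;
    }

    private static Map<String, String> entry(String subtype, String filter) {
        Map<String, String> dict = new LinkedHashMap<>();
        dict.put("Subtype", subtype);
        dict.put("Filter", filter);
        return dict;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> xObjects = sampleXObjects();
        int removed = removeImages(xObjects);
        // prints "2 removed, remaining: [Fm0]"
        System.out.println(removed + " removed, remaining: " + xObjects.keySet());
    }
}
```

Note that this only models the resource dictionary. The page's content stream may still contain operators that refer to the removed names.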

Answered by paf.goncalves

You need to parse the document as follows:

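import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFOperator;

// Rewrites every page's content stream without the "Do" operators (PDFBox 1.x API).
public static void strip(String pdfFile, String pdfFileOut) throws Exception {

    PDDocument doc = PDDocument.load(pdfFile);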

    List pages = doc.getDocumentCatalog().getAllPages();
    for( int i=0; i<pages.size(); i++ ) {
        PDPage page = (PDPage)pages.get( i );

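        // added: shallow copy of the page's resource dictionary, so that the
        // dictionary passed to PDResources below really is a resource dictionary
        COSDictionary newDictionary = new COSDictionary(page.findResources().getCOSDictionary());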

        PDFStreamParser parser = new PDFStreamParser(page.getContents());
        parser.parse();
        List tokens = parser.getTokens();
        List newTokens = new ArrayList();
        for(int j=0; j<tokens.size(); j++) {
            Object token = tokens.get( j );

            if( token instanceof PDFOperator ) {
                PDFOperator op = (PDFOperator)token;
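                if( op.getOperation().equals( "Do") ) {
                    // "Do" paints an XObject: skip the operator and remove its
                    // single name operand, which is the last token copied so far
                    COSName name = (COSName)newTokens.remove( newTokens.size() -1 );
                    // added: also delete the entry with that name from the copied dictionary
                    deleteObject(newDictionary, name);
                    continue;
                }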
                }
            }
            newTokens.add( token );
        }
        PDStream newContents = new PDStream( doc );
        ContentStreamWriter writer = new ContentStreamWriter( newContents.createOutputStream() );
        writer.writeTokens( newTokens );
        newContents.addCompression();

        page.setContents( newContents );

        // added
        PDResources newResources = new PDResources(newDictionary);
        page.setResources(newResources);
    }

    doc.save(pdfFileOut);
    doc.close();
}


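// added: removes the first entry named `name` found in d or in any nested
// dictionary; returns true if an entry was removed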
public static boolean deleteObject(COSDictionary d, COSName name) {
    for(COSName key : d.keySet()) {
        if( name.equals(key) ) {
            d.removeItem(key);
            return true;
        }
        COSBase object = d.getDictionaryObject(key); 
        if(object instanceof COSDictionary) {
            if( deleteObject((COSDictionary)object, name) ) {
                return true;
            }
        }
    }
    return false;
}
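The two steps of the code above (drop each "Do" operator together with its name operand, then delete the named entry from a nested dictionary) can be sketched without PDFBox. This is a minimal model under stated assumptions: tokens are plain strings, dictionaries are nested maps, and the names `DoOperatorSketch`, `stripDo`, and `deleteKey` are hypothetical. Keep in mind that "Do" paints any XObject, so this approach removes form XObjects as well as images.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the token filtering and dictionary cleanup (hypothetical, not PDFBox). */
final class DoOperatorSketch {

    /** Copies the tokens, skipping each "Do" and the operand before it; operands go to removedNames. */
    static List<String> stripDo(List<String> tokens, List<String> removedNames) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if ("Do".equals(token)) {
                if (!kept.isEmpty()) {
                    // the operand of "Do" is the last token copied so far
                    removedNames.add(kept.remove(kept.size() - 1));
                }
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    /** Removes the key from dict or, failing that, from the first nested map that contains it. */
    @SuppressWarnings("unchecked")
    static boolean deleteKey(Map<String, Object> dict, String name) {
        if (dict.containsKey(name)) {
            dict.remove(name);
            return true;
        }
        for (Object value : dict.values()) {
            if (value instanceof Map && deleteKey((Map<String, Object>) value, name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "q", "/Im0", "Do", "Q", "BT", "/F1", "12", "Tf", "(Hello)", "Tj", "ET");
        List<String> removed = new ArrayList<>();
        List<String> kept = stripDo(tokens, removed);

        Map<String, Object> fonts = new LinkedHashMap<>();
        fonts.put("F1", "font");
        Map<String, Object> xObjects = new LinkedHashMap<>();
        xObjects.put("Im0", "image stream");
        Map<String, Object> resources = new LinkedHashMap<>();
        resources.put("Font", fonts);
        resources.put("XObject", xObjects);

        for (String name : removed) {
            deleteKey(resources, name.substring(1)); // drop the leading '/'
        }
        System.out.println(kept);      // [q, Q, BT, /F1, 12, Tf, (Hello), Tj, ET]
        System.out.println(resources); // {Font={F1=font}, XObject={}}
    }
}
```

The text-showing tokens between BT and ET pass through untouched, which is why the text keeps its original position.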