Java: Apache Tika extract scanned PDF files

Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/32354209/

Time: 2020-11-02 20:04:38  Source: igfitidea

Apache Tika extract scanned PDF files

Tags: java, pdf, ocr, tesseract, apache-tika

Asked by LorisBachert

I'm having some trouble using Apache Tika (version 1.10). I have some PDF files that are just scanned pieces of paper, which means each page is just an image. My goal is to extract the text from these PDF files anyway.

My Tesseract installation is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (don't mind the missing exception handling):

public String extractText(InputStream stream)
        throws IOException, SAXException, TikaException {
    // AutoDetectParser picks the concrete parser from the detected media type
    AutoDetectParser parser = new AutoDetectParser();
    // Integer.MAX_VALUE raises the handler's default character limit
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    parser.parse(stream, handler, metadata, context);
    return handler.toString();
}

I searched a lot but didn't find any solution that works for me. I already tried the setExtractInlineImages method of the PDFParserConfig class, but this didn't change a thing. Extracting embedded documents with a custom ParsingEmbeddedDocumentExtractor did extract the embedded resources of a doc file, but not those of my PDF files.

It would be awesome if any of you could provide some help :)

Answered by LorisBachert

Tim Allison provided the solution:

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);

ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(Parser.class, parser); //need to add this to make sure recursive parsing happens!

parser.parse(stream, handler, new Metadata(), parseContext);

This works for me :)

EDIT: Here is the complete solution:

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.FileInputStream;
import java.io.IOException;

/**
 * @since 8/26/16
 */
public class Sample {
    public static void main(String[] args)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        //need to add this to make sure recursive parsing happens!
        parseContext.set(Parser.class, parser);

        Metadata metadata = new Metadata();
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream stream = new FileInputStream("samplepdf.pdf")) {
            parser.parse(stream, handler, metadata, parseContext);
        }
        System.out.println(metadata);
        String content = handler.toString();
        System.out.println("===============");
        System.out.println(content);
        System.out.println("Done");
    }
}

Maven Dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>
</dependencies>
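
Tika's OCR support calls the tesseract command-line program, so if a scanned PDF still yields no text with the setup above, it may be worth confirming that the executable can be launched from the JVM at all. The following is only a minimal sketch using the JDK (Java 9+ for Redirect.DISCARD). The class name TesseractCheck and the method canLaunch are made up for illustration, and it assumes the binary is called "tesseract" and is on the PATH:

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

class TesseractCheck {

    // Hypothetical helper (not part of Tika): returns true if the command
    // can be started and terminates within 10 seconds.
    static boolean canLaunch(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // drop the output
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();                            // hung process counts as failure
                return false;
            }
            return true;
        } catch (IOException e) {
            // Thrown when the executable is missing or cannot be started
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the binary name is "tesseract" and it is on the PATH
        System.out.println("tesseract launchable: " + canLaunch("tesseract", "--version"));
    }
}
```

If this prints false while tesseract works in a terminal, the JVM process probably sees a different PATH than the shell does.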