Java: extracting text from scanned PDF files with Apache Tika

Question

Asked by LorisBachert

I'm having some trouble using Apache Tika (version 1.10). I have some PDF files that are just scanned pieces of paper, which means each page is just an image. My goal is to extract the text from these PDF files anyway.

Tesseract is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (never mind the missing exception handling):

public String extractText(InputStream stream) {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    parser.parse(stream, handler, metadata, context);
    String text = handler.toString();
    return text;
}

I searched a lot but didn't find any solution that works for me. I already tried the setExtractInlineImages method of the PDFParserConfig class, but it didn't change anything. Extracting embedded documents with a custom ParsingEmbeddedDocumentExtractor did extract the embedded resources of a DOC file, but not those of my PDF files.

It would be awesome if any of you could provide some help :)

Answer 1

Answered by LorisBachert

Tim Allison provided the solution:

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);

ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(Parser.class, parser); //need to add this to make sure recursive parsing happens!

parser.parse(stream, handler, new Metadata(), parseContext);

This works for me :)
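
A simplified model of why the parseContext.set(Parser.class, parser) line seems to matter. It assumes that embedded resources (such as the page images of a scanned PDF) are handed to whatever parser is registered in the context, and that an empty parser is used when none is registered. This is a toy sketch using only the JDK, not Tika's actual code; ToyParser, ToyDocument, ToyContext, EMPTY and CONTAINER are hypothetical names made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class RecursiveParsingSketch {

    /** Hypothetical stand-in for a parser: turns a document into text. */
    interface ToyParser {
        String parse(ToyDocument doc, ToyContext context);
    }

    /** A document with its own text plus embedded children (e.g. page images). */
    static final class ToyDocument {
        final String text;
        final ToyDocument[] embedded;

        ToyDocument(String text, ToyDocument... embedded) {
            this.text = text;
            this.embedded = embedded;
        }
    }

    /** Type-keyed lookup with a fallback, loosely modeled on the idea of a parse context. */
    static final class ToyContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** Fallback used when no parser is registered: produces no text at all. */
    static final ToyParser EMPTY = (doc, context) -> "";

    /** Emits the document's own text, then passes each child to the context's parser. */
    static final ToyParser CONTAINER = (doc, context) -> {
        StringBuilder out = new StringBuilder(doc.text);
        ToyParser delegate = context.get(ToyParser.class, EMPTY);
        for (ToyDocument child : doc.embedded) {
            out.append(delegate.parse(child, context));
        }
        return out.toString();
    };

    /** Parses a "scanned PDF" whose only text lives in its embedded pages. */
    public static String extract(boolean registerParser) {
        ToyDocument scannedPdf = new ToyDocument("",
                new ToyDocument("page 1 text "),
                new ToyDocument("page 2 text"));
        ToyContext context = new ToyContext();
        if (registerParser) {
            // Counterpart of parseContext.set(Parser.class, parser) in this toy model.
            context.set(ToyParser.class, CONTAINER);
        }
        return CONTAINER.parse(scannedPdf, context);
    }

    public static void main(String[] args) {
        System.out.println("without parser: [" + extract(false) + "]");
        System.out.println("with parser:    [" + extract(true) + "]");
    }
}
```

In this model the outer document has no text of its own, so leaving the parser out of the context yields an empty result, which matches the symptom described in the question.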

EDIT: Here is the complete solution:

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.FileInputStream;
import java.io.IOException;

/**
 * @since 8/26/16
 */
public class Sample {
    public static void main(String[] args)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        //need to add this to make sure recursive parsing happens!
        parseContext.set(Parser.class, parser);

        Metadata metadata = new Metadata();
        // try-with-resources closes the file even if parsing throws
        try (FileInputStream stream = new FileInputStream("samplepdf.pdf")) {
            parser.parse(stream, handler, metadata, parseContext);
        }
        System.out.println(metadata);
        String content = handler.toString();
        System.out.println("===============");
        System.out.println(content);
        System.out.println("Done");
    }
}

Maven Dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>
</dependencies>
