Java: extract text from PDF files

Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow. Original: http://stackoverflow.com/questions/4026614/

Date: 2020-10-30 04:27:05  Source: igfitidea

extract text from pdf files

java parsing pdf itext

Asked by Rim

I need to extract text (word by word) from a pdf file.


import java.io.*;

import com.itextpdf.text.*;
import com.itextpdf.text.pdf.*;
import com.itextpdf.text.pdf.parser.*;

public class pdf {

    private static String INPUTFILE = "http://ontology.buffalo.edu/ontology%28PIC%29.pdf";
    private static String OUTPUTFILE = "c:/new3.pdf";

    public static void main(String[] args) throws DocumentException,
            IOException {

        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document,
                new FileOutputStream(OUTPUTFILE));
        document.open();

        PdfReader reader = new PdfReader(INPUTFILE);
        int n = reader.getNumberOfPages();
        PdfImportedPage page;

        // Go through all pages
        for (int i = 1; i <= n; i++) {
            page = writer.getImportedPage(reader, i);
            System.out.println(i);
            Image instance = Image.getInstance(page);
            document.add(instance);
        }
        document.close();

        PdfReader readerN = new PdfReader(OUTPUTFILE);
        PdfTextExtractor parser = new PdfTextExtractor();
        for (int i = 1; i <= n; i++)
            System.out.println(parser.getTextFromPage(reader, i));
    }
}

When I compile the code, I get this error:

the constructor PdfTextExtractor is undefined

How do I fix this?

Answered by Woot4Moo

PdfTextExtractor (iText) contains only static methods, and its constructor is private.

You can call it like so:
String myLine = PdfTextExtractor.getTextFromPage(reader, pageNumber);
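
The pattern this answer describes is plain Java rather than anything PDF-specific: when a class has only a private constructor, code outside it cannot use `new`, and its static methods are called on the class name. A minimal sketch using only the JDK; `TextTools`, `firstLine`, and `StaticCallDemo` are hypothetical names for illustration and are not part of iText:

```java
// Hypothetical utility class (not part of iText) that follows the pattern
// described above: a private constructor plus static methods only.
final class TextTools {

    // Private constructor: "new TextTools()" outside this class does not compile.
    private TextTools() {
    }

    // Static method: called on the class name, no instance needed.
    static String firstLine(String text) {
        int end = text.indexOf('\n');
        return end < 0 ? text : text.substring(0, end);
    }
}

class StaticCallDemo {
    public static void main(String[] args) {
        // TextTools tools = new TextTools();   // rejected by the compiler
        String line = TextTools.firstLine("page one\npage two");
        System.out.println(line);
    }
}
```

The same shape applies to the call shown in the answer: no extractor object is created, and the reader and page number are passed as arguments to the static method.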

Answered by Chandubabu

If you want to get all the text from the PDF file and save it to a text file, you can use the code below.

Use the pdfutil.jar library.

import java.io.IOException;
import java.io.PrintWriter;

import com.testautomationguru.utility.PDFUtil;

public class PDFToText {

    public static void main(String[] args) {

        // Backslashes in Java string literals must be escaped.
        String pdfFilePath = "C:\\abc.pdf";
        PDFUtil pdfUtil = new PDFUtil();

        try {
            String content = pdfUtil.getText(pdfFilePath);

            // try-with-resources closes the writer even if writing fails.
            try (PrintWriter out = new PrintWriter("C:\\abc.txt")) {
                out.println(content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
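
The saving step does not depend on which PDF library produced the text. A minimal sketch using only the JDK, assuming the extracted text is already held in a String; `TextSaver` and `save` are hypothetical names. It states the charset explicitly instead of relying on the platform default, which matters when the PDF contains non-ASCII text:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of pdfutil.jar.
final class TextSaver {

    private TextSaver() {
    }

    // Writes the text to the target file as UTF-8, creating the file if needed
    // and replacing any existing content.
    static void save(String content, Path target) throws IOException {
        // try-with-resources closes the writer whether or not the write succeeds.
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(content);
        }
    }
}
```

A call might look like `TextSaver.save(content, java.nio.file.Paths.get("abc.txt"))`, where the file name is only a placeholder.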

Answered by Bae Cheol Shin

// Try Apache PDFBox (the imports below match the PDFBox 1.x package layout)
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfBoxExtract
{
    public static void main(String[] args) throws IOException
    {
        // Your PDF file
        String filePath = "";
        InputStream inputStream = null;
        PDDocument pdDoc = null;

        try
        {
            inputStream = new FileInputStream(filePath);
            PDFParser parser = new PDFParser(inputStream);

            // This will parse the stream and populate the COSDocument object.
            parser.parse();

            // Get the document that was parsed.
            COSDocument cosDoc = parser.getDocument();

            // This class will take a PDF document and strip out all of the text,
            // ignoring the formatting and such.
            PDFTextStripper pdfStripper = new PDFTextStripper();

            // This is the in-memory representation of the PDF document.
            pdDoc = new PDDocument(cosDoc);
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(pdDoc.getNumberOfPages());

            // This will return the text of the document.
            String statementPDF = pdfStripper.getText(pdDoc);
            System.out.println(statementPDF);
        }
        catch (Exception e)
        {
            String errorMessage = "\nUnexpected Exception: " + e.getClass() + "\n" + e.getMessage();
            for (StackTraceElement trace : e.getStackTrace())
            {
                errorMessage += "\n\t" + trace;
            }
            System.err.println(errorMessage);
        }
        finally
        {
            // Release the document first, then the underlying stream.
            if (pdDoc != null)
            {
                pdDoc.close();
            }
            if (inputStream != null)
            {
                inputStream.close();
            }
        }
    }
}
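
The question asks for the text word by word, while the extractors in these answers return a page or a whole document as one String. A minimal sketch of the remaining step using only the JDK; `WordSplitter` and `words` are hypothetical names, and treating any run of whitespace as a word separator is an assumption (punctuation stays attached to the neighbouring word):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText, pdfutil.jar, or PDFBox.
final class WordSplitter {

    private WordSplitter() {
    }

    // Splits extracted text into words on runs of whitespace.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        // \s+ matches one or more whitespace characters, including line breaks.
        for (String token : text.trim().split("\\s+")) {
            // Blank input yields a single empty token, which is skipped here.
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }
}
```

Passing the String returned by any of the extractors to `WordSplitter.words` gives a list that can be iterated one word at a time. Text extracted from PDFs can contain hyphenated line breaks or unusual spacing, so real documents may need extra cleanup beyond this split.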