Java: reading a PDF file using Java

Question

Asked by Rim

I want to parse PDF files that are hosted on websites.

Can anyone tell me how to extract all the words (word by word) from a PDF file using Java?

The code below extracts the content of a PDF file and writes it to another PDF file. I want the program to write it to a text file instead.

import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.*;
import com.itextpdf.text.pdf.*;

public class pdf {

    private static String INPUTFILE = "http://www.britishcouncil.org/learning-infosheets-medicine.pdf";
    private static String OUTPUTFILE = "c:/new3.pdf";

    public static void main(String[] args) throws DocumentException,
            IOException {

        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document,
                new FileOutputStream(OUTPUTFILE));
        document.open();

        PdfReader reader = new PdfReader(INPUTFILE);
        int n = reader.getNumberOfPages();
        PdfImportedPage page;

        for (int i = 1; i <= n; i++) {
            page = writer.getImportedPage(reader, i);
            Image instance = Image.getInstance(page);
            document.add(instance);
        }

        document.close();
    }
}

Thanks in advance


Answer 1

Answered by Leniel Maccaferri

Take a look at this:

How to Read PDF File in Java (uses the Apache PDFBox library)

Answer 2

Answered by dina

Using org.apache.pdfbox:

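import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Assumes PDFBox 2.x; in PDFBox 3.x, PDDocument.load is replaced by Loader.loadPDF.
// The class name PdfToText is only a placeholder wrapper for the two methods.
public class PdfToText {

    public static String convertPDFToTxt(String filePath) throws IOException {
        byte[] thePDFFileBytes = readFileAsBytes(filePath);
        // try-with-resources closes the document even if text extraction fails
        try (PDDocument pddDoc = PDDocument.load(thePDFFileBytes)) {
            PDFTextStripper reader = new PDFTextStripper();
            return reader.getText(pddDoc);
        }
    }

    // Reads the whole file into memory using the JDK instead of IOUtils
    private static byte[] readFileAsBytes(String filePath) throws IOException {
        return Files.readAllBytes(Paths.get(filePath));
    }
}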
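
convertPDFToTxt returns the whole document as one string, while the question asks for the words one by one in a text file. A minimal sketch of that last step follows, using only the JDK and assuming the text has already been extracted. The class WordExtractor and its method names are hypothetical, and treating a "word" as a run of letters or digits is just one possible definition (it splits "don't" into two tokens, for instance).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of PDFBox or iText.
class WordExtractor {

    // Splits text into words. Assumption: a word is a maximal run of
    // Unicode letters or digits; everything else is a separator.
    static List<String> splitIntoWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            // split() can yield an empty leading token; skip it
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Writes one word per line to a UTF-8 text file.
    static void writeWords(String text, Path outputFile) throws IOException {
        Files.write(outputFile, splitIntoWords(text), StandardCharsets.UTF_8);
    }
}
```

It could be combined with the method above as WordExtractor.writeWords(convertPDFToTxt("input.pdf"), Paths.get("words.txt")), where both file names are placeholders.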
