Java: From PDF to String

Attribution: This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/1678435/

Time: 2020-10-29 17:34:36  Source: igfitidea

From PDF to String

Tags: java, pdf, text, io

Asked by Ankur

What is the easiest way to get the text (words) of a PDF file as one long String or an array of Strings?

I have tried PDFBox, but it is not working for me.

Answered by Kushal Paudyal

Use iText. The following snippet, for example, extracts the text of page 3.

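// Older iText releases: PdfTextExtractor is constructed from a PdfReader.
PdfReader reader = new PdfReader("C:/Text.pdf");
PdfTextExtractor parser = new PdfTextExtractor(reader);
String pageText = parser.getTextFromPage(3); // page numbers start at 1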

Answered by Sam Barnum

PDFBox barfs on many newer PDFs, especially those with embedded PNG images.

I was very impressed with PDFTextStream.

Answered by mark stephens

JPedal and Multivalent also offer text extraction in Java, or you could access xpdf using Runtime.exec.
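
Calling xpdf from Java could look roughly like the sketch below. It is only an illustration under some assumptions: that xpdf's `pdftotext` command-line tool is installed and on the PATH, that passing `-` as the output file makes it write to standard output, and that Java 9 or later is available (for `readAllBytes`). It uses `ProcessBuilder` rather than `Runtime.exec` directly, since that makes it easier to pass the arguments as a list; the class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class XpdfTextExtractor {

    // Builds the command line. Assumes "pdftotext" is on the PATH;
    // "-" as the output file asks the tool to write to standard output.
    public static List<String> buildCommand(String pdfPath) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Runs the tool and returns everything it wrote to stdout as one String.
    public static String extractText(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath))
                // Send the tool's warnings to our own stderr so they neither
                // mix into the text nor block the child process.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String text;
        try (InputStream out = process.getInputStream()) {
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText(args[0]));
    }
}
```

Reading the output stream to the end before calling `waitFor()` matters here: a large PDF can fill the pipe buffer, and the child process would then stall if nobody is reading.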

Answered by yeaaaahhhh..hamf hamf

Well, I have used Tika to extract raw text from PDFs (it is based on PDFBox), but I think Tika is useful only when you have to extract text from different file formats (auto-detection helps a lot).

If you only want to parse PDFs into text, I would suggest PDFTextStream because it is a much better parser than other APIs (such as iText and PDFBox).

With PDFTextStream you can easily get structured text (pages -> blocks -> lines -> textUnits), and it lets you extract related information such as the character encoding, height, and location of a character on the page.

Example:

import java.io.IOException;

import com.snowtide.pdf.OutputTarget;
import com.snowtide.pdf.PDFTextStream;

public class ExtractTextAllPages {
    public static void main(String[] args) throws IOException {
        String pdfFilePath = args[0];
        PDFTextStream pdfts = new PDFTextStream(pdfFilePath);
        // Collect the text of every page into a single buffer.
        StringBuilder text = new StringBuilder(1024);
        pdfts.pipe(new OutputTarget(text));
        pdfts.close();
        System.out.printf("The text extracted from %s is:%n", pdfFilePath);
        System.out.println(text);
    }
}
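
The question also asks for an array of Strings. Once any of the libraries above has produced the text as one long String, turning it into words needs only the JDK. The sketch below is one simple way to do it, assuming that "words" means runs of non-whitespace characters; the class and method names are hypothetical.

```java
public class TextToWords {

    // Splits extracted text into words on any run of whitespace
    // (spaces, tabs, line breaks). Returns an empty array for blank input.
    public static String[] toWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        for (String word : toWords("  Hello PDF\nworld  ")) {
            System.out.println(word);
        }
    }
}
```

Punctuation stays attached to the neighbouring word with this approach; stripping it would need an extra step that depends on what counts as a word in your documents.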