Java: PDF to text tool or Java library?
Original source: http://stackoverflow.com/questions/583615/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
PDF to text tool or Java library?
Asked by Gary Kephart
I need to convert a PDF to normal text (it's the "statement of votes" from our county registrar). The files are big (2000 pages or so) and mostly contain tables. Once I get it into text, then I'm going to use a program I'm writing to parse it and put the data into a database. I've tried the 'Save as text' function in Adobe Reader, but it is not as precise as I'd like it, especially in delimiting the table data into CSV. So, any recommendations for tools or Java libraries that would do the trick?
Answer by Michael Myers
Well, there is iText. I have only limited experience with it, but it seems it can do what you want.
Apache PDFBox surely can do it. Its site mentions "PDF to text extraction" as its top feature. There's an ExtractText command line tool specifically for this (source code), based on its PDFTextStripper class. And there's a PDFBox Text Extraction Guide, too!
Answer by Arjan
Given the title of the question: Apache Tika worked very well for me to extract plain text from PDF. I've not used it to get text from tables though.
For PDF it's actually using PDFBox. But besides PDF, it does the same for other formats like Microsoft Word (doc and docx), Excel and PowerPoint, OpenOffice.org/LibreOffice ODT, HTML, XML, and many more. Its AutoDetectParser makes fetching text from any input easy.
And if one needs to process the resulting text (for example, by passing it to Mahout for classification), one can use ParsingReader to get the result into a Reader while a background process extracts it. Finally, while extracting the content, it also fills in the metadata it finds:
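import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ParsingReader;

public Reader getPlainTextReader(final InputStream is) {
    try {
        // Detect the input format automatically instead of assuming PDF.
        Detector detector = new DefaultDetector();
        Parser parser = new AutoDetectParser(detector);
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        Metadata metadata = new Metadata();
        // Parsing runs in the background; the Reader streams the extracted text.
        Reader reader = new ParsingReader(parser, is, metadata, context);
        // Log the metadata found so far (logger is a field of the enclosing class).
        for (String name : metadata.names()) {
            for (String value : metadata.getValues(name)) {
                logger.debug("Document {}: {}", name, value);
            }
        }
        return reader;
    } catch (IOException e) {
        ...
    }
}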
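As a rough illustration of consuming such a Reader, here is a minimal sketch that uses only the JDK. The class name ReaderLines and the method nonBlankLines are hypothetical, and a StringReader with made-up text stands in for the Reader that getPlainTextReader would return, so nothing below is part of Tika's API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the non-blank lines of any Reader.
class ReaderLines {

    // Reads the source line by line, trims each line, skips blank ones,
    // and closes the Reader when done.
    static List<String> nonBlankLines(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    lines.add(trimmed);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for extracted text; a real run would pass the parsing Reader.
        Reader standIn = new StringReader("Precinct 0001\n\n  Smith 412  \n");
        for (String line : nonBlankLines(standIn)) {
            System.out.println(line);
        }
    }
}
```

Reading line by line keeps memory use flat, which matters for inputs on the order of 2000 pages.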
Answer by Jarod Elliott
I have always found the xpdf tools very useful.
We successfully use its PDF-to-text conversion to convert PDF business documents for use in EDI. The option to preserve layout keeps the text positioned well enough to parse in a program.
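To make the "parse it in a program" step concrete, here is a minimal Java sketch that turns layout-preserved text lines into CSV rows. It rests on an assumption that may not hold for every document: columns are separated by runs of two or more spaces, and single spaces occur only inside a cell. The class LayoutTextToCsv, its methods, and the sample rows are hypothetical and not part of xpdf.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: converts layout-preserved text lines into CSV rows.
class LayoutTextToCsv {

    // Assumption: two or more consecutive spaces mark a column boundary.
    static List<String> splitColumns(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split(" {2,}")) {
            cells.add(cell);
        }
        return cells;
    }

    // Quotes a cell that contains a comma, a double quote, or a line break,
    // doubling any embedded double quotes.
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    static String toCsvLine(String line) {
        List<String> escaped = new ArrayList<>();
        for (String cell : splitColumns(line)) {
            escaped.add(escape(cell));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) {
        // Made-up rows standing in for layout-preserved output.
        String[] lines = {
            "Precinct        Candidate          Votes",
            "0001 North      Smith, Jane          412"
        };
        for (String line : lines) {
            System.out.println(toCsvLine(line));
        }
    }
}
```

Splitting on whitespace runs collapses empty cells, so a table with blank entries would need fixed column offsets instead.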
Answer by cemerick
PDFTextStream is our Java + .NET library for extracting content from PDF documents; you might give it a shot. Additionally, it provides some rudimentary table data extraction utilities, which sit on top of PDFTextStream's table detection capabilities. It's by no means a general solution (though we're working on one of those, too!), but if the tabular data is clearly defined (e.g. rows and columns bounded by lines, etc.), then you may find what's there now to be a proper solution.
Answer by SacramentoJoe
I use iText and I've been really happy with it. I've used xmlpdf before, and iText is far superior in my opinion.
Answer by Steve Claridge
Without knowing the layout of the pages in your PDF, it is difficult to say.
I would suggest downloading and trying both iText and PDFBox. You will find text extraction examples for both on their websites; you should have an extractor running in under 30 minutes, assuming you know your way around Java.
Start with PDFBox, as its text extraction abilities are better than iText's.
Someone else has mentioned xpdf, and that may be useful for you. It's a C library with some command line tools built around it. It has a number of text extractors, and you may be able to format the output easily enough. Again, it really depends on your page layout.
Answer by dirkgently
Use a text (line) printer to print to file.

