Original source: http://stackoverflow.com/questions/6831765/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
What is the easiest way to extract data from a PDF?
Asked by Sebastian Fork
I need to extract data from some PDF documents (using Java). I need to know what would be the easiest way to do it.
I tried iText. It's fairly complicated for my needs. Besides, I believe it is not available for free for commercial projects, so it is not an option. I also gave PDFBox a try and ran into various NoClassDefFoundError errors.
I googled and came across several other options such as PDF Clown and jPod, but I do not have time to experiment with all of these libraries. I am relying on the community's experience with reading PDFs through Java.
Note that I do not need to create or manipulate PDF documents. I just need to extract textual data from PDF documents with a moderate level of layout complexity.
Please suggest the quickest and easiest way to extract text from PDF documents. Thanks.
Answered by Kyle
I recommend trying Apache Tika. Apache Tika is basically a toolkit that extracts data from many types of documents, including PDFs.
One benefit of Tika (besides being free) is that it used to be a subproject of Apache Lucene, which is a very robust open-source search engine. Tika includes a built-in PDF parser that uses a SAX ContentHandler to pass PDF data to your application. It can also extract data from encrypted PDFs, and it allows you to create a parser or subclass an existing one to customize the behavior.
The code is simple. To extract the data from a PDF, all you need to do is create a class that implements the Parser interface and defines a parse() method:
public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
    metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
    metadata.set("Hello", "World");
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    xhtml.endDocument();
}
Then, to run the parser, you could do something like this:
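import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
// try-with-resources closes the stream even if parsing fails
try (InputStream input = new FileInputStream(resourceLocation)) {
    parser.parse(input, textHandler, metadata, new ParseContext());
}
System.out.println("Title: " + metadata.get("title"));
System.out.println("Author: " + metadata.get("Author"));
System.out.println("content: " + textHandler.toString());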
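If all that is needed is the plain text, a custom Parser may not be necessary. The following is a minimal sketch, assuming Tika's standard parser modules (not only tika-core) are on the classpath; it uses the Tika facade class and lets Tika detect the file type. The class name TikaTextExtractor and the method name extractText are illustrative and not part of the answer above.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaTextExtractor {

    /** Detects the document type and returns its text content as a String. */
    public static String extractText(File file) throws IOException, TikaException {
        Tika tika = new Tika();
        // parseToString truncates long documents by default; -1 removes the limit.
        tika.setMaxStringLength(-1);
        return tika.parseToString(file);
    }

    public static void main(String[] args) throws Exception {
        // Usage: java TikaTextExtractor path/to/file.pdf
        System.out.println(extractText(new File(args[0])));
    }
}
```

This trades control for brevity: the facade hides the ContentHandler and Metadata objects, so the explicit PDFParser approach is still the better fit when metadata such as title or author is needed.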
Answered by Maurício Linhares
I am using JPedal and I'm really happy with the results. It isn't free, but it's high quality, and the output for image generation from PDFs or text extraction is really nice.
And as it is a paid library, support is always there to answer questions.
Answered by Petteri Hietavirta
I have used PDFBox to extract text for Lucene indexing without too many issues. Its error/warning logging is quite verbose if I remember right - what was the cause for those errors you received?
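For reference, text extraction with PDFBox can be quite short. The sketch below assumes the PDFBox 2.x API (PDDocument.load; PDFBox 3.x loads documents differently) and assumes PDFBox's own dependencies, such as fontbox and commons-logging, are on the classpath. Missing dependency JARs are a common cause of NoClassDefFoundError, though whether that applies to the errors in the question is a guess. The class name PdfBoxTextExtractor and the method name extractText are illustrative.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfBoxTextExtractor {

    /** Returns the text of every page in the given PDF file. */
    public static String extractText(File pdfFile) throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Sorting by position can help with multi-column or table-like layouts
            stripper.setSortByPosition(true);
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PdfBoxTextExtractor path/to/file.pdf
        System.out.println(extractText(new File(args[0])));
    }
}
```

To limit extraction to a page range, PDFTextStripper also offers setStartPage and setEndPage (1-based) before calling getText.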
Answered by testing123
I understand this post is pretty old, but I would recommend using iText from here: http://sourceforge.net/projects/itext/ If you are using Maven, you can pull the JARs in from Maven Central: http://mvnrepository.com/artifact/com.itextpdf/itextpdf
I can't understand how using it can be difficult:
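import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

PdfReader pdf = new PdfReader("path to your pdf file");
// In iText 5, PdfTextExtractor exposes static methods and is not instantiated
String output = PdfTextExtractor.getTextFromPage(pdf, pageNumber);
assert output.contains("whatever you want to validate on that page");
pdf.close();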
Answered by vishal kavita rathi
Import these classes and add the JAR file pdfbox-app-2.0 to the classpath.
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.FindBy;
import org.testng.Assert;
import org.testng.annotations.Test;
import java.io.File;
import java.io.IOException;
import java.text.ParseException;
import java.util.List;
import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.chrome.ChromeDriver;
import com.coencorp.selenium.framework.BasePage;
import com.coencorp.selenium.framework.ExcelReadWrite;
import com.relevantcodes.extentreports.LogStatus;
Add this code inside the class.
public void showList() throws InterruptedException, IOException {
    showInspectionsLink.click();
    waitForElement(hideInspectionsLink);
    printButton.click();
    Thread.sleep(10000); // crude fixed wait for the download to finish
    // Backslashes in Windows paths must be escaped in Java string literals
    String downloadPath = "C:\\Users\\Updoer\\Downloads";
    File getLatestFile = getLatestFilefromDir(downloadPath);
    String fileName = getLatestFile.getName();
    Assert.assertTrue(fileName.equals("Inspections.pdf"),
            "Downloaded file name is not matching with expected file name");
    Thread.sleep(10000);
    //testVerifyPDFInURL();
    // Load the downloaded PDF, then print its page count and text (PDFBox 2.x API)
    try (PDDocument pd = PDDocument.load(new File("C:\\Users\\Updoer\\Downloads\\Inspections.pdf"))) {
        System.out.println("Total Pages:" + pd.getNumberOfPages());
        PDFTextStripper pdf = new PDFTextStripper();
        System.out.println(pdf.getText(pd));
    }
}
Add these methods to the same class.
public void testVerifyPDFInURL() {
    WebDriver driver = new ChromeDriver();
    // Backslashes in Windows paths must be escaped in Java string literals
    driver.get("C:\\Users\\Updoer\\Downloads\\Inspections.pdf");
    driver.findElement(By.linkText("Adeeb Khan")).click();
    String getURL = driver.getCurrentUrl();
    Assert.assertTrue(getURL.contains(".pdf"));
}

// Returns the most recently modified file in dirPath, or null if the directory is empty or unreadable
private File getLatestFilefromDir(String dirPath) {
    File dir = new File(dirPath);
    File[] files = dir.listFiles();
    if (files == null || files.length == 0) {
        return null;
    }
    File lastModifiedFile = files[0];
    for (int i = 1; i < files.length; i++) {
        if (lastModifiedFile.lastModified() < files[i].lastModified()) {
            lastModifiedFile = files[i];
        }
    }
    return lastModifiedFile;
}
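The fixed Thread.sleep(10000) pauses and the manual loop in getLatestFilefromDir could also be written with java.nio.file from the standard library. This is only a sketch of one alternative, not the answer's method: it polls until the downloaded file exists and is non-empty, and picks the newest regular file with a stream. The class name DownloadWatcher, the method names, and the 200 ms polling interval are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

public class DownloadWatcher {

    /** Returns the most recently modified regular file in dir, if there is one. */
    public static Optional<Path> latestFile(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .max(Comparator.comparingLong(DownloadWatcher::lastModifiedMillis));
        }
    }

    private static long lastModifiedMillis(Path path) {
        try {
            return Files.getLastModifiedTime(path).toMillis();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Polls until file exists and is non-empty; returns false once timeout elapses. */
    public static boolean waitForFile(Path file, Duration timeout)
            throws IOException, InterruptedException {
        long start = System.nanoTime();
        while (true) {
            if (Files.isRegularFile(file) && Files.size(file) > 0) {
                return true;
            }
            if (System.nanoTime() - start >= timeout.toNanos()) {
                return false;
            }
            Thread.sleep(200); // polling interval, chosen arbitrarily
        }
    }
}
```

A non-empty file is not necessarily a finished download, so a stricter check (for example, waiting until the size stops changing) may be needed for large files.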