Getting PDF TextObjects with PDFBox
Original question: http://stackoverflow.com/questions/25398325/
License: content from Stack Overflow, provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors.
Asked by Phil
I have a PDF from which I extracted a page using PDFBox:
(...)
File input = new File("C:\\temp\\sample.pdf");
document = PDDocument.load(input);
List allPages = document.getDocumentCatalog().getAllPages();
// index 2 is the third page of the document
PDPage page = (PDPage) allPages.get(2);
PDStream contents = page.getContents();
if (contents != null) {
    System.out.println(contents.getInputStreamAsString());
(...)
This gives the following result, which looks like something you'd expect, based on the PDF spec.
q
/GS0 gs
/Fm0 Do
Q
/Span <</Lang (en-US)/MCID 88 >>BDC
BT
/CS0 cs 0 0 0 scn
/GS1 gs
/T1_0 1 Tf
8.5 0 0 8.5 70.8661 576 Tm
(This page has been intentionally left blank.)Tj
ET
EMC
1 1 1 scn
/GS0 gs
22.677 761.102 28.346 32.599 re
f
/Span <</Lang (en-US)/MCID 89 >>BDC
BT
0.531 0.53 0.528 scn
/T1_1 1 Tf
9 0 0 9 45.7136 761.1024 Tm
(2)Tj
ET
EMC
q
0 g
/Fm1 Do
Q
What I'm looking for is a way to extract the PDF text objects (as described in section 5.3 of the PDF spec) on the page as Java objects, so basically the pieces between BT and ET (two of them on this page). Each should at least contain everything between the parentheses preceding 'Tj' as a String, and an x and y coordinate based on the 'Tm' (or a 'Td' operator, etc.). Other attributes would be a bonus, but are not required.
The PDFTextStripper seems to give me either each character with attributes as a TextPosition (too much noise for my purpose), or all the Text as one long String.
Does PDFBox have a feature that parses a Page and provides TextObjects like this that I missed? Or else, if I am to extend PDFBox to get what I need, where should I start? Any help is welcome.
EDIT: Found another question here, that gives inspiration on how I might build what I need. If I succeed, I'll check back. Still looking forward to any help you may have, though.
Thanks,
Phil
Answered by Phil
Based on the linked question and the hint by mkl yesterday (thanks!), I've decided to build something to parse the tokens. Something to consider is that within a PDF text object, the attributes (operands) precede their operator, so I gather all attributes in a collection until I encounter the operator. Then, once I know which operator the attributes belong to, I move them to their proper locations. This is what I've come up with:
import java.io.File;
import java.util.List;

import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFOperator;

public class TextExtractor {
    public static void main(String[] args) {
        try {
            File input = new File("C:\\some\\file.pdf");
            PDDocument document = PDDocument.load(input);
            List allPages = document.getDocumentCatalog().getAllPages();
            // just parsing one page here (index 2, i.e. the third page), as it's only a sample
            PDPage page = (PDPage) allPages.get(2);
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream());
            parser.parse();
            List tokens = parser.getTokens();
            // true while the tokens being parsed are part of a text object
            boolean parsingTextObject = false;
            PDFTextObject textobj = new PDFTextObject();
            for (int i = 0; i < tokens.size(); i++) {
                Object next = tokens.get(i);
                if (next instanceof PDFOperator) {
                    PDFOperator op = (PDFOperator) next;
                    switch (op.getOperation()) {
                        case "BT":
                            // BT: Begin Text.
                            parsingTextObject = true;
                            textobj = new PDFTextObject();
                            break;
                        case "ET":
                            // ET: End Text.
                            parsingTextObject = false;
                            System.out.println("Text: " + textobj.getText() + "@" + textobj.getX() + "," + textobj.getY());
                            break;
                        case "Tj":
                            textobj.setText();
                            break;
                        case "Tm":
                            textobj.setMatrix();
                            break;
                        default:
                            // System.out.println("unsupported operation " + op.getOperation());
                            break;
                    }
                    // the collected attributes belonged to this operator, so start over
                    textobj.clearAllAttributes();
                } else if (parsingTextObject) {
                    textobj.addAttribute(next);
                }
            }
            document.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
In combination with:
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSFloat;
import org.apache.pdfbox.cos.COSInteger;
import org.apache.pdfbox.cos.COSString;

class PDFTextObject {
    private List<Object> attributes = new ArrayList<Object>();
    private String text = "";
    private float x = -1;
    private float y = -1;

    public void clearAllAttributes() {
        attributes = new ArrayList<Object>();
    }

    public void addAttribute(Object anAttribute) {
        attributes.add(anAttribute);
    }

    public void setText() {
        // Move the contents of the attributes to the text attribute.
        for (int i = 0; i < attributes.size(); i++) {
            if (attributes.get(i) instanceof COSString) {
                COSString aString = (COSString) attributes.get(i);
                text = text + aString.getString();
            } else {
                System.out.println("Whoops! Wrong type of property...");
            }
        }
    }

    public String getText() {
        return text;
    }

    public void setMatrix() {
        // Move the contents of the attributes to the x and y attributes.
        // A Matrix has 6 attributes, the last two of which are x and y
        for (int i = 4; i < attributes.size(); i++) {
            float curval = -1;
            if (attributes.get(i) instanceof COSInteger) {
                COSInteger aCOSInteger = (COSInteger) attributes.get(i);
                curval = aCOSInteger.floatValue();
            }
            if (attributes.get(i) instanceof COSFloat) {
                COSFloat aCOSFloat = (COSFloat) attributes.get(i);
                curval = aCOSFloat.floatValue();
            }
            switch (i) {
                case 4:
                    x = curval;
                    break;
                case 5:
                    y = curval;
                    break;
            }
        }
    }

    public float getX() {
        return x;
    }

    public float getY() {
        return y;
    }
}
It gives the output:
Text: This page has been intentionally left blank.@70.8661,576.0
Text: 2@45.7136,761.1024
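The setMatrix method above reads x and y straight from the last two Tm operands. For the Td operator mentioned in the question, the PDF spec defines the move relative to the current text line matrix, so the offset has to be multiplied through the matrix rather than read directly. Below is a minimal, stdlib-only sketch of that bookkeeping. The class name TextPositionSketch and the Td call in main are assumptions made for illustration (the sample page contains no Td), and this is not part of PDFBox.

```java
class TextPositionSketch {
    // Text line matrix [a b c d e f]; it starts as the identity matrix at BT.
    private float a = 1, b = 0, c = 0, d = 1, e = 0, f = 0;

    /** Tm: replaces the matrix with the six operands a b c d e f. */
    void applyTm(float[] m) {
        a = m[0]; b = m[1]; c = m[2]; d = m[3]; e = m[4]; f = m[5];
    }

    /** Td: translates by (tx, ty), i.e. [1 0 0 1 tx ty] multiplied with the current matrix. */
    void applyTd(float tx, float ty) {
        e = tx * a + ty * c + e;
        f = tx * b + ty * d + f;
    }

    float x() { return e; }

    float y() { return f; }

    public static void main(String[] args) {
        TextPositionSketch pos = new TextPositionSketch();
        // Operands of the first Tm in the sample content stream
        pos.applyTm(new float[] {8.5f, 0f, 0f, 8.5f, 70.8661f, 576f});
        System.out.println(pos.x() + "," + pos.y());
        // Hypothetical "0 -2 Td": two text space units down, scaled by the matrix
        pos.applyTd(0f, -2f);
        System.out.println(pos.x() + "," + pos.y());
    }
}
```

In the PDFBox code, the same arithmetic could live in an extra case "Td" branch next to "Tm", but that wiring is left out here.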
While it does the trick, I'm sure I've broken some conventions and haven't always written the most elegant code. Improvements and alternate solutions are welcome.
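The core idea of the answer (collect operands until an operator arrives, then dispatch on the operator) can also be tried without PDFBox. The sketch below is only an illustration under loud assumptions: the tokenizer is a toy that handles plain "(...)" strings without nesting or escapes, it does not understand dictionaries or arrays, and the names OperandDispatchSketch, TextObject, tokenize and extract are invented here. It returns the text objects as a list instead of printing them.

```java
import java.util.ArrayList;
import java.util.List;

class OperandDispatchSketch {

    /** One extracted text object: the shown string plus the Tm translation. */
    static final class TextObject {
        final StringBuilder text = new StringBuilder();
        float x = -1, y = -1;
    }

    /** Splits a content stream fragment into tokens; "(...)" becomes one token. */
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (Character.isWhitespace(ch)) {
                i++;
            } else if (ch == '(') {
                // Assumption: no nested or escaped parentheses inside the string
                int end = s.indexOf(')', i);
                if (end < 0) {
                    end = s.length() - 1;
                }
                tokens.add(s.substring(i, end + 1));
                i = end + 1;
            } else {
                int start = i;
                while (i < s.length() && !Character.isWhitespace(s.charAt(i)) && s.charAt(i) != '(') {
                    i++;
                }
                tokens.add(s.substring(start, i));
            }
        }
        return tokens;
    }

    /** Strings, names and numbers are operands; everything else is treated as an operator. */
    static boolean isOperand(String t) {
        return t.startsWith("(") || t.startsWith("/") || t.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    static List<TextObject> extract(String content) {
        List<TextObject> result = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        TextObject current = null;
        for (String t : tokenize(content)) {
            if (isOperand(t)) {
                operands.add(t);
                continue;
            }
            switch (t) {
                case "BT":
                    current = new TextObject();
                    break;
                case "ET":
                    if (current != null) {
                        result.add(current);
                    }
                    current = null;
                    break;
                case "Tj":
                    if (current != null && !operands.isEmpty()) {
                        String s = operands.get(operands.size() - 1);
                        if (s.length() >= 2 && s.startsWith("(")) {
                            current.text.append(s, 1, s.length() - 1); // strip the parentheses
                        }
                    }
                    break;
                case "Tm":
                    if (current != null && operands.size() >= 6) {
                        current.x = Float.parseFloat(operands.get(operands.size() - 2));
                        current.y = Float.parseFloat(operands.get(operands.size() - 1));
                    }
                    break;
                default:
                    break; // operator not handled in this sketch
            }
            operands.clear(); // operands belong to exactly one operator
        }
        return result;
    }

    public static void main(String[] args) {
        // Second text object from the sample content stream
        String sample = "BT\n/T1_1 1 Tf\n9 0 0 9 45.7136 761.1024 Tm\n(2)Tj\nET";
        for (TextObject o : extract(sample)) {
            System.out.println("Text: " + o.text + "@" + o.x + "," + o.y);
        }
    }
}
```

For real documents the PDFBox parser remains the better choice, since it already deals with escapes, hex strings, dictionaries and arrays.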
Answered by raisercostin
Here is a version of Phil's answer adapted to pdfbox-2.0.1:
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSFloat;
import org.apache.pdfbox.cos.COSInteger;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;

public class TextExtractor {
    public static void main(String[] args) {
        try {
            File input = new File("src\\test\\resources\\files\\file1.pdf");
            PDDocument document = PDDocument.load(input);
            PDPageTree allPages = document.getDocumentCatalog().getPages();
            // just parsing the first page (index 0) here, as it's only a sample
            PDPage page = allPages.get(0);
            PDFStreamParser parser = new PDFStreamParser(page);
            parser.parse();
            List<Object> tokens = parser.getTokens();
            // true while the tokens being parsed are part of a text object
            boolean parsingTextObject = false;
            PDFTextObject textobj = new PDFTextObject();
            for (int i = 0; i < tokens.size(); i++) {
                Object next = tokens.get(i);
                if (next instanceof Operator) {
                    Operator op = (Operator) next;
                    switch (op.getName()) {
                        case "BT":
                            // BT: Begin Text.
                            parsingTextObject = true;
                            textobj = new PDFTextObject();
                            break;
                        case "ET":
                            // ET: End Text.
                            parsingTextObject = false;
                            System.out.println("Text: " + textobj.getText() + "@" + textobj.getX() + "," + textobj.getY());
                            break;
                        case "Tj":
                            textobj.setText();
                            break;
                        case "Tm":
                            textobj.setMatrix();
                            break;
                        default:
                            System.out.println("unsupported operation " + op);
                            break;
                    }
                    // the collected attributes belonged to this operator, so start over
                    textobj.clearAllAttributes();
                } else if (parsingTextObject) {
                    textobj.addAttribute(next);
                } else {
                    System.out.println("ignore " + next.getClass() + " -> " + next);
                }
            }
            document.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    static class PDFTextObject {
        private List<Object> attributes = new ArrayList<Object>();
        private String text = "";
        private float x = -1;
        private float y = -1;

        public void clearAllAttributes() {
            attributes = new ArrayList<Object>();
        }

        public void addAttribute(Object anAttribute) {
            attributes.add(anAttribute);
        }

        public void setText() {
            // Move the contents of the attributes to the text attribute.
            for (int i = 0; i < attributes.size(); i++) {
                if (attributes.get(i) instanceof COSString) {
                    COSString aString = (COSString) attributes.get(i);
                    text = text + aString.getString();
                } else {
                    System.out.println("Whoops! Wrong type of property...");
                }
            }
        }

        public String getText() {
            return text;
        }

        public void setMatrix() {
            // Move the contents of the attributes to the x and y attributes.
            // A Matrix has 6 attributes, the last two of which are x and y
            for (int i = 4; i < attributes.size(); i++) {
                float curval = -1;
                if (attributes.get(i) instanceof COSInteger) {
                    COSInteger aCOSInteger = (COSInteger) attributes.get(i);
                    curval = aCOSInteger.floatValue();
                }
                if (attributes.get(i) instanceof COSFloat) {
                    COSFloat aCOSFloat = (COSFloat) attributes.get(i);
                    curval = aCOSFloat.floatValue();
                }
                switch (i) {
                    case 4:
                        x = curval;
                        break;
                    case 5:
                        y = curval;
                        break;
                }
            }
        }

        public float getX() {
            return x;
        }

        public float getY() {
            return y;
        }
    }
}
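Both versions above only react to Tj. The PDF spec also defines TJ, whose single operand is an array that mixes strings with numeric position adjustments, for example [(He) -20 (llo)] TJ. With PDFBox that operand should arrive as one array token (a COSArray, as far as I know) rather than as a COSString, so setText would report the wrong type. The stdlib-only sketch below shows just the joining step on a plain Java list. The class name ShowTextArraySketch, the method joinStrings and the sample array are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

class ShowTextArraySketch {

    /** Joins the string elements of a TJ operand array, skipping the numeric adjustments. */
    static String joinStrings(List<Object> tjElements) {
        StringBuilder sb = new StringBuilder();
        for (Object element : tjElements) {
            if (element instanceof String) {
                sb.append((String) element);
            }
            // Numbers are position adjustments in thousandths of a text space unit;
            // this sketch ignores them instead of translating them into spaces.
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical operand array, as in: [(He) -20 (llo)] TJ
        List<Object> elements = Arrays.<Object>asList("He", -20, "llo");
        System.out.println(joinStrings(elements));
    }
}
```

Some producers encode word gaps as large adjustments instead of space characters, so a fuller version might insert a space above some threshold; that threshold would be a heuristic, not something the spec prescribes.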