Java pdfbox header version info error
Notice: this page is based on a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/19013936/
pdfbox header version info error
Asked by user2638084
I used PDFBox to parse this PDF document. It throws an exception saying it cannot find the header version info. Any ideas?
I think the version is 1.3; I saw it when I cast every byte to char. The link is http://www.selab.isti.cnr.it/ws-mate/example.pdf
Here is the code of the method and its output:
public String PDFtest(String textLink) throws IOException {
    PDFParser parser;
    String parsedText = null;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;
    StringBuilder sd = new StringBuilder();
    URL link;
    try {
        link = new URL(textLink);
        URLConnection urlConn = link.openConnection();
        BufferedInputStream in = null;
        in = new BufferedInputStream(urlConn.getInputStream());
        byte data[] = new byte[1024];
        in.read(data, 0, 1024);
        parser = new PDFParser(in);
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
    } catch (MalformedURLException ex) {
        Logger.getLogger(HTMLhelper.class.getName()).log(Level.SEVERE, null, ex);
    } catch (NumberFormatException e) {
        System.out.println("hata");
    }
    return parsedText;
}
Exception:
Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
    at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:317)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:173)
    at ParsingMachine.HTMLhelper.PDFtest(HTMLhelper.java:99)
    at ParsingMachine.tester.main(tester.java:18)
Java Result: 1
Answered by mkl
You first read the leading kilobyte of data into a byte array:
in.read(data, 0, 1024);
and then you expect PDFBox to get along with the remaining bytes:
parser = new PDFParser(in);
parser.parse();
Most likely the actual PDF header is contained in those leading bytes, which you withheld from the PDFBox parser.
Thus, simply allow PDFBox to read all the data: drop the in.read(data, 0, 1024) call and pass the untouched stream to PDFParser.
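To see the effect without PDFBox, here is a minimal sketch that uses only the JDK. The class name HeaderPeekSketch and the helpers consume, peek and nextBytesArePdfHeader are hypothetical names made up for this illustration, and the short byte array stands in for the downloaded file. It shows that once the leading bytes have been read, the stream no longer starts with %PDF-, and that mark/reset is one possible way to look at the header while still leaving the full stream for a parser.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class HeaderPeekSketch {

    /** Reads up to n bytes and returns them as text. The bytes are consumed. */
    static String consume(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int r = in.read(buf, total, n - total);
            if (r < 0) {
                break; // end of stream reached before n bytes
            }
            total += r;
        }
        return new String(buf, 0, total, StandardCharsets.ISO_8859_1);
    }

    /** Reads up to n bytes, then rewinds so the next reader sees them again. */
    static String peek(BufferedInputStream in, int n) throws IOException {
        in.mark(n);
        try {
            return consume(in, n);
        } finally {
            in.reset();
        }
    }

    /** True if the next five bytes of the stream are the PDF magic "%PDF-". */
    static boolean nextBytesArePdfHeader(BufferedInputStream in) throws IOException {
        return "%PDF-".equals(peek(in, 5));
    }

    static BufferedInputStream fakePdf() {
        // Stand-in data (an assumption for this demo), not a real PDF file.
        byte[] bytes = "%PDF-1.3\nrest of file".getBytes(StandardCharsets.ISO_8859_1);
        return new BufferedInputStream(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws IOException {
        // What the question's code does: read first, parse afterwards.
        BufferedInputStream consumed = fakePdf();
        consume(consumed, 8);
        System.out.println("header left after consume: " + nextBytesArePdfHeader(consumed)); // false

        // Peeking instead keeps the header available.
        BufferedInputStream peeked = fakePdf();
        System.out.println("peeked: " + peek(peeked, 8)); // %PDF-1.3
        System.out.println("header left after peek: " + nextBytesArePdfHeader(peeked)); // true
    }
}
```

In the question's code the simpler fix is still to delete the in.read line; peeking like this is only worth it if you really want to inspect the header yourself before parsing.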
Answered by asraniinfo
You must be merging a file that is not in PDF format. Please check carefully whether the list contains any file other than a PDF.
Answered by murphy1310
In my case, I was iterating through the files in a directory.
Windows often creates a Thumbs.db file in a directory.
This was interfering with the PDF file processing.
Applying a filter to pick only PDF files (*.pdf) helped.
Cheers.
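A filter like that could look roughly like the sketch below, which uses only the JDK. The class name PdfFileFilterSketch and the method listPdfFiles are hypothetical, and matching the extension case-insensitively is my own assumption rather than something stated in the answer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PdfFileFilterSketch {

    /** Returns the regular files directly inside dir whose name ends in ".pdf" (any case). */
    static List<Path> listPdfFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo directory with one PDF-like file and one Thumbs.db stand-in.
        Path dir = Files.createTempDirectory("pdf-filter-demo");
        Files.write(dir.resolve("a.pdf"), "%PDF-1.3\n".getBytes(StandardCharsets.ISO_8859_1));
        Files.write(dir.resolve("Thumbs.db"), new byte[] {0});

        // Only a.pdf is listed; Thumbs.db never reaches the parser.
        for (Path p : listPdfFiles(dir)) {
            System.out.println(p.getFileName());
        }
    }
}
```

Note that this only checks the file name. A file with a .pdf extension but different content would still trigger the same header exception.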