Java PDFBox: header version info error

Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/19013936/

Time: 2020-08-12 13:22:32  Source: igfitidea

pdfbox header version info error

Tags: java, parsing, pdf, pdfbox

Asked by user2638084

I used PDFBox to parse a PDF document. It throws an exception saying that it cannot find the header version info. Any idea?

I think the version is 1.3; I saw it when I cast every byte to char. The link is http://www.selab.isti.cnr.it/ws-mate/example.pdf

Here is the code of the method, followed by its output:

public String PDFtest(String textLink) throws IOException {
    PDFParser parser;
    String parsedText = null;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;

    StringBuilder sd = new StringBuilder();
    URL link;
    try {
        link = new URL(textLink);
        URLConnection urlConn = link.openConnection();
        BufferedInputStream in = null;
        in = new BufferedInputStream(urlConn.getInputStream());
        byte data[] = new byte[1024];
        in.read(data, 0, 1024);

        parser = new PDFParser(in);
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
    } catch (MalformedURLException ex) {
        Logger.getLogger(HTMLhelper.class.getName()).log(Level.SEVERE, null, ex);
    } catch (NumberFormatException e) {
        System.out.println("hata");
    }

    return parsedText;
}

Exception:

Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
    at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:317)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:173)
    at ParsingMachine.HTMLhelper.PDFtest(HTMLhelper.java:99)
    at ParsingMachine.tester.main(tester.java:18)
Java Result: 1

Answered by mkl

You first read the leading kilobyte of data into a byte array:

in.read(data, 0, 1024);

and then you expect PDFBox to get along with the remaining bytes:

parser = new PDFParser(in);
parser.parse();

Most likely the actual PDF header (the %PDF-1.x line at the start of the file) is contained in those leading bytes that you withheld from the PDFBox parser.

Thus, simply let PDFBox read all of the data: do not consume any bytes from the stream before handing it to the parser.
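
If the leading bytes are only needed for inspection (for example, to look at the version string), one option is to peek at them and rewind before parsing. The following is a minimal sketch using only the JDK, not a PDFBox API; HeaderPeekDemo, peek, and the sample bytes are hypothetical names and data chosen for illustration, and readNBytes assumes Java 9 or later. After peek returns, the stream is back at its start, so it could then be passed to new PDFParser(in) as in the question.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class HeaderPeekDemo {

    // Returns up to `limit` leading bytes as text WITHOUT consuming them.
    static String peek(BufferedInputStream in, int limit) throws IOException {
        in.mark(limit);                         // remember the current position
        byte[] head = new byte[limit];
        int n = in.readNBytes(head, 0, limit);  // Java 9+; reads until limit or EOF
        in.reset();                             // rewind to the marked position
        return new String(head, 0, n, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real download stream (assumption: sample data only).
        byte[] fake = "%PDF-1.3\n...rest of file...".getBytes(StandardCharsets.ISO_8859_1);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(fake));

        System.out.println("Header seen: " + peek(in, 8));
        // The next byte read is still the first byte of the file.
        System.out.println("First byte after peek: " + (char) in.read());
    }
}
```

If the bytes are not needed at all, deleting the in.read(data, 0, 1024); line from the question's code has the same effect.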

Answered by asraniinfo

You are probably merging a file that is not in PDF format. Please check carefully whether the list contains any file other than a PDF.
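
One way to check a list of files before merging is to look for the %PDF- marker near the start of each file. This is a rough sketch and not what PDFBox does internally; PdfHeaderCheck and looksLikePdf are hypothetical names, readNBytes assumes Java 9 or later, and searching the first 1024 bytes (rather than only the very first bytes) is a lenient assumption made here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PdfHeaderCheck {

    // Assumption: a file counts as "PDF-like" if "%PDF-" occurs in its first 1024 bytes.
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] head = new byte[1024];
        int n;
        try (InputStream in = Files.newInputStream(file)) {
            n = in.readNBytes(head, 0, head.length);  // Java 9+
        }
        // ISO-8859-1 maps every byte to exactly one char, so binary data is safe here.
        String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        return text.contains("%PDF-");
    }

    public static void main(String[] args) throws IOException {
        // Report each path given on the command line.
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println(p + " -> " + (looksLikePdf(p) ? "PDF header found" : "no PDF header"));
        }
    }
}
```

Files that fail this check can be skipped or reported before they reach the parser.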

Answered by murphy1310

In my case, I was iterating through the files in a directory.
Windows may place a hidden Thumbs.db file in a directory.
This was interfering with the PDF file processing.
Applying a filter to pick only PDF files (*.pdf) helped.
Cheers.
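
The filter described in this answer could look roughly like the following with java.nio.file. PdfFileFilter and listPdfFiles are hypothetical names; matching on the file extension is an assumption that the names are trustworthy, and it does not verify file contents.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class PdfFileFilter {

    // Lists regular files directly inside `dir` whose name ends in ".pdf" (case-insensitive).
    static List<Path> listPdfFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no argument is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        listPdfFiles(dir).forEach(System.out::println);
    }
}
```

Entries such as Thumbs.db never reach the parser this way, because only names ending in .pdf are returned.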