Java PDFBox: "Header doesn't contain versioninfo" error

Question

Asked by user2638084

I used PDFBox to parse a PDF document. It throws an exception saying that it cannot find the header version info. Any ideas?

I think the version is 1.3; I saw it when I cast every byte to char. The link is http://www.selab.isti.cnr.it/ws-mate/example.pdf

Here is the code of the method and its output:

    public String PDFtest(String textLink) throws IOException {
        PDFParser parser;
        String parsedText = null;
        PDFTextStripper pdfStripper;
        PDDocument pdDoc;
        COSDocument cosDoc;
        PDDocumentInformation pdDocInfo;

        StringBuilder sd = new StringBuilder();
        URL link;
        try {
            link = new URL(textLink);
            URLConnection urlConn = link.openConnection();
            BufferedInputStream in = null;
            in = new BufferedInputStream(urlConn.getInputStream());
            byte data[] = new byte[1024];
            in.read(data, 0, 1024);

            parser = new PDFParser(in);
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (MalformedURLException ex) {
            Logger.getLogger(HTMLhelper.class.getName()).log(Level.SEVERE, null, ex);
        } catch (NumberFormatException e) {
            System.out.println("hata");
        }

        return parsedText;
    }

Exception:

Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
    at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:317)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:173)
    at ParsingMachine.HTMLhelper.PDFtest(HTMLhelper.java:99)
    at ParsingMachine.tester.main(tester.java:18)
Java Result: 1

Answer 1

Answered by mkl

You first read the leading kilobyte of data into a byte array:

in.read(data, 0, 1024);

and then you expect PDFBox to get along with the remaining bytes:

parser = new PDFParser(in);
parser.parse();

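Most likely the actual PDF header is contained in those leading bytes that you withheld from the PDFBox parser, so the parser starts somewhere past the beginning of the file and never sees the version line.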

Thus, simply let PDFBox read all of the data: do not read from the stream before passing it to the parser.
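If the early read was only meant to look at the header (for example, to see the version), one option is to peek at the leading bytes and then rewind the stream before PDFBox gets it. The following is a minimal sketch using only the JDK; the class and method names (HeaderPeek, peek) are hypothetical, and the in-memory bytes merely stand in for the downloaded file.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class HeaderPeek {

    /**
     * Returns up to 'limit' leading bytes as text WITHOUT consuming them:
     * the stream is rewound afterwards, so a parser created later
     * still starts reading at the first byte.
     */
    static String peek(BufferedInputStream in, int limit) throws IOException {
        in.mark(limit);                        // remember the current position
        try {
            byte[] buf = new byte[limit];
            int total = 0;
            while (total < limit) {            // read() may return fewer bytes than requested
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break;                     // end of stream
                }
                total += n;
            }
            return new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        } finally {
            in.reset();                        // rewind to the marked position
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real download; an assumption for illustration only.
        byte[] fakePdf = "%PDF-1.3\n(rest of file)".getBytes(StandardCharsets.ISO_8859_1);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(fakePdf));

        System.out.println(peek(in, 8));        // %PDF-1.3
        System.out.println((char) in.read());   // % (the stream is still at the first byte)
    }
}
```

After peek returns, the same BufferedInputStream could be handed to new PDFParser(in) as in the question, because it is positioned at the first byte again.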

Answer 2

Answered by asraniinfo

You are probably merging a file that is not in PDF format. Check carefully whether the list contains any file other than a PDF.
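One way to run that check in code is to test whether each file starts with the "%PDF-" signature before handing it to PDFBox. This is a sketch under assumptions: the names PdfSignature, startsWithPdfHeader and looksLikePdf are hypothetical, and the check is deliberately strict (signature at byte 0), which may be stricter than what a given PDF reader accepts.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PdfSignature {

    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** True if the given leading bytes begin with "%PDF-". */
    static boolean startsWithPdfHeader(byte[] head) {
        if (head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads only the first few bytes of the file and checks the signature. */
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[MAGIC.length];
            int total = 0;
            while (total < head.length) {      // read() may return fewer bytes than requested
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;                     // file is shorter than the signature
                }
                total += n;
            }
            return total == head.length && startsWithPdfHeader(head);
        }
    }
}
```

Files for which looksLikePdf returns false can be skipped or reported instead of being passed to the merge step.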

Answer 3

Answered by murphy1310

In my case, I was iterating through the files in a directory.
Windows can place a Thumbs.db file in a directory.
This file was being picked up along with the PDFs and caused the error.
Applying a filter to only pick PDF files (*.pdf) helped.
Cheers.
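Such a filter could look like the following sketch, which uses only java.io.File. The class and method names (PdfFileFilter, hasPdfExtension, listPdfFiles) are hypothetical, and matching the extension case-insensitively is an assumption beyond the plain *.pdf pattern mentioned above.

```java
import java.io.File;
import java.util.Locale;

final class PdfFileFilter {

    /** True for names ending in ".pdf", ignoring case. */
    static boolean hasPdfExtension(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    /** Lists only regular files with a .pdf extension; Thumbs.db and similar files are skipped. */
    static File[] listPdfFiles(File dir) {
        File[] files = dir.listFiles(
                (parent, name) -> hasPdfExtension(name) && new File(parent, name).isFile());
        return files != null ? files : new File[0];   // listFiles returns null if dir is not a directory
    }
}
```

The loop that previously iterated over every file in the directory would then iterate over listPdfFiles(dir) instead.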
