Java: How to determine whether a file is a PDF file?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/941813/

Date: 2020-08-11 21:20:16  Source: igfitidea

How can I determine if a file is a PDF file?

Tags: java, validation, pdf, text

Asked by

I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to check if the provided file is indeed a valid PDF?

Accepted answer by Persimmonium

You can find out the MIME type of a file (or byte array), so you don't have to rely blindly on the extension. I do it with Aperture's MimeExtractor (http://aperture.sourceforge.net/); a few days ago I also saw a library made just for that (http://sourceforge.net/projects/mime-util).

I use Aperture to extract text from a variety of files, not only PDFs, but I have to tweak things for PDFs, for example (Aperture uses PDFBox, but I added another library as a fallback for when PDFBox fails).

Answer by cagcowboy

PDF files begin with "%PDF" (open one in TextPad or a similar editor and take a look).

Any reason you can't just read the start of the file (for example with a FileInputStream; a StringReader reads from a String, not a file) and check for this?
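
A minimal sketch of that header check, using only the JDK. The class and method names (PdfHeaderCheck, hasPdfHeader) are made up for illustration, and the check only confirms that the first five bytes are "%PDF-"; it says nothing about whether the rest of the file is intact.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper; the class and method names are illustrative only.
final class PdfHeaderCheck {
    // The five ASCII bytes every PDF header starts with.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the array starts with the bytes "%PDF-". */
    static boolean hasPdfHeader(byte[] data) {
        return data != null
                && data.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(data, MAGIC.length), MAGIC);
    }

    /** Reads only the first five bytes of the file and checks them. */
    static boolean hasPdfHeader(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[MAGIC.length];
            int n = in.readNBytes(head, 0, head.length); // requires Java 9+
            return n == head.length && hasPdfHeader(head);
        }
    }
}
```

Calling PdfHeaderCheck.hasPdfHeader(Paths.get("input.pdf")) before handing the file to PDFTextStripper would filter out files that are obviously not PDFs; a file that passes can still be corrupted further in.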

Answer by NinjaCross

Here is what I use in my NUnit tests (C#), which must validate multiple versions of PDFs generated with Crystal Reports:

public static void CheckIsPDF(byte[] data)
    {
        Assert.IsNotNull(data);
        Assert.Greater(data.Length,7); // bytes 0..7 are read below

        // header 
        Assert.AreEqual(data[0],0x25); // %
        Assert.AreEqual(data[1],0x50); // P
        Assert.AreEqual(data[2],0x44); // D
        Assert.AreEqual(data[3],0x46); // F
        Assert.AreEqual(data[4],0x2D); // -

        if(data[5]==0x31 && data[6]==0x2E && data[7]==0x33) // version is 1.3 ?
        {                  
            // file terminator
            Assert.AreEqual(data[data.Length-7],0x25); // %
            Assert.AreEqual(data[data.Length-6],0x25); // %
            Assert.AreEqual(data[data.Length-5],0x45); // E
            Assert.AreEqual(data[data.Length-4],0x4F); // O
            Assert.AreEqual(data[data.Length-3],0x46); // F
            Assert.AreEqual(data[data.Length-2],0x20); // SPACE
            Assert.AreEqual(data[data.Length-1],0x0A); // EOL
            return;
        }

        if(data[5]==0x31 && data[6]==0x2E && data[7]==0x34) // version is 1.4 ?
        {
            // file terminator
            Assert.AreEqual(data[data.Length-6],0x25); // %
            Assert.AreEqual(data[data.Length-5],0x25); // %
            Assert.AreEqual(data[data.Length-4],0x45); // E
            Assert.AreEqual(data[data.Length-3],0x4F); // O
            Assert.AreEqual(data[data.Length-2],0x46); // F
            Assert.AreEqual(data[data.Length-1],0x0A); // EOL
            return;
        }

        Assert.Fail("Unsupported file format");
    }

Answer by cherouvim

Since you use PDFBox you can simply do:

PDDocument.load(file);

It'll fail with an Exception if the PDF is corrupted etc.

If it succeeds you can also check if the PDF is encrypted using .isEncrypted()

Answer by Sheel

Try this:

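import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

public boolean isPDF(File file) throws FileNotFoundException {
    // Scan the given file line by line for the PDF header marker.
    // try-with-resources closes the Scanner (and the underlying reader).
    try (Scanner input = new Scanner(new FileReader(file))) {
        while (input.hasNextLine()) {
            final String checkline = input.nextLine();
            if (checkline.contains("%PDF-")) {
                // a match!
                return true;
            }
        }
    }
    return false;
}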

Answer by Andrei Solntsev

There is a very convenient and simple library for testing PDF content: https://github.com/codeborne/pdf-test

The API is very simple:

import java.io.File;
import org.junit.Test;
import com.codeborne.pdftest.PDF;
import static com.codeborne.pdftest.PDF.*;
import static org.junit.Assert.assertThat;

public class PDFContainsTextTest {
  @Test
  public void canAssertThatPdfContainsText() {
    PDF pdf = new PDF(new File("src/test/resources/50quickideas.pdf"));
    assertThat(pdf, containsText("50 Quick Ideas to Improve your User Stories"));
  }
}

Answer by Roger Keays

Here is an adapted Java version of NinjaCross's code.

/**
 * Test if the data in the given byte array represents a PDF file.
 */
public static boolean is_pdf(byte[] data) {
    if (data != null && data.length > 7 && // bytes 0..7 are read below
            data[0] == 0x25 && // %
            data[1] == 0x50 && // P
            data[2] == 0x44 && // D
            data[3] == 0x46 && // F
            data[4] == 0x2D) { // -

        // version 1.3 file terminator
        if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33 &&
                data[data.length - 7] == 0x25 && // %
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x45 && // E
                data[data.length - 4] == 0x4F && // O
                data[data.length - 3] == 0x46 && // F
                data[data.length - 2] == 0x20 && // SPACE
                data[data.length - 1] == 0x0A) { // EOL
            return true;
        }

        // version 1.4 file terminator
        if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34 &&
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x25 && // %
                data[data.length - 4] == 0x45 && // E
                data[data.length - 3] == 0x4F && // O
                data[data.length - 2] == 0x46 && // F
                data[data.length - 1] == 0x0A) { // EOL
            return true;
        }
    }
    return false;
}

And some simple unit tests:

@Test
public void test_valid_pdf_1_3_data_is_pdf() {
    assertTrue(is_pdf("%PDF-1.3 CONTENT %%EOF \n".getBytes()));
}

@Test
public void test_valid_pdf_1_4_data_is_pdf() {
    assertTrue(is_pdf("%PDF-1.4 CONTENT %%EOF\n".getBytes()));
}

@Test
public void test_invalid_data_is_not_pdf() {
    assertFalse(is_pdf("Hello World".getBytes()));
}

If you come up with any failing unit tests, please let me know.

Answer by skashyap

Maybe I am too late to answer, but you should have a look at Tika. It uses the PDFBox parser internally to parse PDFs.

You just need to import tika-app-latest*.jar

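import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.xml.sax.SAXException;

public String parseToStringExample() throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
        return tika.parseToString(stream); // This should return the PDF's text
    }
}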

It would be a much cleaner solution. You can refer here for more details on Tika usage: https://tika.apache.org/1.12/api/

Answer by arjun kumar

I was using some of the suggestions I found here and on other sites/posts for determining whether a pdf was valid or not. I purposely corrupted a pdf file, and unfortunately, many of the solutions did not detect that the file was corrupted.

Eventually, after tinkering around with different methods in the API, I tried this:

PDDocument.load(file).getPage(0).getContents().toString();

This did not throw an exception, but it did output this:

 WARN  [COSParser:1154] The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 171, length: 1145844, expected end position: 1146015

Personally, I wanted an exception to be thrown if the file was corrupted so I could handle it myself, but it appeared that the API I was implementing already handled them in their own way.

To get around this, I decided to try parsing the files using the class that gave the warning (COSParser). I found that there was a subclass called PDFParser, which inherited a method called "setLenient", and that was the key (https://pdfbox.apache.org/docs/2.0.4/javadocs/org/apache/pdfbox/pdfparser/COSParser.html).

I then implemented the following:

RandomAccessFile accessFile = new RandomAccessFile(file, "r");
PDFParser parser = new PDFParser(accessFile);
parser.setLenient(false);
parser.parse();

This threw an Exception for my corrupted file, as I wanted. Hope this helps someone out!

Answer by Mohsen Abasi

The answer by Roger Keays is wrong, since not all PDF files are version 1.3 or 1.4, and not all of them are terminated by an EOL. The answer below works for all uncorrupted PDF files:

public static boolean is_pdf(byte[] data) {
    if (data != null && data.length > 6 // the trailer checks read data[data.length - 7]
            && data[0] == 0x25 && // %
            data[1] == 0x50 && // P
            data[2] == 0x44 && // D
            data[3] == 0x46 && // F
            data[4] == 0x2D) { // -

        // terminator "%%EOF " (with trailing space) followed by one more byte, e.g. EOL
        if (//data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33 &&
                data[data.length - 7] == 0x25 && // %
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x45 && // E
                data[data.length - 4] == 0x4F && // O
                data[data.length - 3] == 0x46 && // F
                data[data.length - 2] == 0x20 // SPACE
                //&& data[data.length - 1] == 0x0A// EOL
                ) {
            return true;
        }

        // terminator "%%EOF" followed by one more byte, e.g. EOL
        if (//data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34 &&
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x25 && // %
                data[data.length - 4] == 0x45 && // E
                data[data.length - 3] == 0x4F && // O
                data[data.length - 2] == 0x46 // F
                //&& data[data.length - 1] == 0x0A // EOL
                ) {
            return true;
        }
    }
    return false;
}
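
The fixed offsets above still assume exactly one byte (or a space plus one byte) after "%%EOF". A slightly more tolerant sketch is to require the "%PDF-" header at the start and then search for "%%EOF" anywhere in the last part of the data. The names PdfMarkerCheck and looksLikePdf, and the 1024-byte window, are assumptions made for illustration and are not taken from any of the answers. Like the versions above, this only checks markers and will not detect corruption in the body of the file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names and the window size are illustrative assumptions.
final class PdfMarkerCheck {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    // Assumption: only look for the end marker in the last 1024 bytes.
    private static final int TAIL_WINDOW = 1024;

    /** True if data starts with "%PDF-" and contains "%%EOF" near its end. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < HEADER.length + EOF_MARKER.length) {
            return false;
        }
        if (indexOf(data, HEADER, 0, HEADER.length) != 0) {
            return false; // header must be at offset 0
        }
        int from = Math.max(HEADER.length, data.length - TAIL_WINDOW);
        return indexOf(data, EOF_MARKER, from, data.length) >= 0;
    }

    /** First index in [from, to) at which needle fits and matches, or -1. */
    private static int indexOf(byte[] haystack, byte[] needle, int from, int to) {
        for (int i = from; i + needle.length <= to; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }
}
```

This accepts both of Roger Keays's test inputs as well as data that ends in "%%EOF" with no trailing newline or with "\r\n", while still rejecting input that lacks either marker.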