Java: How to get the font color using PDFBox

Question

Asked by Neeraj

I am trying to extract text with all of its information from a PDF using PDFBox. I got all the information I want except the color. I tried different ways to get the font color (including "Getting Text Colour with PDFBox"), but none of them worked. I have now copied code from PDFBox's PageDrawer class, but the RGB value is still not correct.

protected void processTextPosition(TextPosition text) {

    Composite com;
    Color col;
    switch (this.getGraphicsState().getTextState().getRenderingMode()) {
        case PDTextState.RENDERING_MODE_FILL_TEXT:
            com = this.getGraphicsState().getNonStrokeJavaComposite();
            int r = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRed();
            int g = this.getGraphicsState().getNonStrokingColor().getJavaColor().getGreen();
            int b = this.getGraphicsState().getNonStrokingColor().getJavaColor().getBlue();
            int rgb = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB();
            float[] cosp = this.getGraphicsState().getNonStrokingColor().getColorSpaceValue();
            PDColorSpace pd = this.getGraphicsState().getNonStrokingColor().getColorSpace();
            break;
        case PDTextState.RENDERING_MODE_STROKE_TEXT:
            System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
            System.out.println(this.getGraphicsState().getStrokingColor().getJavaColor().getRGB());
            break;
        case PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT:
            // basic support for text rendering mode "invisible"
            Color nsc = this.getGraphicsState().getStrokingColor().getJavaColor();
            float[] components = {Color.black.getRed(), Color.black.getGreen(), Color.black.getBlue()};
            Color c1 = new Color(nsc.getColorSpace(), components, 0f);
            System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
            break;
        default:
            System.out.println(this.getGraphicsState().getNonStrokeJavaComposite().toString());
            System.out.println(this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB());
    }
}

I am using the above code. The values I get are r = 0, g = 0, b = 0; the cosp array is [0.0]; inside the pd object, array = null and colorSpace = null; and the RGB value is always -16777216. Please help me. Thanks in advance.

Answer 1

Accepted answer by demongolem

I tried the code in the link you posted and it worked for me. The colors I get back are 148.92, 179.01001 and 214.965. I wish I could give you my PDF to work with; maybe if I store it externally to SO? My PDF used a sort of pale blue color, and that seems to match. It was just one page of text created in Word 2010 and exported, nothing too intense.

A couple of suggestions:

Recall that the value returned is a float between 0 and 1. If a value is accidentally cast to int, then of course nearly all the values will end up as 0. The linked code multiplies by 255 to get a range of 0 to 255.
As the commenter said, the most common color in a PDF file is black, which is 0 0 0.
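
The scaling and the cast pitfall described above can be sketched without PDFBox at all. This is only an illustration; the class and method names (ColorScale, to255, truncate, unpackArgb) are hypothetical and do not come from PDFBox. It also decodes the value -16777216 from the question: as a packed ARGB int, the form java.awt.Color.getRGB() returns, it is opaque black.

```java
public class ColorScale {

    // Scale a color component from the PDF range [0, 1] to [0, 255].
    static int to255(float component) {
        return Math.round(component * 255f);
    }

    // The pitfall: casting before scaling truncates every value below 1.0 to 0.
    static int truncate(float component) {
        return (int) component;
    }

    // Split a packed ARGB int into {alpha, red, green, blue}.
    static int[] unpackArgb(int argb) {
        return new int[] {
            (argb >>> 24) & 0xFF,
            (argb >>> 16) & 0xFF,
            (argb >>> 8) & 0xFF,
            argb & 0xFF
        };
    }

    public static void main(String[] args) {
        float[] components = {0.25f, 0.5f, 1.0f}; // made-up sample values
        for (float c : components) {
            System.out.println(c + " -> scaled " + to255(c) + ", cast " + truncate(c));
        }
        int[] argb = unpackArgb(-16777216); // 0xFF000000
        System.out.println("a=" + argb[0] + " r=" + argb[1] + " g=" + argb[2] + " b=" + argb[3]);
    }
}
```

So an RGB value of -16777216 together with r = g = b = 0 is consistent with black text, or with color operators that were never applied to the graphics state.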

That is all I can think of for now. Otherwise, I have version 1.7.1 of pdfbox and fontbox and, as I said, I pretty much followed the link you gave.

EDIT

Based upon my comments, here is perhaps a minimally invasive way of doing it for PDF files like color.pdf.

In PDFStreamEngine.java, in the processOperator method, one can do the following inside the try block:

if (operation.equals("RG")) {
   // stroking color, given as DeviceRGB components
   System.out.println(operation);
   System.out.println(arguments);
} else if (operation.equals("rg")) {
   // non-stroking (fill) color, given as DeviceRGB components
   System.out.println(operation);
   System.out.println(arguments);
} else if (operation.equals("BT")) {
   // begin text object
   System.out.println(operation);
} else if (operation.equals("ET")) {
   // end text object
   System.out.println(operation);
}

This will show you the information, then it is up to you to process the color information for each section according to your needs. Here is a snippet from the beginning of the output of the above code when run on color.pdf...

BT
rG [COSInt(1), COSInt(0), CosInt(0)]
RG [COSInt(1), COSInt(0), CosInt(0)]
ET
BT
ET
BT
rG [COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}]
RG [COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}]
ET
......

In the above output you can see an empty BT ET section; this is a section that is marked DEVICEGRAY. All the others give you values in [0,1] for the R, G and B components.
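
Turning such operands into 0 to 255 RGB values could look roughly like the sketch below. The operator names rg/RG, g/G and k/K are the standard PDF operators for DeviceRGB, DeviceGray and DeviceCMYK colors; everything else (the class OperandColors, the helper toRgb, and the naive CMYK formula) is an assumption made for illustration, and a color-managed conversion would give different CMYK results.

```java
public class OperandColors {

    // Clamp a component to [0, 1] and scale it to [0, 255].
    static int scale(float component) {
        float clamped = Math.max(0f, Math.min(1f, component));
        return Math.round(clamped * 255f);
    }

    // Hypothetical helper: map a color operator and its operands to {r, g, b}.
    static int[] toRgb(String operator, float[] operands) {
        switch (operator) {
            case "rg":
            case "RG": {
                // DeviceRGB: three components, used as-is.
                return new int[] {scale(operands[0]), scale(operands[1]), scale(operands[2])};
            }
            case "g":
            case "G": {
                // DeviceGray: one component, repeated for R, G and B.
                int gray = scale(operands[0]);
                return new int[] {gray, gray, gray};
            }
            case "k":
            case "K": {
                // DeviceCMYK: naive conversion, not color managed.
                float black = operands[3];
                return new int[] {
                    scale((1f - operands[0]) * (1f - black)),
                    scale((1f - operands[1]) * (1f - black)),
                    scale((1f - operands[2]) * (1f - black))
                };
            }
            default:
                throw new IllegalArgumentException("Unsupported color operator: " + operator);
        }
    }

    public static void main(String[] args) {
        int[] red = toRgb("rg", new float[] {1f, 0f, 0f});
        System.out.println(red[0] + " " + red[1] + " " + red[2]);
    }
}
```

Handling g/G as well as rg/RG matters for files like the one discussed here, where some text sections only carry a gray value.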

Answer 2

Answer by kiranbkrishna

I also ended up doing something like this. Pasting code below, hope it helps someone.

import java.awt.Color;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.ResourceLoader;
import org.apache.pdfbox.util.TextPosition;

// Written against the PDFBox 1.x API.
public class Parser extends PDFTextStripper {

    public Parser() throws IOException {
        // Use the PageDrawer operator set so that color operators update the graphics state.
        super(ResourceLoader.loadProperties(
                "org/apache/pdfbox/resources/PageDrawer.properties", true));
        super.setSortByPosition(true);
    }

    public void parse(String path) throws IOException {
        PDDocument doc = PDDocument.load(path);
        try {
            List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
            for (PDPage page : pages) {
                this.processStream(page, page.getResources(), page.getContents().getStream());
            }
        } finally {
            // Always release the document, even if processing fails.
            doc.close();
        }
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        try {
            // Fill (non-stroking) color in effect when this glyph is drawn.
            PDGraphicsState graphicsState = getGraphicsState();
            Color color = graphicsState.getNonStrokingColor().getJavaColor();
            System.out.println("R = " + color.getRed());
            System.out.println("G = " + color.getGreen());
            System.out.println("B = " + color.getBlue());
        } catch (IOException ioe) {
            System.err.println("Could not read color: " + ioe.getMessage());
        }
    }

    public static void main(String[] args) throws IOException {
        Parser p = new Parser();
        p.parse("/Users/apple/Desktop/123.pdf");
    }

}

Answer 3

Answer by Jubin Patel

I found some code in one of my maintenance programs.
I do not know whether it will work for you; please try it. Also check out this link: http://pdfbox.apache.org/apidocs/org/apache/pdfbox/pdmodel/common/class-use/PDStream.html

It may help you

PDDocument doc = null;
try {
    doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
    PDFStreamEngine engine = new PDFStreamEngine(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
    // Process the first page only.
    PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
    engine.processStream(page, page.findResources(), page.getContents().getStream());
    PDGraphicsState graphicState = engine.getGraphicsState();
    System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
    float[] colorSpaceValues = graphicState.getStrokingColor().getColorSpaceValue();
    for (float c : colorSpaceValues) {
        // Components are in [0, 1]; scale to [0, 255].
        System.out.println(c * 255);
    }
}
finally {
    if (doc != null) {
        doc.close();
    }
}
Answer 4

Answer by Robert Lill

With PDFBox version 2.0+, it is necessary to add these operators in the constructor of your PDFTextStripper subclass:

addOperator(new SetStrokingColorSpace());
addOperator(new SetNonStrokingColorSpace());
addOperator(new SetStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceRGBColor());
addOperator(new SetStrokingDeviceRGBColor());
addOperator(new SetNonStrokingDeviceGrayColor());
addOperator(new SetStrokingDeviceGrayColor());
addOperator(new SetStrokingColor());
addOperator(new SetStrokingColorN());
addOperator(new SetNonStrokingColor());
addOperator(new SetNonStrokingColorN());

Only then will getGraphicsState() return proper information.

See https://pdfbox.apache.org/2.0/migration.html
