How to get font color using pdfbox (Java)
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/10844271/
Asked by Neeraj
I am trying to extract text, with all of its information, from a PDF using pdfbox. I got all the information I want except the color. I tried different ways to get the font color (including "Getting Text Colour with PDFBox"), but none of them worked. I have now copied code from the PageDrawer class of pdfbox, but the RGB value is still not correct.
protected void processTextPosition(TextPosition text) {
    Composite com;
    Color col;
    switch (this.getGraphicsState().getTextState().getRenderingMode()) {
        case PDTextState.RENDERING_MODE_FILL_TEXT:
            com = this.getGraphicsState().getNonStrokeJavaComposite();
            int r = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRed();
            int g = this.getGraphicsState().getNonStrokingColor().getJavaColor().getGreen();
            int b = this.getGraphicsState().getNonStrokingColor().getJavaColor().getBlue();
            int rgb = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB();
            float[] cosp = this.getGraphicsState().getNonStrokingColor().getColorSpaceValue();
            PDColorSpace pd = this.getGraphicsState().getNonStrokingColor().getColorSpace();
            break;
        case PDTextState.RENDERING_MODE_STROKE_TEXT:
            System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
            System.out.println(this.getGraphicsState().getStrokingColor().getJavaColor().getRGB());
            break;
        case PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT:
            // basic support for text rendering mode "invisible"
            Color nsc = this.getGraphicsState().getStrokingColor().getJavaColor();
            float[] components = {Color.black.getRed(), Color.black.getGreen(), Color.black.getBlue()};
            Color c1 = new Color(nsc.getColorSpace(), components, 0f);
            System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
            break;
        default:
            System.out.println(this.getGraphicsState().getNonStrokeJavaComposite().toString());
            System.out.println(this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB());
    }
}
I am using the above code. The values I get are r = 0, g = 0, b = 0; the cosp array contains [0.0]; in the pd object, array = null and colorSpace = null; and the RGB value is always -16777216. Please help me. Thanks in advance.
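For readers puzzled by the value -16777216: java.awt.Color.getRGB() packs alpha, red, green and blue into a single 32-bit int, and 0xFF000000 (fully opaque black) prints as that negative number. So it agrees with r = g = b = 0 rather than indicating a separate error. The sketch below decodes such a value with plain bit operations; the class name PackedRgb is hypothetical and not part of pdfbox.

```java
// Hypothetical helper: decodes a packed ARGB int such as the one
// returned by java.awt.Color.getRGB().
final class PackedRgb {
    static int alpha(int argb) { return (argb >>> 24) & 0xFF; }
    static int red(int argb)   { return (argb >> 16) & 0xFF; }
    static int green(int argb) { return (argb >> 8) & 0xFF; }
    static int blue(int argb)  { return argb & 0xFF; }

    static String describe(int argb) {
        return String.format("a=%d r=%d g=%d b=%d",
                alpha(argb), red(argb), green(argb), blue(argb));
    }

    public static void main(String[] args) {
        // -16777216 == 0xFF000000: opaque black
        System.out.println(describe(-16777216)); // prints "a=255 r=0 g=0 b=0"
    }
}
```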
Accepted answer by demongolem
I tried the code in the link you posted and it worked for me. The colors I get back are 148.92, 179.01001 and 214.965. I wish I could give you my PDF to work with; maybe I could store it somewhere external to SO? My PDF used a sort of palish blue color, and that seems to match. It was just one page of text created in Word 2010 and exported, nothing too intense.
A couple of suggestions:
- Recall that the value returned is a float between 0 and 1. If a value is accidentally cast to int, then of course nearly all of the values will end up as 0. The linked code multiplies by 255 to get a range of 0 to 255.
- As the commenter said, the most common color in a PDF file is black, which is 0 0 0.
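The first suggestion can be illustrated without pdfbox at all. The sketch below scales [0, 1] components to 0..255 and shows what a plain int cast does instead. The class ComponentScale and its methods are hypothetical names, and treating a single-component array as gray is an assumption based on the [0.0] array the question reports.

```java
import java.util.Arrays;

// Hypothetical helper: scales colour components from [0, 1] to [0, 255].
final class ComponentScale {
    // Clamps to [0, 1], then scales and rounds to the nearest integer.
    static int to255(float component) {
        float clamped = Math.max(0f, Math.min(1f, component));
        return Math.round(clamped * 255f);
    }

    // Accepts one component (assumed gray) or three components (RGB).
    static int[] toRgb255(float[] components) {
        if (components.length == 1) {
            int v = to255(components[0]);
            return new int[] {v, v, v};
        }
        if (components.length == 3) {
            return new int[] {to255(components[0]), to255(components[1]), to255(components[2])};
        }
        throw new IllegalArgumentException("expected 1 or 3 components");
    }

    public static void main(String[] args) {
        float[] paleBlue = {0.584f, 0.702f, 0.843f};
        System.out.println(Arrays.toString(toRgb255(paleBlue)));          // [149, 179, 215]
        System.out.println((int) paleBlue[2]);                            // 0: the cast truncates
        System.out.println(Arrays.toString(toRgb255(new float[] {0f}))); // [0, 0, 0]
    }
}
```

The pale blue values are the answer's 148.92, 179.01001 and 214.965 divided by 255 and rounded to three decimals.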
That is all I can think of for now. Otherwise, I have version 1.7.1 of pdfbox and fontbox and, like I said, I pretty much followed the link you gave.
EDIT
Based upon my comments, here is perhaps a minimally invasive way of doing it for PDF files like color.pdf.
In the processOperator method of PDFStreamEngine.java, one can add the following inside the try block:
if (operation.equals("RG")) {
    // stroking color space
    System.out.println(operation);
    System.out.println(arguments);
} else if (operation.equals("rg")) {
    // non-stroking color space
    System.out.println(operation);
    System.out.println(arguments);
} else if (operation.equals("BT")) {
    System.out.println(operation);
} else if (operation.equals("ET")) {
    System.out.println(operation);
}
This prints the color operators and their operands; it is then up to you to process the color information for each section according to your needs. Here is a snippet from the beginning of the output of the above code when run on color.pdf:
BT
rG
[COSInt(1), COSInt(0), CosInt(0)]
RG
[COSInt(1), COSInt(0), CosInt(0)]
ET
BT
ET
BT
rG
[COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}]
RG
[COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}]
ET
......
In the above output you can see an empty BT ET section; this is a section that is marked DEVICEGRAY. All the others give you values in [0, 1] for the R, G and B components.
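One possible way to "process the color information" is to keep a small running state that is updated whenever a color operator is seen, and read it whenever text is shown. The sketch below does this for the non-stroking operators rg (DeviceRGB, three operands) and g (DeviceGray, one operand), starting from black, which is the initial color in a PDF graphics state. FillColorTracker and its methods are hypothetical names, the operands are assumed to have already been converted to floats, and other color spaces are ignored.

```java
import java.util.Arrays;

// Hypothetical tracker for the current non-stroking (fill) colour.
final class FillColorTracker {
    // The initial colour in a PDF graphics state is black.
    private float[] rgb = {0f, 0f, 0f};

    // Feeds one content-stream operator and its numeric operands.
    void accept(String operator, float... operands) {
        switch (operator) {
            case "rg": // non-stroking DeviceRGB
                if (operands.length == 3) {
                    rgb = operands.clone();
                }
                break;
            case "g": // non-stroking DeviceGray: same value on all channels
                if (operands.length == 1) {
                    rgb = new float[] {operands[0], operands[0], operands[0]};
                }
                break;
            default:
                // BT, ET and every other operator leave the colour unchanged
                break;
        }
    }

    // Current fill colour scaled to 0..255 per channel.
    int[] current255() {
        int[] out = new int[3];
        for (int i = 0; i < 3; i++) {
            out[i] = Math.round(Math.max(0f, Math.min(1f, rgb[i])) * 255f);
        }
        return out;
    }

    public static void main(String[] args) {
        FillColorTracker tracker = new FillColorTracker();
        tracker.accept("BT");
        tracker.accept("rg", 1f, 0f, 0f);
        System.out.println(Arrays.toString(tracker.current255())); // [255, 0, 0]
        tracker.accept("ET");
        tracker.accept("g", 0f);
        System.out.println(Arrays.toString(tracker.current255())); // [0, 0, 0]
    }
}
```

Handling the gray operator explicitly matters for sections like the empty BT ET pair above, where no rg operator is printed.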
Answer by kiranbkrishna
I also ended up doing something like this. I am pasting the code below; I hope it helps someone.
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.ResourceLoader;
import org.apache.pdfbox.util.TextPosition;

public class Parser extends PDFTextStripper {

    public Parser() throws IOException {
        // Load the PageDrawer operator set so that color operators are processed.
        super(ResourceLoader.loadProperties(
                "org/apache/pdfbox/resources/PageDrawer.properties", true));
        super.setSortByPosition(true);
    }

    public void parse(String path) throws IOException {
        PDDocument doc = PDDocument.load(path);
        try {
            List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
            for (PDPage page : pages) {
                this.processStream(page, page.getResources(), page.getContents().getStream());
            }
        } finally {
            doc.close();
        }
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        try {
            // Print the non-stroking (fill) color in effect for this piece of text.
            PDGraphicsState graphicsState = getGraphicsState();
            System.out.println("R = " + graphicsState.getNonStrokingColor().getJavaColor().getRed());
            System.out.println("G = " + graphicsState.getNonStrokingColor().getJavaColor().getGreen());
            System.out.println("B = " + graphicsState.getNonStrokingColor().getJavaColor().getBlue());
        } catch (IOException ioe) {
            System.err.println("Could not read color: " + ioe.getMessage());
        }
    }

    public static void main(String[] args) throws IOException, COSVisitorException {
        Parser p = new Parser();
        p.parse("/Users/apple/Desktop/123.pdf");
    }
}
Answer by Jubin Patel
I found some code in one of my maintenance programs. I do not know whether it will work for you, but please try it; it may help you.
Also check out this link: http://pdfbox.apache.org/apidocs/org/apache/pdfbox/pdmodel/common/class-use/PDStream.html
PDDocument doc = null;
try {
    doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
    PDFStreamEngine engine = new PDFStreamEngine(
            ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
    PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
    engine.processStream(page, page.findResources(), page.getContents().getStream());
    // The graphics state read here is the one left after the whole page has been processed.
    PDGraphicsState graphicState = engine.getGraphicsState();
    System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
    float[] colorSpaceValues = graphicState.getStrokingColor().getColorSpaceValue();
    for (float c : colorSpaceValues) {
        System.out.println(c * 255);
    }
} finally {
    if (doc != null) {
        doc.close();
    }
}
Answer by Robert Lill
With pdfbox version 2.0+, it is necessary to register these operators in the constructor of your overridden PDFTextStripper:
addOperator(new SetStrokingColorSpace());
addOperator(new SetNonStrokingColorSpace());
addOperator(new SetStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceRGBColor());
addOperator(new SetStrokingDeviceRGBColor());
addOperator(new SetNonStrokingDeviceGrayColor());
addOperator(new SetStrokingDeviceGrayColor());
addOperator(new SetStrokingColor());
addOperator(new SetStrokingColorN());
addOperator(new SetNonStrokingColor());
addOperator(new SetNonStrokingColorN());
Only then will getGraphicsState() return proper information.