Writing Russian text to a PDF with the Java PDFBox library

Question

Asked by Brad

I am using a Java library called PDFBox to write text to a PDF. It works perfectly for English text, but when I tried to write Russian text, the letters came out garbled. The problem seems to be in the font used, but I am not sure about that, so I hope someone can guide me through this. Here are the important lines of code:

PDTrueTypeFont font = PDTrueTypeFont.loadTTF( pdfFile, new File( "fonts/VREMACCI.TTF" ) );  // Windows Russian font imported to write the Russian text.
font.setEncoding( new WinAnsiEncoding() );  // Define the Encoding used in writing.
// Some code here to open the PDF & define a new page.
contentStream.drawString( "отделом компьютерной" ); // Write the Russian text.

The WinAnsiEncoding source code is : Click here

--------------------- Edit on 18 November 2009

After some investigation, I am now sure it is an encoding problem. It could be solved by defining my own encoding using the PDFBox class DictionaryEncoding.

I am not sure how to use it, but here is what I have tried so far:

COSDictionary cosDic = new COSDictionary();
cosDic.setString( COSName.getPDFName("Ercyrillic"), "0420 " ); // Russian letter.
font.setEncoding( new DictionaryEncoding( cosDic ) );

This does not work. It seems I am filling the dictionary the wrong way: when I write a PDF page using this, it appears blank.

The DictionaryEncoding source code is : Click here

Answer 1

Accepted answer by ivict

Try this construction:

PDFont font = PDType0Font.load( pdfFile, new File( "fonts/VREMACCI.TTF" ) );  // Windows Russian font imported to write the Russian text.
// Some code here to open the PDF & define a new page.
contentStream.beginText();
contentStream.setFont(font, 12);
contentStream.showText( "отделом компьютерной" ); // Write the Russian text.
contentStream.endText();

Answer 2

Answered by plinth

The long story is this: in order to do Unicode output in PDF from a TrueType font, the output must include a ton of detailed and seemingly superfluous information. What it comes down to is this: inside a TrueType font the glyphs are stored as glyph IDs. These glyph IDs are associated with a particular Unicode character (and IIRC, a Unicode glyph internally may refer to several code points, like é referring to e and an acute accent; my memory is hazy). PDF doesn't really have Unicode support other than to say that there exists a mapping from UTF-16BE values in a string to glyph IDs in a TrueType font, as well as a mapping from UTF-16BE values to Unicode, even if it's the identity.
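To make the "UTF-16BE values in a string" part concrete, here is a minimal sketch of how text could be turned into a PDF hex string under the scheme this answer describes, where the character codes are the UTF-16BE code units themselves. The class and method names are hypothetical and not part of PDFBox; other generators may write glyph IDs directly instead, so treat this only as an illustration of the byte layout.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a PDFBox class.
final class Utf16HexString {

    // Encodes the text as UTF-16BE (no byte order mark) and renders the
    // bytes as a PDF hex string literal, i.e. hex digits between < and >.
    static String toPdfHexString(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        // The first three letters of the question's sample text, written with
        // Unicode escapes so the source file encoding does not matter.
        System.out.println(toPdfHexString("\u043E\u0442\u0434")); // prints <043E04420434>
    }
}
```

Each Cyrillic letter takes two bytes, which is why a one-byte encoding such as WinAnsiEncoding has no way to address these characters.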

The output needs a Font dictionary of Subtype Type0 with:
- a DescendantFonts array with an entry described below
- a ToUnicode entry that maps UTF-16BE values to Unicode
- an Encoding set to Identity-H

Output from one of my unit tests on my own tools looks like this:

13 0 obj
<< 
   /BaseFont /DejaVuSansCondensed 
   /DescendantFonts [ 4 0 R  ]   
   /ToUnicode 14 0 R 
   /Type /Font 
   /Subtype /Type0 
   /Encoding /Identity-H 
>> endobj

14 0 obj
<< /Length 346 >> stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def /CMapName /Adobe-Identity-UCS
def /CMapType 2 def 1 begincodespacerange <0000> <FFFF> endcodespacerange 1
beginbfrange <0000> <FFFF> <0000> endbfrange endcmap CMapName currentdict /CMap
defineresource pop end end

endstream % note that the formatting is wrong for the stream
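As a rough sketch, the ToUnicode stream above could be assembled in code and its byte count used for the /Length entry. The class and method names here are hypothetical, and the line breaks are my own choice since the answer notes its formatting was altered. The CMap text is copied from the dump above; whether every viewer accepts a single bfrange covering 0000 to FFFF is not something this sketch verifies.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a PDFBox class.
final class IdentityToUnicodeCMap {

    // Returns the text of an identity ToUnicode CMap, one operator group per line.
    static String build() {
        return String.join("\n",
            "/CIDInit /ProcSet findresource begin",
            "12 dict begin",
            "begincmap",
            "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
            "/CMapName /Adobe-Identity-UCS def",
            "/CMapType 2 def",
            "1 begincodespacerange",
            "<0000> <FFFF>",
            "endcodespacerange",
            "1 beginbfrange",
            "<0000> <FFFF> <0000>",
            "endbfrange",
            "endcmap",
            "CMapName currentdict /CMap defineresource pop",
            "end",
            "end",
            "");
    }

    // Number of bytes the CMap occupies, i.e. the value for the stream's /Length.
    static int streamLength(String cmap) {
        return cmap.getBytes(StandardCharsets.US_ASCII).length;
    }

    public static void main(String[] args) {
        String cmap = build();
        System.out.println("<< /Length " + streamLength(cmap) + " >> stream");
        System.out.print(cmap);
        System.out.println("endstream");
    }
}
```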

It also needs a Font dictionary of Subtype CIDFontType2 with:
- a CIDSystemInfo
- a FontDescriptor
- DW and W
- a CIDToGIDMap that maps from character ID to glyph ID
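When the CIDToGIDMap is a stream rather than the name /Identity, the PDF specification lays it out as a flat table: the glyph ID for CID c is a two-byte value stored at bytes 2c and 2c + 1, high byte first. The following is a minimal sketch of building that table. The class name and the glyph ID used in main are made up for illustration; real values would come from the font's own cmap table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a PDFBox class.
final class CidToGidMapSketch {

    // Builds the raw bytes of a CIDToGIDMap stream covering CIDs 0..cidCount-1.
    // CIDs without an entry keep glyph ID 0, which is the font's .notdef glyph.
    static byte[] build(Map<Integer, Integer> cidToGid, int cidCount) {
        byte[] data = new byte[cidCount * 2];
        for (Map.Entry<Integer, Integer> e : cidToGid.entrySet()) {
            int cid = e.getKey();
            int gid = e.getValue();
            if (cid < 0 || cid >= cidCount || gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("CID or GID out of range: " + cid + " -> " + gid);
            }
            data[2 * cid] = (byte) (gid >> 8);   // high-order byte
            data[2 * cid + 1] = (byte) gid;      // low-order byte
        }
        return data;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = new LinkedHashMap<>();
        map.put(0x043E, 601); // Cyrillic small o; the glyph ID 601 is an invented example
        byte[] stream = build(map, 0x10000);
        System.out.println("CIDToGIDMap stream length: " + stream.length);
    }
}
```

With CIDs equal to UTF-16 code units, as in this answer, the table covers 65536 entries and is 131072 bytes before compression.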

Here's the one from the same test - this is the object in the DescendantFonts array:

4 0 obj
<< 
   /Subtype /CIDFontType2 
   /Type /Font 
   /BaseFont /DejaVuSansCondensed 
   /CIDSystemInfo 8 0 R 
   /FontDescriptor 9 0 R 
   /DW 1000 
   /W 10 0 R 
   /CIDToGIDMap 11 0 R 
>>

8 0 obj
<< 
   /Registry (Adobe)
   /Ordering (UCS)
   /Supplement 0 
>>
endobj

Why am I telling you this? What does it have to do with PDFBox? Just this: Unicode output in PDF is, frankly, a royal pain in the butt. Acrobat was developed before there was Unicode and it was painful from the start to have CJK encodings without Unicode (I know - I worked on Acrobat then). Later Unicode support was added, but it really felt like it was glommed on. One would hope that you would just say /Encoding /Unicode and have strings that start with the thorn and y-dieresis characters and off you go. No such luck. If you don't put in every detailed thing (and really, Acrobat, embedding a PostScript program to translate to Unicode? WTH?), you get a blank page in Acrobat. I swear, I am not making this up.

At this point, I write PDF generation tools for a separate company (.NET right now, so it won't help you), and I made it a design goal to hide all that nonsense. All text is Unicode: if you only use those character codes that are the same as WinAnsi, that's what you get under the hood. Use anything else, and you get all this other stuff with it. I'd be surprised if PDFBox does that work for you; it is a serious hassle.

Answer 3

Answered by George Tsamis

The solution is very simple.

1) Find a font that contains the characters you want to display.
2) Download the font's .ttf file locally.
3) Load the font from your application.

For example, this is what you have to do if you want to use Greek characters:

// Open a content stream on the page, load a font that covers Greek, and select it.
content = new PDPageContentStream(document, page);
pdfFont = PDType0Font.load( document, new File( "arialuni.ttf" ) );
content.setFont(pdfFont, fontSize);

Answer 4

Answered by PhiLho

Perhaps a Russian encoding class needs to be written; it should look like the WinAnsiEncoding one, I suppose.
Now, I have no idea what to put there!

Or, if that's not what you do already, perhaps you should encode your source file in UTF-8 and use a default encoding.
I saw some messages related to issues with extracting Russian text from existing PDF files (using PDFBox of course) but I don't know if output is related.
You can also write to the PDFBox mailing list.

Answer 5

Answered by Kevin Day

Testing whether this is an encoding issue should be pretty easy to do (just switch to UTF16 encoding).

I'm assuming that you've tried using an editor or something with the VREMACCI font and confirmed that it displays the way you expect it to?

You might want to try doing the same thing in iText just to get a feel for whether the issue is related to the PdfBox library itself... If your primary goal is to generate PDF files, iText might be a better solution anyway.

EDIT - long answer to comments:

OK, sorry for the back and forth on the encoding question. Your core issue (which you probably already knew) is that the encoding of the bytes being written to the content stream is different from the encoding being used to look up glyphs. Now I'll try to actually be helpful:
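As a side illustration of that mismatch (not part of the original answer), the sketch below checks whether a string fits into a one-byte Western encoding at all. It uses Java's windows-1252 charset as a stand-in for PDF's WinAnsiEncoding, which is an assumption: the two are closely related but not guaranteed to be identical, and the sketch also assumes the JRE ships that charset. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper for illustration only; not a PDFBox class.
final class WinAnsiCheck {

    // Returns true if every character of the text has a code in windows-1252,
    // used here as an approximation of PDF's WinAnsiEncoding.
    static boolean fitsWinAnsi(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsWinAnsi("PDFBox"));             // Latin text
        System.out.println(fitsWinAnsi("\u043E\u0442\u0434")); // Cyrillic text
    }
}
```

Latin text passes the check and Cyrillic text does not, which is consistent with the question's symptom: a font set to WinAnsiEncoding has no byte values through which Russian glyphs could be looked up.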

I took a look at the dictionary encoding class in PdfBox, and it looks quite unintuitive... The 'dictionary' in question is a PDF dictionary. So what you'll basically need to do is create a Pdf dictionary object (I think that PdfBox calls this a type of COSObject), then add entries to it.

The encoding for a font is defined in PDF as a dictionary (see page 266 of the above spec). The dictionary contains a base encoding name, plus an optional differences array. Technically, the differences array should not be used with TrueType fonts (although I've seen it used in some cases; don't use it, though).

You will then specify an entry for the cmap for the encoding. This cmap will be the encoding of your font.

My suggestion here is to take an existing PDF that does what you want, then get a dump of the dictionary structure for the font so you can see what it looks like.

This is definitely not for the faint of heart. I can provide some help - if you need a dictionary dump, shoot me a hyperlink with a sample PDF and I'll run it through some of the algorithms I use in my iText development (I'm the maintainer of the iText text extraction sub-system).

EDIT - 11/17/09

OK - here's the dictionary dump from the russian.pdf file (sub-dictionaries are listed indented, and in the order they appeared in the containing dictionary):

(/CropBox=[0, 0, 595, 842], /Parent=Dictionary of type: /Pages, /Type=/Page, /Contents=[209 0 R, 210 0 R, 211 0 R, 214 0 R, 215 0 R, 216 0 R, 222 0 R, 223 0 R], /Resources=Dictionary, /MediaBox=[0, 0, 595, 842], /StructParents=0, /Rotate=0)
    Subdictionary /Parent = (/Type=/Pages, /Count=6, /Kids=[195 0 R, 1 0 R, 3 0 R, 5 0 R, 7 0 R, 9 0 R])
    Subdictionary /Resources = (/ExtGState=Dictionary, /ProcSet=[/PDF, /Text], /ColorSpace=Dictionary, /Font=Dictionary, /Properties=Dictionary)
        Subdictionary /ExtGState = (/GS0=Dictionary of type: /ExtGState)
            Subdictionary /GS0 = (/OPM=1, /op=false, /Type=/ExtGState, /SA=false, /OP=false, /SM=0.02)
        Subdictionary /ColorSpace = (/CS0=[/ICCBased, 228 0 R])
        Subdictionary /Font = (/C2_1=Dictionary of type: /Font, /C2_2=Dictionary of type: /Font, /C2_3=Dictionary of type: /Font, /C2_4=Dictionary of type: /Font, /TT2=Dictionary of type: /Font, /TT1=Dictionary of type: /Font, /TT0=Dictionary of type: /Font, /C2_0=Dictionary of type: /Font, /TT3=Dictionary of type: /Font)
            Subdictionary /C2_1 = (/DescendantFonts=[243 0 R], /BaseFont=/LDMIEC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /C2_2 = (/DescendantFonts=[233 0 R], /BaseFont=/LDMIBO+TimesNewRomanPSMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /C2_3 = (/DescendantFonts=[224 0 R], /BaseFont=/LDMIHD+TimesNewRomanPS-ItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /C2_4 = (/DescendantFonts=[229 0 R], /BaseFont=/LDMIDA+Tahoma, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /TT2 = (/LastChar=58, /BaseFont=/LDMIFC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 333], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=136, /Descent=-216, /FontWeight=700, /FontBBox=[-558, -307, 2000, 1026], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMIFC+TimesNewRomanPS-BoldMT, /Ascent=891, /ItalicAngle=0)
            Subdictionary /TT1 = (/LastChar=187, /BaseFont=/LDMICP+TimesNewRomanPSMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 833, 778, 0, 333, 333, 0, 0, 250, 333, 250, 278, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 278, 278, 0, 564, 0, 444, 0, 722, 667, 667, 722, 611, 556, 0, 722, 333, 389, 0, 611, 889, 722, 722, 556, 0, 667, 556, 611, 0, 722, 944, 0, 722, 0, 333, 0, 333, 0, 500, 0, 444, 500, 444, 500, 444, 333, 500, 500, 278, 0, 500, 278, 778, 500, 500, 500, 0, 333, 389, 278, 500, 500, 722, 0, 500, 444, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=82, /Descent=-216, /FontWeight=400, /FontBBox=[-568, -307, 2000, 1007], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMICP+TimesNewRomanPSMT, /Ascent=891, /ItalicAngle=0)
            Subdictionary /TT0 = (/LastChar=55, /BaseFont=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 500, 500, 500, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=116.867004, /Descent=-216, /FontWeight=700, /FontBBox=[-547, -307, 1206, 1032], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=98, /XHeight=468, /FontFamily=Times New Roman, /FontName=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Ascent=891, /ItalicAngle=-15)
            Subdictionary /C2_0 = (/DescendantFonts=[238 0 R], /BaseFont=/LDMHPN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
            Subdictionary /TT3 = (/LastChar=169, /BaseFont=/LDMIEB+Tahoma, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[313, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 546, 0, 546, 0, 0, 546, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 929], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
                Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=92, /Descent=-206, /FontWeight=400, /FontBBox=[-600, -208, 1338, 1034], /CapHeight=734, /FontFile2=Stream, /FontStretch=/Normal, /Flags=32, /XHeight=546, /FontFamily=Tahoma, /FontName=/LDMIEB+Tahoma, /Ascent=1000, /ItalicAngle=0)
        Subdictionary /Properties = (/MC0=Dictionary of type: /OCMD)
            Subdictionary /MC0 = (/Type=/OCMD, /OCGs=Dictionary of type: /OCG)
                Subdictionary /OCGs = (/Usage=Dictionary, /Type=/OCG, /Name=HeaderFooter)
                    Subdictionary /Usage = (/CreatorInfo=Dictionary, /PageElement=Dictionary)
                        Subdictionary /CreatorInfo = (/Creator=Acrobat PDFMaker 6.0 ??? Word)
                        Subdictionary /PageElement = (/SubType=/HF)

There are a lot of moving parts here. You might want to put together a test document that has only 3 or 4 characters in the font in question. There are a lot of Type0 fonts being used here (in addition to the TT fonts), so it's hard to tell what is involved in your particular issue.

(Are you sure you don't want to at least try this with iText? ;-) I'm not saying that it'll work, just that it might be worth a shot ).

For reference, the above dictionary dump was obtained using the com.lowagie.text.pdf.parser.PdfContentReaderTool class

Answer 6

Answered by daNIL

Just try this one:

Phrase leftTitle = new Phrase("САНКТ-ПЕТЕРБУРГ", FontFactory.getFont("Tahoma", "Cp1251", true, 25));

This works at least with the latest iText (5.0.1).
