C#: Display Unicode characters when converting HTML to PDF

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/10329863/

Time: 2020-08-09 13:25:13  Source: igfitidea

Display Unicode characters in converting Html to Pdf

c# unicode itext

Asked by NIlesh Lanke

I am using the iTextSharp DLL to convert HTML to PDF.

The HTML contains some Unicode characters such as α and β. When I convert the HTML to PDF, these characters do not appear in the PDF.

My function:

Document doc = new Document(PageSize.LETTER);

using (FileStream fs = new FileStream(Path.Combine("Test.pdf"), FileMode.Create, FileAccess.Write, FileShare.Read))
{
    PdfWriter.GetInstance(doc, fs);

    doc.Open();
    doc.NewPage();

    string arialuniTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts),
                                      "ARIALUNI.TTF");

    BaseFont bf = BaseFont.CreateFont(arialuniTff, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

    Font fontNormal = new Font(bf, 12, Font.NORMAL);

    List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()),
                                                 new StyleSheet());
    Paragraph p = new Paragraph {Font = fontNormal};

    foreach (var element in list)
    {
        p.Add(element);
        doc.Add(p);
    }

    doc.Close();
}

Accepted answer by Chris Haas

When dealing with Unicode characters in iTextSharp, there are a couple of things you need to take care of. The first, which you have already done, is getting a font that supports your characters. The second is registering the font with iTextSharp so that it is aware of it.

//Path to our font
string arialuniTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.TTF");
//Register the font with iTextSharp
iTextSharp.text.FontFactory.Register(arialuniTff);
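
The snippet above assumes ARIALUNI.TTF is present in the system Fonts folder, which may not be true on every machine. As a minimal sketch, the helper below checks a list of candidate file names and returns the first one that exists. The class and method names (FontLocator, FindFirstInstalled) are hypothetical and not part of iTextSharp; only standard .NET file APIs are used.

```csharp
using System;
using System.IO;

static class FontLocator
{
    // Hypothetical helper (not an iTextSharp API): returns the full path of the
    // first candidate file found in the system Fonts folder, or null if none exists.
    public static string FindFirstInstalled(params string[] fileNames)
    {
        string fontsDir = Environment.GetFolderPath(Environment.SpecialFolder.Fonts);
        foreach (string name in fileNames)
        {
            string candidate = Path.Combine(fontsDir, name);
            if (File.Exists(candidate))
                return candidate;
        }
        return null; // caller decides how to handle a missing font
    }
}
```

A caller could then pass the result to iTextSharp.text.FontFactory.Register only when it is not null, and report a clear error otherwise.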

Now that we have a font, we need to create a StyleSheet object that tells iTextSharp when and how to use it.

//Create a new stylesheet
iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
//Set the default body font to our registered font's internal name
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Arial Unicode MS");

The one non-HTML part that you also need to do is set a special encoding parameter. This encoding is specific to iTextSharp, and in your case you want it to be Identity-H. If you don't set it, it defaults to Cp1252 (WINANSI).

//Set the default encoding to support Unicode characters
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);
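
To see why the Cp1252 default is a problem for characters like α and β, the sketch below asks .NET whether a string fits in Windows-1252. It uses .NET's own code page 1252, not iTextSharp's internals, so treat it only as an illustration of the character set's limits. The class name Cp1252Check is hypothetical, and the code assumes .NET Core 3.0+ or .NET 5+, where the code page provider has to be registered first.

```csharp
using System;
using System.Text;

static class Cp1252Check
{
    // Returns true if every character of 'text' can be encoded in Windows-1252.
    public static bool FitsInCp1252(string text)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where code page 1252 must be
        // enabled via this provider. On .NET Framework the code page is built in
        // and this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding strict = Encoding.GetEncoding(
            1252,
            EncoderFallback.ExceptionFallback,  // throw instead of substituting a replacement
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false; // at least one character has no Cp1252 representation
        }
    }
}
```

Calling Cp1252Check.FitsInCp1252 on the sample HTML from this answer is a quick way to tell whether a document needs the Identity-H setting at all.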

Lastly, we need to pass our stylesheet to the ParseToList method:

//Parse our HTML using the stylesheet created above
List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()), ST);

Putting that all together, from open to close you'd have:

doc.Open();

//Sample HTML
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.Append(@"<p>This is a test: <strong>α,β</strong></p>");

//Path to our font
string arialuniTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.TTF");
//Register the font with iTextSharp
iTextSharp.text.FontFactory.Register(arialuniTff);

//Create a new stylesheet
iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
//Set the default body font to our registered font's internal name
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Arial Unicode MS");
//Set the default encoding to support Unicode characters
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);

//Parse our HTML using the stylesheet created above
List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()), ST);

//Loop through each element, don't bother wrapping in P tags
foreach (var element in list) {
    doc.Add(element);
}

doc.Close();

EDIT

In your comment you show HTML that specifies an override font. iTextSharp does not spider the system for fonts, and its HTML parser doesn't use font fallback techniques. Any fonts specified in HTML/CSS must be manually registered.

string lucidaTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "l_10646.ttf");
iTextSharp.text.FontFactory.Register(lucidaTff);
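
If the HTML references several fonts, registering them one by one gets tedious. A rough sketch of an alternative is to register the whole Fonts folder and then check that each face name used in the HTML is known. This assumes iTextSharp 5.x, where FontFactory.RegisterDirectory and FontFactory.IsRegistered are available; the class name HtmlFontSetup and the faceNames parameter are hypothetical. Scanning a large font folder can be slow, so this is better done once at startup.

```csharp
using System;
using iTextSharp.text;

static class HtmlFontSetup
{
    // Registers every font file in the system Fonts folder, then warns about
    // face names that iTextSharp still does not know.
    // 'faceNames' is a hypothetical input: the font names your HTML/CSS uses.
    public static void RegisterSystemFonts(params string[] faceNames)
    {
        string fontsDir = Environment.GetFolderPath(Environment.SpecialFolder.Fonts);

        // RegisterDirectory returns the number of font files it registered.
        int count = FontFactory.RegisterDirectory(fontsDir);
        Console.WriteLine("Registered {0} font files from {1}", count, fontsDir);

        foreach (string face in faceNames)
        {
            if (!FontFactory.IsRegistered(face))
                Console.WriteLine("Warning: '{0}' is not registered.", face);
        }
    }
}
```

For example, HtmlFontSetup.RegisterSystemFonts("Arial Unicode MS", "Lucida Sans Unicode") would flag either face if its font file is missing from the machine.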

Answer by Gregor Slavec

You can also use the new XMLWorkerHelper (from the itextsharp.xmlworker library); however, you need to override the default FontFactory implementation.

void GeneratePdfFromHtml()
{
  const string outputFilename = @".\Files\report.pdf";
  const string inputFilename = @".\Files\report.html";

  using (var input = new FileStream(inputFilename, FileMode.Open))
  using (var output = new FileStream(outputFilename, FileMode.Create))
  {
    CreatePdf(input, output);
  }
}

void CreatePdf(Stream htmlInput, Stream pdfOutput)
{
  using (var document = new Document(PageSize.A4, 30, 30, 30, 30))
  {
    var writer = PdfWriter.GetInstance(document, pdfOutput);
    var worker = XMLWorkerHelper.GetInstance();

    document.Open();
    worker.ParseXHtml(writer, document, htmlInput, null, Encoding.UTF8, new UnicodeFontFactory());

    document.Close();
  }    
}

public class UnicodeFontFactory : FontFactoryImp
{
    private static readonly string FontPath = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts),
      "arialuni.ttf");

    private readonly BaseFont _baseFont;

    public UnicodeFontFactory()
    {
      _baseFont = BaseFont.CreateFont(FontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

    }

    public override Font GetFont(string fontname, string encoding, bool embedded, float size, int style, BaseColor color,
      bool cached)
    {
      return new Font(_baseFont, size, style, color);
    }
}

Answer by Code Scratcher

Here are the steps to display Unicode characters when converting HTML to PDF:

  1. Create an HTMLWorker
  2. Register a Unicode font and assign it
  3. Create a style sheet and set the encoding to Identity-H
  4. Assign the style sheet to the HTML parser

Hindi, Turkish, and special characters are also displayed when converting from HTML to PDF using this method.

Answer by Milan Hettner

private class UnicodeFontFactory : FontFactoryImp
{
    private BaseFont _baseFont;

    public  UnicodeFontFactory()
    {
        string FontPath = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "arialuni.ttf");
        _baseFont = BaseFont.CreateFont(FontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);                
    }

    public override Font GetFont(string fontname, string encoding, bool embedded, float size, int style, BaseColor color, bool cached)
    {                                
        return new Font(_baseFont, size, style, color);
    }
}  

// and the calling code

FontFactory.FontImp = new UnicodeFontFactory();

string convertedHtml = string.Empty;
foreach (char c in htmlText)
{
     if (c < 127)  
           convertedHtml += c;
     else
           convertedHtml += "&#" + (int)c + ";";
}

List<IElement> htmlElements = XMLWorkerHelper.ParseToElementList(convertedHtml, null);

// add the IElements to the document
foreach (IElement htmlElement in htmlElements)
{                            
      document.Add(htmlElement);
}
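
The conversion loop above handles each UTF-16 code unit separately, so a character outside the Basic Multilingual Plane (an emoji, for instance) is written as two surrogate references rather than one. The following is a minimal sketch of a variant that combines surrogate pairs and uses a StringBuilder instead of repeated string concatenation. The class and method names (HtmlEntityEncoder, EncodeNonAscii) are hypothetical, and whether the embedded font actually contains such characters is a separate question.

```csharp
using System;
using System.Text;

static class HtmlEntityEncoder
{
    // Replaces every non-ASCII character with a decimal numeric character
    // reference. Surrogate pairs are merged into a single code point so that
    // characters outside the Basic Multilingual Plane yield one reference.
    public static string EncodeNonAscii(string html)
    {
        var sb = new StringBuilder(html.Length);
        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c < 127)
            {
                sb.Append(c); // same threshold as the loop above
            }
            else if (char.IsHighSurrogate(c) && i + 1 < html.Length
                     && char.IsLowSurrogate(html[i + 1]))
            {
                int codePoint = char.ConvertToUtf32(c, html[i + 1]);
                sb.Append("&#").Append(codePoint).Append(';');
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                sb.Append("&#").Append((int)c).Append(';');
            }
        }
        return sb.ToString();
    }
}
```

The result can be passed to XMLWorkerHelper.ParseToElementList in place of convertedHtml.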

Answer by Frank Thomas

This has to be one of the most difficult problems that I've had to figure out to date. The answers on the web, including those on Stack Overflow, have either poor or outdated information. The answer from Gregor is very close. I wanted to give back to this community because I spent many hours getting to this answer.

Here's a very simple program I wrote in C# as an example for my own notes.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;

namespace ExampleOfExportingPDF
{
    class Program
    {
        static void Main(string[] args)
        {
            //Build HTML document
            StringBuilder sb = new StringBuilder();
            sb.Append("<body>");
            sb.Append("<h1 style=\"text-align:center;\">これは日本語のテキストの例です。</h1>");
            sb.Append("</body>");

            //Create our document object
            Document Doc = new Document(PageSize.A4);


            //Create our file stream
            using (FileStream fs = new FileStream(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Test.pdf"), FileMode.Create, FileAccess.Write, FileShare.Read))
            {
                //Bind PDF writer to document and stream
                PdfWriter writer = PdfWriter.GetInstance(Doc, fs);

                //Open document for writing
                Doc.Open();


                //Add a page
                Doc.NewPage();

                MemoryStream msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(sb.ToString()));
                XMLWorkerHelper.GetInstance().ParseXHtml(writer, Doc, msHtml, null, Encoding.UTF8, new UnicodeFontFactory());

                //Close the PDF
                Doc.Close();
            }

        }

        public class UnicodeFontFactory : FontFactoryImp
        {
            private static readonly string FontPath = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts),
          "arialuni.ttf");

            private readonly BaseFont _baseFont;

            public UnicodeFontFactory()
            {
                _baseFont = BaseFont.CreateFont(FontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

            }

            public override Font GetFont(string fontname, string encoding, bool embedded, float size, int style, BaseColor color,
          bool cached)
            {
                return new Font(_baseFont, size, style, color);
            }
        }

    }
}

Hopefully this will save someone some time in the future.
