Reading PDF content with iTextSharp in C#

Question

Asked by Shahin

I use this code to read PDF content using iTextSharp. It works fine when the content is English, but it doesn't work when the content is Persian or Arabic.
Here is a sample non-English PDF for testing.
The result is something like this:

ù?ù?ù??§ ùù”?¨ù??·?? ??????ù?ù? ?2???§ ùù?ù?-ù” ù?ù?ù…?- ??ù”?¨ù??3 ?? Karl Seguin foppersian.codeplex.com www.codebetter.com 1 1 ùù”?¨ù??·?? ù?ù?ù??§ ??????ù?ù?
ù?ù…?§ù??±?¨ ù?ù??μ?§ ???3??ù?ù?  ù…?±ù? ?ˉ??ù?ù??a ?±?aù??¨ ?±?§?2ù?§

What is the solution?

        public string ReadPdfFile(string fileName)
        {
            StringBuilder text = new StringBuilder();

            if (File.Exists(fileName))
            {
                PdfReader pdfReader = new PdfReader(fileName);

                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
                    text.Append(currentText);
                    pdfReader.Close();
                }
            }
            return text.ToString();
        }

Answer 1

Accepted answer by Chris Haas

In .NET, once you have a string, you have a string, and it is Unicode, always. The in-memory representation happens to be UTF-16, but that doesn't matter here. Never, ever, ever decompose a string into bytes, reinterpret those bytes as a different encoding, and slap the result back into a string: that doesn't make sense and will almost always fail.

Your problem is this line:

    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));

I'm going to pull it apart into a couple of lines to illustrate:

    byte[] bytes = Encoding.UTF8.GetBytes("ی"); // U+06CC; bytes now holds 0xDB8C
    byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes); // converted now holds 0xC39BC592
    string final = Encoding.UTF8.GetString(converted); // final now holds ÛŒ

The code will mangle anything above the 127 ASCII barrier, that is, every non-ASCII character. Drop the re-encoding line and you should be good.

Side note: it is entirely possible that whatever creates a string does it incorrectly; that's actually not too uncommon. But you need to fix that problem before it becomes a string, at the byte level.
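
To illustrate the side note, here is a minimal sketch of fixing an encoding problem at the byte level. It assumes you have access to the raw bytes and know they are UTF-8. The class and method names (ByteLevelDecoding, Decode) are made up for this example and are not part of iTextSharp, and ISO-8859-1 merely stands in for a legacy single-byte default code page.

```csharp
using System;
using System.Text;

public static class ByteLevelDecoding
{
    // Hypothetical helper: turn raw bytes into a string exactly once,
    // using the encoding the bytes were actually written in.
    public static string Decode(byte[] raw, Encoding actualEncoding)
    {
        return actualEncoding.GetString(raw);
    }

    public static void Main()
    {
        // UTF-8 bytes for U+06CC (ARABIC LETTER FARSI YEH).
        byte[] raw = { 0xDB, 0x8C };

        // Right: decode once with the correct encoding.
        string good = Decode(raw, Encoding.UTF8);

        // Wrong: decode with a single-byte encoding (an assumption made
        // here to stand in for a legacy default code page).
        string bad = Decode(raw, Encoding.GetEncoding("iso-8859-1"));

        Console.WriteLine(good.Length); // 1: one character
        Console.WriteLine(bad.Length);  // 2: one character per byte
    }
}
```

Once the wrong decoding has produced a string, the safest repair is to go back to the original bytes and decode them again, rather than trying to convert the string.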

EDIT

The code should be exactly the same as yours above, except that the one re-encoding line is removed (pdfReader.Close() is also moved out of the loop, so the reader is closed only after all pages have been read). Also, whatever you're using to display the text, make sure that it supports Unicode. And, as @kuujinbo said, make sure that you're using a recent version of iTextSharp. I tested this with 5.2.0.0.
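
For the display step, here is a minimal sketch of two ways to keep the extracted text intact on output: writing it to a UTF-8 file, or switching the console to UTF-8. The UnicodeOutput class, the Save helper, and the file name extracted.txt are placeholders for this example, and the hard-coded string stands in for whatever ReadPdfFile returns. Whether Persian or Arabic glyphs actually render in a console also depends on its font.

```csharp
using System;
using System.IO;
using System.Text;

public static class UnicodeOutput
{
    // Hypothetical helper: persist extracted text as UTF-8 so that
    // non-ASCII characters survive the trip to disk.
    public static void Save(string path, string text)
    {
        File.WriteAllText(path, text, Encoding.UTF8);
    }

    public static void Main()
    {
        // Stand-in for the text returned by ReadPdfFile.
        string text = "\u0633\u0644\u0627\u0645";

        Save("extracted.txt", text);
        string readBack = File.ReadAllText("extracted.txt", Encoding.UTF8);
        Console.WriteLine(readBack == text); // True

        // Ask the console to emit UTF-8 instead of a legacy code page.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(text);
    }
}
```

Opening the saved file in a Unicode-aware editor is often the quickest way to tell an extraction problem apart from a display problem.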

    public string ReadPdfFile(string fileName) {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName)) {
            PdfReader pdfReader = new PdfReader(fileName);

            for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }

EDIT 2

The above code fixes the encoding issue but doesn't fix the order of the strings themselves. Unfortunately this problem appears to be at the PDF level itself.

Consequently, showing text in such right-to-left writing systems requires either positioning each glyph individually (which is tedious and costly) or representing text with show strings (see 9.2, “Organization and Use of Fonts”) whose character codes are given in reverse order.

PDF 2008 Spec - 14.8.2.3.3 - Reverse-Order Show Strings

When re-ordering strings as described above, content is (if I understand the spec correctly) supposed to use a "marked content" section, BMC. However, the few sample PDFs that I've looked at and generated don't appear to actually do this. I could absolutely be wrong on this part because this is very much not my specialty, so you'll have to poke around some more.
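
If the extracted text really does come back in reversed order, one crude workaround is to reverse each line after extraction. The sketch below assumes that every line consists entirely of right-to-left text stored back to front. It is not a bidirectional-text algorithm, and it will also reverse any Latin words or numbers on the same line. VisualOrderFix and ReverseLine are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class VisualOrderFix
{
    // Hypothetical helper: reverse a line by text element rather than
    // by char, so surrogate pairs in the input are not split apart.
    public static string ReverseLine(string line)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(line);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        elements.Reverse();
        return string.Concat(elements);
    }

    public static void Main()
    {
        // A four-letter Persian word with its letters stored back to front
        // (an assumed input for illustration).
        string extracted = "\u0645\u0627\u0644\u0633";
        string logical = ReverseLine(extracted);
        Console.WriteLine(logical == "\u0633\u0644\u0627\u0645"); // True
    }
}
```

Lines that mix scripts, such as the "Karl Seguin" and URL fragments in the sample output, would need real bidirectional handling instead of a blanket reversal.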
