How to extract text from a PDF file with IDENTITY-H fonts using VB.NET

Source: Stack Overflow, http://stackoverflow.com/questions/25331421/. Licensed under CC BY-SA 4.0; if you use or share it, you must attribute it to the original authors. Page dated 2020-09-17.

Tags: vb.net, pdf

Asked by WNET

I have a PDF file.

I am reading text from the PDF file programmatically using iTextSharp. It reads ANSI-encoded text, but it does not read IDENTITY-H-encoded text.

My question is: how can I read IDENTITY-H text from a PDF file using VB.NET?

Below is my code:

    Imports System.IO
    Imports System.Text
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser

    'Reads the text of every page of the PDF at strSource
    Public Function ReadPDFFile(ByVal strSource As String) As String
        Dim sbPDFText As New StringBuilder() 'StringBuilder object to store the text read

        If File.Exists(strSource) Then 'Does the file exist?
            Dim pdfFileReader As New PdfReader(strSource) 'Read the file
            For intCurrPage As Integer = 1 To pdfFileReader.NumberOfPages 'Loop through all pages
                'Read the PDF file content blocks with a fresh strategy for each page
                Dim lteStrategy As LocTextExtractionStrategy = New LocTextExtractionStrategy
                'Get the text
                Dim strCurrText As String = PdfTextExtractor.GetTextFromPage(pdfFileReader, intCurrPage, lteStrategy)

                sbPDFText.Append(strCurrText) 'Add the text to the StringBuilder
            Next
            pdfFileReader.Close() 'Close the file
        End If
        Return sbPDFText.ToString() 'Return the collected text
    End Function

    'RenderText, presumably a member of the custom LocTextExtractionStrategy class (the rest of the class is not shown)
    Public Overridable Sub RenderText(ByVal renderInfo As TextRenderInfo) Implements ITextExtractionStrategy.RenderText
        Dim segment As LineSegment = renderInfo.GetBaseline()
        Dim location As New TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth())

        'When a chunk comes back empty, dump what has been collected so far
        If renderInfo.GetText = "" Then
            Console.WriteLine(GetResultantText())
        End If
        With location
            'Chunk location:
            Debug.Print(renderInfo.GetText)
            .PosLeft = renderInfo.GetDescentLine.GetStartPoint(Vector.I1)
            .PosRight = renderInfo.GetAscentLine.GetEndPoint(Vector.I1)
            .PosBottom = renderInfo.GetDescentLine.GetStartPoint(Vector.I2)
            .PosTop = renderInfo.GetAscentLine.GetEndPoint(Vector.I2)
            'Chunk font size (height):
            .curFontSize = .PosTop - segment.GetStartPoint()(Vector.I2)
            'Use font name and size as the key in the SortedList
            Dim StrKey As String = renderInfo.GetFont.PostscriptFontName & .curFontSize.ToString
            'Add this font to the ThisPdfDocFonts SortedList if it's not already present
            If 1 = 1 Then 'Always true, so the Else branch below is never reached
                If Not ThisPdfDocFonts.ContainsKey(StrKey) Then ThisPdfDocFonts.Add(StrKey, renderInfo.GetFont)
                'Store the SortedList index in this chunk, so we can get it later
                .FontIndex = ThisPdfDocFonts.IndexOfKey(StrKey)
                Console.WriteLine(renderInfo.GetFont.ToString & "-->" & StrKey)
            Else
                'pcbContent.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED), 9)
                .FontIndex = 3
                .curFontSize = 8
            End If
        End With
        locationalResult.Add(location)
    End Sub

Answered by Bruno Lowagie

Thank you for sharing the PDF document. It helped us to determine that the problem you describe is not an iTextSharp problem. Instead it is a problem with the PDF document itself.

This problem doesn't have a solution, but I'm providing this answer to explain how you can discover for yourself that the problem also exists when iTextSharp isn't involved.

Open the document in Adobe Reader. Select the text "Muy señores nuestros" and copy/paste it into a text editor. You get "Muy señores nuestros". This is text that can be extracted using iTextSharp (it works correctly).

Now do the same with the text "GUARDIAN GLASS EXPRESS, S.L.". You get the following result: "". As you can see, you cannot copy/paste the text correctly from Adobe Reader. This is due to the way the text is stored in the PDF. If you cannot copy/paste the text from Adobe Reader, you should not expect to be able to extract the text using iTextSharp. The PDF was created in a way that doesn't allow extraction.

Please take a look at this video to find out some possible causes: https://www.youtube.com/watch?v=wxGEEv7ibHE

I'm sorry that it took so long to figure this out and that it turns out that you're asking something that isn't possible. Your question narrowed the problem down too much, as if the problem was caused by the "IDENTITY-H" encoding and iTextSharp. In reality, you're trying to extract text that can't be extracted.

If you look at the page dictionary inside the PDF, you'll find three font resources for the first (and only) page:

[Image: screenshot of the page dictionary and content stream; not reproduced here]

In the content stream, below the small red arrow, you see two strings (in hexadecimal notation) that are shown using the fonts referenced by the names C2_0 and C2_1. Incidentally, these fonts are stored as composite fonts with /Subtype /Type0 and /Encoding /Identity-H. This means that the characters used in the hexadecimal string should correspond with the Unicode values of the glyphs. If that's not the case, you're out of luck.
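To make the point about hexadecimal strings concrete, here is a minimal, library-free sketch in VB.NET. It is not iTextSharp's implementation and it does not read the PDF from the question. With Identity-H, every two bytes of a string form one character code. The sketch assumes that an optional dictionary, `toUnicode`, stands in for a code-to-text table such as a font's /ToUnicode CMap. The names `HexToCodes`, `CodesToText`, and `hypotheticalMap`, as well as both hex strings, are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.Text

Module IdentityHSketch

    ' Splits the digits of a PDF hex string (the part between < and >) into
    ' 2-byte codes. Assumes no whitespace; trailing digits that do not fill
    ' a complete 4-digit group are ignored.
    Function HexToCodes(ByVal hex As String) As List(Of Integer)
        Dim codes As New List(Of Integer)()
        For i As Integer = 0 To hex.Length - 4 Step 4
            codes.Add(Integer.Parse(hex.Substring(i, 4), NumberStyles.HexNumber, CultureInfo.InvariantCulture))
        Next
        Return codes
    End Function

    ' Turns codes into text. A code found in toUnicode uses that mapping;
    ' any other code is read as if it were the Unicode value itself.
    Function CodesToText(ByVal codes As List(Of Integer), ByVal toUnicode As Dictionary(Of Integer, String)) As String
        Dim sb As New StringBuilder()
        For Each code As Integer In codes
            Dim mapped As String = Nothing
            If toUnicode IsNot Nothing AndAlso toUnicode.TryGetValue(code, mapped) Then
                sb.Append(mapped)
            Else
                sb.Append(Convert.ToChar(code))
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Hypothetical string whose codes equal Unicode values: extraction works.
        Console.WriteLine(CodesToText(HexToCodes("004700550041"), Nothing)) ' prints GUA

        ' Hypothetical string whose codes are arbitrary glyph numbers.
        Dim codes As List(Of Integer) = HexToCodes("000100020003")
        Dim hypotheticalMap As New Dictionary(Of Integer, String) From {{1, "G"}, {2, "U"}, {3, "A"}}
        Console.WriteLine(CodesToText(codes, hypotheticalMap)) ' prints GUA

        ' Without a mapping the same codes become control characters, not "GUA".
        Console.WriteLine(CodesToText(codes, Nothing) = "GUA") ' prints False
    End Sub

End Module
```

When the codes carry no usable Unicode meaning and no mapping is available, the result is unreadable. That would be consistent with what copy/paste from Adobe Reader produced for the second string, although the sketch does not establish what actually went wrong in that particular PDF.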

There seems to be no problem with the font for which the name /TT0 is used.

The fact that /TT0 uses WinAnsiEncoding and the other fonts use Identity-H is irrelevant. There are plenty of PDF files with fonts that use Identity-H whose text can be copy/pasted or extracted using iTextSharp. Unfortunately, there is probably something wrong with the way your PDF was constructed. It would take too much time to analyze what went wrong, so your best option is to contact the person who gave you the PDF and ask them to fix it.
