vba 从 PDF 中提取数据并添加到工作表

Question

提问by Will Bell

I am trying to extract the data from a PDF document into a worksheet. The PDFs show and text can be manually copied and pasted into the Excel document.

我正在尝试将 PDF 文档中的数据提取到工作表中。PDF 显示和文本可以手动复制并粘贴到 Excel 文档中。

I am currently doing this through SendKeys and it is not working. I get an error when I try to paste the data from the PDF document. Why is my paste not working? If I paste after the macro has stopped running it pastes as normal.

我目前正在通过 SendKeys 执行此操作，但它不起作用。当我尝试粘贴 PDF 文档中的数据时出现错误。为什么我的粘贴不起作用？如果我在宏停止运行后粘贴，它会正常粘贴。

Dim myPath As String, myExt As String
Dim ws As Worksheet
Dim openPDF As Object
'Dim pasteData As MSForms.DataObject
Dim fCell As Range

'Set pasteData = New MSForms.DataObject
Set ws = Sheets("DATA")
If ws.Cells(ws.Rows.Count, "A").End(xlUp).Row > 1 Then Range("A3:A" & ws.Cells(ws.Rows.Count, "A").End(xlUp).Row).ClearContents

myExt = "\*.pdf"
'When Scan Receipts Button Pressed Scan the selected folder/s for receipts
For Each fCell In Range(ws.Cells(1, 1), ws.Cells(1, ws.Cells(1, ws.Columns.Count).End(xlToLeft).Column))
    myPath = Dir(fCell.Value & myExt)
    Do While myPath <> ""
        myPath = fCell.Value & "\" & myPath
        Set openPDF = CreateObject("Shell.Application")
        openPDF.Open (myPath)
        Application.Wait Now + TimeValue("00:00:2")
        SendKeys "^a"
        Application.Wait Now + TimeValue("00:00:2")
        SendKeys "^c"
        'Application.Wait Now + TimeValue("00:00:2")
        ws.Select
        ActiveSheet.Paste
        'pasteData.GetFromClipboard

        'ws.Cells(3, 1) = pasteData.GetText
        Exit Sub

        myPath = Dir
    Loop

Next fCell

Answer 1

回答by leowyn

You can open the PDF file and extract its contents using the Adobe library (which I believe you can download from Adobe as part of the SDK, but it comes with certain versions of Acrobat as well)

您可以打开 PDF 文件并使用 Adobe 库提取其内容（我相信您可以从 Adobe 下载它作为 SDK 的一部分，但它也随某些版本的 Acrobat 一起提供）

Make sure to add the Library to your references too (On my machine it is the Adobe Acrobat 10.0 Type Library, but not sure if that is the newest version)

确保也将库添加到您的参考文献中（在我的机器上它是 Adobe Acrobat 10.0 类型库，但不确定那是否是最新版本）

Even with the Adobe library it is not trivial (you'll need to add your own error-trapping etc):

即使使用 Adobe 库，它也不是微不足道的（您需要添加自己的错误捕获等）：

Function getTextFromPDF(ByVal strFilename As String) As String
   Dim objAVDoc As New AcroAVDoc
   Dim objPDDoc As New AcroPDDoc
   Dim objPage As AcroPDPage
   Dim objSelection As AcroPDTextSelect
   Dim objHighlight As AcroHiliteList
   Dim pageNum As Long
   Dim strText As String

   strText = ""
   If (objAvDoc.Open(strFilename, "") Then
      Set objPDDoc = objAVDoc.GetPDDoc
      For pageNum = 0 To objPDDoc.GetNumPages() - 1
         Set objPage = objPDDoc.AcquirePage(pageNum)
         Set objHighlight = New AcroHiliteList
         objHighlight.Add 0, 10000 ' Adjust this up if it's not getting all the text on the page
         Set objSelection = objPage.CreatePageHilite(objHighlight)

         If Not objSelection Is Nothing Then
            For tCount = 0 To objSelection.GetNumText - 1
               strText = strText & objSelection.GetText(tCount)
            Next tCount
         End If
      Next pageNum
      objAVDoc.Close 1
   End If

   getTextFromPDF = strText

End Function

What this does is essentially the same thing you are trying to do - only using Adobe's own library. It's going through the PDF one page at a time, highlighting all of the text on the page, then dropping it (one text element at a time) into a string.

它所做的基本上与您尝试做的事情相同 - 仅使用 Adobe 自己的库。它一次一页地浏览 PDF，突出显示页面上的所有文本，然后将其（一次一个文本元素）放入一个字符串中。

Keep in mind what you get from this could be full of all kinds of non-printing characters (line feeds, newlines, etc) that could even end up in the middle of what look like contiguous blocks of text, so you may need additional code to clean it up before you can use it.

请记住，您从中得到的内容可能充满各种非打印字符（换行符、换行符等），它们甚至可能位于看起来像连续文本块的中间，因此您可能需要额外的代码在您可以使用它之前清理它。

Hope that helps!

希望有帮助！

Answer 2

回答by expodavid

I know this is an old issue but I just had to do this for a project at work, and I am very surprised that nobody has thought of this solution yet: Just open the .pdf with Microsoft word.

我知道这是一个老问题，但我只需要为工作中的项目执行此操作，我很惊讶还没有人想到此解决方案： 只需使用 Microsoft Word 打开 .pdf。

The code is a lot easier to work with when you are trying to extract data from a .docx because it opens in Microsoft Word. Excel and Word play well together because they are both Microsoft programs. In my case, the file of question hadto be a .pdf file. Here's the solution I came up with:

当您尝试从 .docx 中提取数据时，该代码更易于使用，因为它在 Microsoft Word 中打开。Excel 和 Word 可以很好地配合使用，因为它们都是 Microsoft 程序。在我的情况下，问题的文件必须是.pdf文件。这是我想出的解决方案：

Choose the default program to open .pdf files to be Microsoft Word
The first time you open a .pdf file with word, a dialogue box pops up claiming word will need to convert the .pdf into a .docx file. Click the check box in the bottom left stating "do not show this message again" and then click OK.
Create a macro that extracts data from a .docx file. I used MikeD's Codeas a resource for this.
Tinker around with the MoveDown, MoveRight, and Find.Execute methods to fit the need of your task.

选择打开 .pdf 文件的默认程序为 Microsoft Word
第一次用word 打开.pdf 文件时，会弹出一个对话框，提示word 需要将.pdf 转换为.docx 文件。单击左下角的“不再显示此消息”复选框，然后单击“确定”。
创建一个从 .docx 文件中提取数据的宏。为此，我使用了MikeD 的代码作为资源。
修改 MoveDown、MoveRight 和 Find.Execute 方法以满足您的任务需要。

Yes you could just convert the .pdf file to a .docx file but this is a much simpler solution in my opinion.

是的，您可以将 .pdf 文件转换为 .docx 文件，但在我看来，这是一个更简单的解决方案。

Answer 3

回答by RIBH

Over time, I have found that extracting text from PDFs in a structured format is tough business. However if you are looking for an easy solution, you might want to consider XPDFtool pdftotext.

随着时间的推移，我发现以结构化格式从 PDF 中提取文本是一项艰巨的任务。但是，如果您正在寻找一个简单的解决方案，您可能需要考虑XPDFtool pdftotext。

Pseudocode to extract the text would include:

提取文本的伪代码包括：

Using SHELLVBA statement to extract the text from PDF to a temporary file using XPDF
Using sequential file read statements to read the temporary file contents into a string
Pasting the string into Excel

使用SHELLVBA 语句将文本从 PDF 提取到使用XPDF的临时文件
使用顺序文件读取语句将临时文件内容读入字符串
将字符串粘贴到 Excel 中

Simplified example below:

下面的简化示例：

    Sub ReadIntoExcel(PDFName As String)
        'Convert PDF to text
        Shell "C:\Utils\pdftotext.exe -layout " & PDFName & " tempfile.txt"

        'Read in the text file and write to Excel
        Dim TextLine as String
        Dim RowNumber as Integer
        Dim F1 as Integer
        RowNumber = 1
        F1 = Freefile()
        Open "tempfile.txt" for Input as #F1
            While Not EOF(#F1)
                Line Input #F1, TextLine
                ThisWorkbook.WorkSheets(1).Cells(RowNumber, 1).Value = TextLine
                RowNumber = RowNumber + 1
            Wend
        Close #F1
    End Sub

Answer 4

回答by Eugene

Copying and pasting by user interactions emulation could be not reliable (for example, popup appears and it switches the focus). You may be interested in trying the commercial ByteScout PDF Extractor SDKthat is specifically designed to extract data from PDF and it works from VBA. It is also capable of extracting data from invoices and tables as CSV using VB code.

通过用户交互仿真进行复制和粘贴可能不可靠（例如，出现弹出窗口并切换焦点）。您可能有兴趣尝试商业ByteScout PDF Extractor SDK，该SDK专门设计用于从 PDF 中提取数据，并且它适用于 VBA。它还能够使用VB 代码从发票和表格中以 CSV 格式提取数据。

Here is the VBA code for Excel to extract text from given locations and save them into cells in the Sheet1:

这是 Excel 的 VBA 代码，用于从给定位置提取文本并将它们保存到以下单元格中Sheet1：

Private Sub CommandButton1_Click()

' Create TextExtractor object
' Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
Dim extractor As New Bytescout_PDFExtractor.TextExtractor

extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile ("c:\sample1.pdf")

' Get page count
pageCount = extractor.GetPageCount()

Dim wb As Workbook
Dim ws As Worksheet
Dim TxtRng  As Range

Set wb = ActiveWorkbook
Set ws = wb.Sheets("Sheet1")

For i = 0 To pageCount - 1
            RectLeft = 10
            RectTop = 10
            RectWidth = 100
            RectHeight = 100

            ' check the same text is extracted from returned coordinates
            extractor.SetExtractionArea RectLeft, RectTop, RectWidth, RectHeight
            ' extract text from given area
            extractedText = extractor.GetTextFromPage(i)

            ' insert rows
            ' Rows(1).Insert shift:=xlShiftDown
            ' write cell value
             Set TxtRng = ws.Range("A" & CStr(i + 2))
             TxtRng.Value = extractedText

Next

Set extractor = Nothing


End Sub

Disclosure: I am related to ByteScout

披露：我与 ByteScout 有关

Answer 5

回答by AbhishekTripathi

Using Bytescout PDF Extractor SDKis a good option. It is cheap and gives plenty of PDF related functionality. One of the answers above points to the dead page Bytescout on GitHub. I am providing a relevant working sample to extract table from PDF. You may use it to export in any format.

使用Bytescout PDF Extractor SDK是一个不错的选择。它很便宜，并提供了大量与 PDF 相关的功能。上面的答案之一指向 GitHub 上的死页 Bytescout。我提供了一个相关的工作样本来从 PDF 中提取表格。您可以使用它以任何格式导出。

Set extractor = CreateObject("Bytescout.PDFExtractor.StructuredExtractor")

extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile "../../sample3.pdf"

For ipage = 0 To extractor.GetPageCount() - 1 

    ' starting extraction from page #"
    extractor.PrepareStructure ipage

    rowCount = extractor.GetRowCount(ipage)

    For row = 0 To rowCount - 1 
        columnCount = extractor.GetColumnCount(ipage, row)

        For col = 0 To columnCount-1
            WScript.Echo "Cell at page #" +CStr(ipage) + ", row=" & CStr(row) & ", column=" & _
                CStr(col) & vbCRLF & extractor.GetCellValue(ipage, row, col)
        Next
    Next
Next

Many more samples available here: https://github.com/bytescout/pdf-extractor-sdk-samples

此处提供更多示例：https: //github.com/bytescout/pdf-extractor-sdk-samples

Answer 6

回答by Slinky Sloth

Since I do not prefer to rely on external libraries and/or other programs, I have extended your solution so that it works. The actual change here is using the GetFromClipboardfunction instead of Pastewhich is mainly used to paste a range of cells. Of course, the downside is that the user must not change focus or intervene during the whole process.

由于我不喜欢依赖外部库和/或其他程序，因此我扩展了您的解决方案以使其有效。这里的实际更改是使用GetFromClipboard函数而不是Paste，后者主要用于粘贴一系列单元格。当然，缺点是用户不能在整个过程中改变焦点或干预。

Dim pathPDF As String, textPDF As String
Dim openPDF As Object
Dim objPDF As MsForms.DataObject

pathPDF = "C:\some\path\data.pdf"
Set openPDF = CreateObject("Shell.Application")
openPDF.Open (pathPDF)
'TIME TO WAIT BEFORE/AFTER COPY AND PASTE SENDKEYS
Application.Wait Now + TimeValue("00:00:2")
SendKeys "^a"
Application.Wait Now + TimeValue("00:00:2")
SendKeys "^c"
Application.Wait Now + TimeValue("00:00:1")

AppActivate ActiveWorkbook.Windows(1).Caption
objPDF.GetFromClipboard
textPDF = objPDF.GetText(1)
MsgBox textPDF

If you're interested see my project in github.

如果您有兴趣，请在github 中查看我的项目。

Answer 7

回答by eugene michaux

To improve the solution of Slinky Sloth I had to add this beforere get from clipboard :

为了改进 Slinky Sloth 的解决方案，我必须在从剪贴板获取之前添加这个：

Set objPDF = New MSForms.DataObject

Sadly it didn't worked for a pdf of 10 pages.

遗憾的是，它不适用于 10 页的 pdf。

Answer 8

回答by Bob Nunemaker

This doesn't seem to work with the Adobe Type library. As soon as it gets to Open, I get a 429 error. Acrobat works fine though...

这似乎不适用于 Adobe Type 库。一旦打开，我就会收到 429 错误。虽然 Acrobat 工作正常...

vba 从 PDF 中提取数据并添加到工作表

提问by Will Bell

回答by leowyn

回答by expodavid

回答by RIBH

回答by Eugene

回答by AbhishekTripathi

回答by Slinky Sloth

回答by eugene michaux

回答by Bob Nunemaker

相关推荐

最近更新

标签

vba 从 PDF 中提取数据并添加到工作表

提问by Will Bell

回答by leowyn

回答by expodavid

回答by RIBH

回答by Eugene

回答by AbhishekTripathi

回答by Slinky Sloth

回答by eugene michaux

回答by Bob Nunemaker

相关推荐

vba 检查值是否存在于范围内而不循环

vba 如何在Excel中获取范围的大小

vba 将包含文件的文件夹解压缩到所选位置

Microsoft Access VBA 编辑 SQL 查询

相关推荐

最近更新

标签