Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15045446/

Date: 2020-09-08 15:01:19  Source: igfitidea

Extract tables from pdf (to excel), pref. w/ vba

Tags: excel, vba, pdf, filesystemobject

Asked by MeRuud

I am trying to extract tables from pdf files with vba and export them to excel. If everything works out the way it should, the whole process should run automatically. The problem is that the tables are not standardized.

This is what I have so far.

  1. VBA (Excel) runs XPDF, and converts all .pdf files found in current folder to a text file.
  2. VBA (Excel) reads through each text file line by line.

And the code:

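With New Scripting.FileSystemObject
    'Open the text file for reading (1 = ForReading, False = do not create, 0 = ASCII)
    With .OpenTextFile(strFileName, 1, False, 0)
        'Skip the first line if the file is not empty
        If Not .AtEndOfStream Then .SkipLine
        Do Until .AtEndOfStream
            'do something with .ReadLine here
        Loop
    End With
End With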

This all works great. But now I am getting to the issue of extracting the tables from the text files. What I want is for VBA to find a string, e.g. "Year's Income", and then output the data that follows it into columns (until the table ends).

The first part is not very difficult (find a certain string), but how would I go about the second part? The text file will look like this Pastebin. The problem is that the text is not standardized. For example, some tables have three year columns (2010 2011 2012) and some only two (or one), some tables have more spaces between the columns, and some do not include certain rows (such as Capital Asset, net).

I was thinking about doing something like this but not sure how to go about it in VBA.

  1. Find a user-defined string, e.g. "Table 1: Years' Return."
  2. a. On the next line, find the years; if there are two we will need three columns in the output (titles + 2x year), if there are three we will need four (titles + 3x year), etc.
    b. Create a title column plus a column for each year.
  3. When reaching the end of a line, go to the next line.
  4. a. Read text -> output to column 1.
    b. Recognize spaces (are there more than 3 spaces?) as the start of column 2. Read numbers -> output to column 2.
    c. (if columns = 3) Recognize spaces as the start of column 3. Read numbers -> output to column 3.
    d. (if columns = 4) Recognize spaces as the start of column 4. Read numbers -> output to column 4.
  5. Repeat step 4 for each line.
  6. When the next line does not include any numbers, end the table. (Probably the easiest is a user-defined number: after 15 characters no number? End table.)
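
Step 4 of the plan above could be sketched roughly as follows. This is only an illustration, not code from the original post: the function name SplitIntoColumns, its lMinGap parameter, and the default of three spaces are all assumptions, and the sample line in the usage note is invented because the Pastebin sample is not available.

```vba
' Hypothetical helper (not from the original post): splits one line of
' text into fields wherever lMinGap or more spaces occur in a row.
Public Function SplitIntoColumns(ByVal sLine As String, _
                                 Optional ByVal lMinGap As Long = 3) As Collection
    Dim colFields As New Collection
    Dim sField As String, sChar As String
    Dim lPos As Long, lSpaces As Long

    sLine = Trim$(sLine)
    For lPos = 1 To Len(sLine)
        sChar = Mid$(sLine, lPos, 1)
        If sChar = " " Then
            ' Count consecutive spaces until the next non-space character
            lSpaces = lSpaces + 1
        Else
            If lSpaces >= lMinGap And Len(sField) > 0 Then
                ' The gap was wide enough: close the current field
                colFields.Add sField
                sField = vbNullString
            ElseIf lSpaces > 0 And Len(sField) > 0 Then
                ' A narrow gap stays inside the field (e.g. "Capital Asset, net")
                sField = sField & Space$(lSpaces)
            End If
            lSpaces = 0
            sField = sField & sChar
        End If
    Next lPos

    ' Add whatever is left as the last field
    If Len(sField) > 0 Then colFields.Add sField
    Set SplitIntoColumns = colFields
End Function
```

For a line such as "Capital Asset, net      1,200     1,350" the returned collection would hold the row title followed by one item per year column, so its Count minus one gives the number of year columns. A gap threshold is fragile when a row title itself contains wide gaps, so it may need tuning per file.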

I based my first version on Pdf to excel, but from reading online, people do not recommend OpenFile but rather FileSystemObject (even though it seems to be a lot slower).

Any pointers to get me started, mainly on step 2?

Accepted answer by CuberChase

There are a number of ways to dissect a text file, and how complex the file is may cause you to lean one way or another. I started this and it got a bit out of hand... enjoy.

Based on the sample you've provided and the additional comments, I noted the following. Some of these may work well for simple files but can get unwieldy with bigger, more complex files. Furthermore, there may be slightly more efficient methods or tricks than what I have used here, but this will definitely get you going and achieve the desired outcome. Hopefully this makes sense in conjunction with the code provided:

  • You can use booleans to help you determine what 'section' of the text file you are in. I.e. use InStr on the current line to determine that you are in a table by looking for the text 'Table', and then, once you know you are in the 'Table' section of the file, start looking for the 'Assets' section, etc.
  • You can use a few methods to determine the number of years (or columns) you have. The Split function along with a loop will do the job.
  • If your files always have constant formatting, even only in certain parts, you can take advantage of this. For example, if you know the amounts on a line will always have a dollar sign in front of them, then you know this will define the column widths and you can use this on subsequent lines of text.

The following code will extract the Assets details from the text file; you can modify it to extract other sections. It should handle multiple rows. Hopefully I've commented it sufficiently. Have a look and I'll edit if needed to help out further.

Sub ReadInTextFile()
    Dim fs As Scripting.FileSystemObject, fsFile As Scripting.TextStream
    Dim sFileName As String, sLine As String, vYears As Variant
    Dim iNoColumns As Integer, ii As Integer, iCount As Integer
    Dim bIsTable As Boolean, bIsAssets As Boolean, bIsLiabilities As Boolean, bIsNetAssets As Boolean

    Set fs = CreateObject("Scripting.FileSystemObject")
    sFileName = "G:\Sample.txt"
    Set fsFile = fs.OpenTextFile(sFileName, 1, False)

    'Loop through the file as you've already done
    Do While fsFile.AtEndOfStream <> True
        'Determine flag positions in text file
        sLine = fsFile.Readline

        Debug.Print VBA.Len(sLine)

        'Always skip empty lines (including lines holding a single space)
        If VBA.Len(sLine) > 1 Then

            'We've found a new table so we can reset the booleans
            If VBA.InStr(1, sLine, "Table") > 0 Then
                bIsTable = True
                bIsAssets = False
                bIsNetAssets = False
                bIsLiabilities = False
                iNoColumns = 0
            End If

            'Perhaps you want to also have some sort of way to designate that a table has finished.  Like so
            If VBA.InStr(1, sLine, "Some text that designates the end of the table") > 0 Then
                bIsTable = False
            End If

            'If we're in the table section then we want to read in the data
            If bIsTable Then
                'Check for your different sections.  You could make this constant if your text file allowed it.
                If VBA.InStr(1, sLine, "Assets") > 0 And VBA.InStr(1, sLine, "Net") = 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = False
                If VBA.InStr(1, sLine, "Liabilities") > 0 Then bIsAssets = False: bIsLiabilities = True: bIsNetAssets = False
                If VBA.InStr(1, sLine, "Net Assets") > 0 Then bIsAssets = False: bIsLiabilities = False: bIsNetAssets = True

                'If we haven't triggered any of these booleans then we're at the column headings
                If Not bIsAssets And Not bIsLiabilities And Not bIsNetAssets And VBA.InStr(1, sLine, "Table") = 0 Then
                    'Trim the current line to remove leading and trailing spaces then use the split function to determine the number of years
                    vYears = VBA.Split(VBA.Trim$(sLine), " ")
                    For ii = LBound(vYears) To UBound(vYears)
                        If VBA.Len(vYears(ii)) > 0 Then iNoColumns = iNoColumns + 1
                    Next ii

                    'Now we can redefine some variables to hold the information (you'll want to redim after you've collected the info)
                    ReDim sAssets(1 To iNoColumns + 1, 1 To 100) As String
                    ReDim iColumns(1 To iNoColumns) As Integer
                Else
                    If bIsAssets Then
                        'Skip the heading line
                        If Not VBA.Trim$(sLine) = "Assets" Then
                            'Increment the counter
                            iCount = iCount + 1

                            'If iCount reaches its limit you'll have to ReDim Preserve your sAssets array (I'll leave this to you)
                            If iCount > 99 Then
                                'You'll find other posts on stackoverflow to do this
                            End If

                            'This will happen on the first row, it'll happen everytime you
                            'hit a $ sign but you could code to only do so the first time
                            If VBA.InStr(1, sLine, "$") > 0 Then
                                iColumns(1) = VBA.InStr(1, sLine, "$")
                                For ii = 2 To iNoColumns
                                    'We need to start at the next character across
                                    iColumns(ii) = VBA.InStr(iColumns(ii - 1) + 1, sLine, "$")
                                Next ii
                            End If

                            'The first part (the name) is simply up to the $ sign (trimmed of spaces)
                            sAssets(1, iCount) = VBA.Trim$(VBA.Mid$(sLine, 1, iColumns(1) - 1))
                            For ii = 2 To iNoColumns
                                'Then we can loop around for the rest
                                'Column ii holds the text between the previous $ position and this one
                                sAssets(ii, iCount) = VBA.Trim$(VBA.Mid$(sLine, iColumns(ii - 1) + 1, iColumns(ii) - iColumns(ii - 1) - 1))
                            Next ii

                            'Now do the last column
                            If VBA.Len(sLine) > iColumns(iNoColumns) Then
                                sAssets(iNoColumns + 1, iCount) = VBA.Trim$(VBA.Right$(sLine, VBA.Len(sLine) - iColumns(iNoColumns)))
                            End If
                        Else
                            'Reset the counter
                            iCount = 0
                        End If
                    End If
                End If

            End If
        End If
    Loop

    'Clean up
    fsFile.Close
    Set fsFile = Nothing
    Set fs = Nothing
End Sub
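
The routine above collects the values in sAssets but does not write them anywhere. A minimal sketch of getting them onto a worksheet might look like this; the helper name WriteTableToSheet and its parameters are hypothetical, and it assumes the (column, row) layout used by sAssets.

```vba
' Hypothetical helper: writes the first lRowCount rows of a (column, row)
' string array, laid out like sAssets above, to a worksheet.
Public Sub WriteTableToSheet(ByRef sData() As String, ByVal lRowCount As Long, _
                             ByVal wsTarget As Worksheet, _
                             Optional ByVal lFirstRow As Long = 1)
    Dim lRow As Long, lCol As Long

    For lRow = 1 To lRowCount
        ' The first array dimension is the column, the second is the row
        For lCol = LBound(sData, 1) To UBound(sData, 1)
            wsTarget.Cells(lFirstRow + lRow - 1, lCol).Value = sData(lCol, lRow)
        Next lCol
    Next lRow
End Sub
```

It could be called just before the clean-up section, for example as WriteTableToSheet sAssets, iCount, ActiveSheet. Excel will convert text such as "1,200" to numbers where it can, which may or may not be what you want.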

Answer by Cyraneau de Beargerac

I cannot examine the sample data as the PasteBin has been removed. Based on what I can glean from the problem description, it seems to me that using Regular Expressions would make parsing the data much easier.

Add a reference to the Microsoft Scripting Runtime (scrrun.dll) for the FileSystemObject.
Add a reference to the Microsoft VBScript Regular Expressions 5.5 library for the RegExp object.

Instantiate a RegEx object with Dim objRE As New RegExp

Set the Pattern property to "(\b\d{4}\b){1,3}". This pattern should match on lines containing strings like "2010", "2010 2011", or "2010 2011 2012".

The number of spaces between the year strings is irrelevant, as long as there is at least one (since we're not expecting to encounter strings like 201020112012 for example)

Set the Global property to True

The captured groups will be found in the individual Match objects from the MatchCollection returned by the Execute method of the RegEx object objRE. So declare the appropriate objects:

Dim objMatches As MatchCollection
Dim objMatch As Match
Dim intMatchCount As Long 'tells you how many year strings were found, if any

Assume you've set up a FileSystemObject object and are scanning the text file, reading each line into a variable strLine.

First test to see if the current line contains the pattern sought:

Dim i As Long 'loop counter

If objRE.Test(strLine) Then
    'The line contains at least one year string, so collect the matches
    Set objMatches = objRE.Execute(strLine)
    intMatchCount = objMatches.Count

    For i = 0 To intMatchCount - 1
        'processing code such as writing the years as column headings in Excel
        Set objMatch = objMatches(i)
        'e.g. write each year one cell further to the right
        ActiveCell.Offset(0, i).Value = objMatch.Value
        'subsequent lines beneath the line containing the year strings should
        'have the amounts, which may be captured in a similar fashion using an
        'additional RegExp object and a Pattern such as "(\b\d+\b){1,3}" for
        'whole numbers or "(\b\d+\.\d+\b){1,3}" for floats. For currency, you
        'can use "(\$\d+\.\d{2}\b){1,3}" (the $ must be escaped)
    Next i
Else
    'skip over this line
End If
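
Pulled together, the outline above might look like the following self-contained sketch. The function name ExtractYears is hypothetical, and late binding through CreateObject is used here instead of the library reference so the snippet runs without extra setup. Because Global is True, each year comes back as its own match, so the simpler pattern "\b\d{4}\b" is assumed to be sufficient.

```vba
' Hypothetical helper: returns the four-digit numbers found in sLine.
' Late binding is used, so no reference to the RegExp library is needed.
Public Function ExtractYears(ByVal sLine As String) As Collection
    Dim objRE As Object, objMatches As Object
    Dim colYears As New Collection
    Dim i As Long

    Set objRE = CreateObject("VBScript.RegExp")
    objRE.Pattern = "\b\d{4}\b"   ' one standalone four-digit number
    objRE.Global = True           ' find every occurrence, not just the first

    Set objMatches = objRE.Execute(sLine)
    For i = 0 To objMatches.Count - 1
        colYears.Add objMatches(i).Value
    Next i

    Set ExtractYears = colYears
End Function
```

Calling ExtractYears on a heading line such as "     2010     2011   2012" should give a collection whose Count is the number of year columns. The pattern accepts any standalone four-digit number, so it is only meant for the heading line, not for rows of amounts.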

This is just a rough outline of how I would approach this challenge. I hope there is something in this code outline that will be of help to you.

Answer by Morris Lancaster

Another approach I have had some success with is to use VBA to convert the PDF to a .doc or .docx file and then search for and pull the tables from the Word file. They can be easily extracted into Excel sheets. The conversion seems to handle tables nicely. Note, however, that it works page by page, so a table that extends over more than one page ends up as separate tables in the Word document.

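
The answer does not include code, so the following is only a rough sketch of the table-pulling half, run from Excel. It assumes the PDF has already been converted and saved as a Word document; the procedure name CopyWordTablesToExcel and its sDocPath parameter are hypothetical, and Word is driven through late binding.

```vba
' Hypothetical sketch: copies every table in an already converted Word
' document onto the active Excel sheet, one below the other.
Public Sub CopyWordTablesToExcel(ByVal sDocPath As String)
    Dim objWord As Object, objDoc As Object
    Dim lTable As Long, lNextRow As Long

    Set objWord = CreateObject("Word.Application")
    Set objDoc = objWord.Documents.Open(FileName:=sDocPath, ReadOnly:=True)

    lNextRow = 1
    For lTable = 1 To objDoc.Tables.Count
        ' Copy the Word table to the clipboard and paste it as cells
        objDoc.Tables(lTable).Range.Copy
        ActiveSheet.Paste Destination:=ActiveSheet.Cells(lNextRow, 1)

        ' Continue below the used area, leaving one blank row
        With ActiveSheet.UsedRange
            lNextRow = .Row + .Rows.Count + 1
        End With
    Next lTable

    objDoc.Close False   ' close without saving changes
    objWord.Quit
End Sub
```

Because tables that span a page break arrive as separate Word tables, the pasted blocks would still need to be stitched together afterwards. Clipboard-based copying can also be sensitive to timing, so treat this as a starting point rather than a robust implementation.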