How to extract PDF fields from a filled out form in Python?

Question

Asked by Olson

I'm trying to use Python to process some PDF forms that were filled out and signed using Adobe Acrobat Reader.

I've tried:

The pdfminer demo: it didn't dump any of the filled-out data.
pyPdf: it maxed out a core for 2 minutes when I tried to load the file with PdfFileReader(f), so I gave up and killed it.
Jython and PDFBox: got that working great, but the startup time is excessive; I'll just write an external utility in straight Java if that's my only option.

I can keep hunting for libraries and trying them but I'm hoping someone already has an efficient solution for this.

Update: Based on Steven's answer I looked into pdfminer and it did the trick nicely.

from argparse import ArgumentParser
import pickle
import pprint
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdftypes import resolve1, PDFObjRef

def load_form(filename):
    """Load pdf form contents into a nested list of name/value tuples"""
    with open(filename, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize()
        return [load_fields(resolve1(f)) for f in
                   resolve1(doc.catalog['AcroForm'])['Fields']]

def load_fields(field):
    """Recursively load form fields"""
    form = field.get('Kids', None)
    if form:
        return [load_fields(resolve1(f)) for f in form]
    else:
        # Some field types, like signatures, need extra resolving
        return (field.get('T').decode('utf-16'), resolve1(field.get('V')))

def parse_cli():
    """Load command line arguments"""
    parser = ArgumentParser(description='Dump the form contents of a PDF.')
    parser.add_argument('file', metavar='pdf_form',
                    help='PDF Form to dump the contents of')
    parser.add_argument('-o', '--out', help='Write output to file',
                      default=None, metavar='FILE')
    parser.add_argument('-p', '--pickle', action='store_true', default=False,
                      help='Format output for python consumption')
    return parser.parse_args()

def main():
    args = parse_cli()
    form = load_form(args.file)
    if args.out:
        with open(args.out, 'w') as outfile:
            if args.pickle:
                pickle.dump(form, outfile)
            else:
                pp = pprint.PrettyPrinter(indent=2)
                file.write(pp.pformat(form))
    else:
        if args.pickle:
            print pickle.dumps(form)
        else:
            pp = pprint.PrettyPrinter(indent=2)
            pp.pprint(form)

if __name__ == '__main__':
    main()

Answer 1

Accepted answer by Steven

You should be able to do it with pdfminer, but it will require some delving into the internals of pdfminer and some knowledge of the PDF format (forms of course, but also PDF's internal structures such as "dictionaries" and "indirect objects").
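
To make "dictionaries" and "indirect objects" a bit more concrete, here is a toy model in plain Python. It is only a sketch of the idea: the names Ref, OBJECTS and resolve are hypothetical and are not pdfminer's API, and a real PDF stores these objects in a file rather than in a Python dict.

```python
# Toy model of PDF indirect objects. Ref, OBJECTS and resolve are
# hypothetical names for illustration; they are not part of pdfminer.

class Ref:
    """Stand-in for an indirect reference such as `12 0 R`."""
    def __init__(self, objid):
        self.objid = objid

# Object table: object id -> object. Values may themselves contain Refs.
OBJECTS = {
    1: {"AcroForm": Ref(2)},           # document catalog
    2: {"Fields": [Ref(3), Ref(4)]},   # form dictionary
    3: {"T": "name", "V": "Alice"},    # a text field
    4: {"T": "signed", "V": Ref(5)},   # a field whose value is indirect
    5: "Yes",
}

def resolve(obj):
    """Follow references until a direct object is reached.

    Like a one-level resolver, this does not descend into containers.
    """
    while isinstance(obj, Ref):
        obj = OBJECTS[obj.objid]
    return obj

catalog = OBJECTS[1]
form = resolve(catalog["AcroForm"])
values = {}
for ref in form["Fields"]:
    field = resolve(ref)
    values[field["T"]] = resolve(field["V"])
print(values)
```

The point is that almost any value can be a reference, so each lookup on the way from the catalog to a field value has to be resolved before it is used.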

This example might help you on your way (I think it will work only on simple cases, with no nested fields etc...)

import sys
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

filename = sys.argv[1]
fp = open(filename, 'rb')

parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog['AcroForm'])['Fields']
for i in fields:
    field = resolve1(i)
    name, value = field.get('T'), field.get('V')
    print('{0}: {1}'.format(name, value))  # parenthesized form runs on Python 2 and 3

EDIT: forgot to mention: if you need to provide a password, pass it to doc.initialize()
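
The loop above only looks at top-level fields. A possible way to handle nested fields is to walk the 'Kids' arrays and join the partial names with periods, which is how the PDF format builds fully qualified field names. The sketch below assumes the fields have already been resolved into plain dicts (with pdfminer, each entry would first need resolve1); flatten_fields and the sample data are made up for illustration.

```python
def flatten_fields(fields, prefix=""):
    """Yield (fully_qualified_name, value) pairs for terminal fields.

    `fields` is a list of plain dicts with the keys 'T' (partial name),
    'V' (value) and 'Kids' (children), standing in for resolved PDF
    field dictionaries.
    """
    for field in fields:
        partial = field.get("T")
        if prefix and partial:
            name = prefix + "." + partial
        else:
            name = partial or prefix
        # Only children that carry their own name are sub-fields; unnamed
        # children are widgets that share the parent's value.
        named_kids = [k for k in field.get("Kids", []) if "T" in k]
        if named_kids:
            for item in flatten_fields(named_kids, name):
                yield item
        else:
            yield name, field.get("V")

# Hypothetical, already-resolved form data.
sample = [
    {"T": "applicant", "Kids": [
        {"T": "first", "V": "Ada"},
        {"T": "last", "V": "Lovelace"},
    ]},
    {"T": "agree", "V": "Yes", "Kids": [{"Subtype": "Widget"}]},
]
flat = dict(flatten_fields(sample))
print(flat)
```

Returning flat (name, value) pairs instead of nested lists makes the result easier to look up by name, at the cost of losing the original tree shape.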

Answer 2

Answered by Philip

Quick and dirty 2-minute job; just use PDFminer to convert the PDF to XML and then grab all of the fields.

from xml.etree import ElementTree
from pprint import pprint
import os

def main():
    print("Calling PDFDUMP.py")
    os.system("dumppdf.py -a FILE.pdf > out.xml")

    # Preprocess the file to eliminate bad XML.
    print("Screening the file")
    # "w" truncates output.xml and writes it from scratch.
    with open("out.xml") as src, open("output.xml", "w") as dst:
        for line in src:
            # Some formatting info contains bad character references.
            dst.write(line.replace("&#", "Invalid_XML"))

    print("Opening XML output")
    tree = ElementTree.parse('output.xml')
    lastnode = ""
    lastnode2 = ""
    fields = {}  # field name -> value
    entry = {}

    for node in tree.iter():  # Run through the tree.
        # A <key>T</key> node announces that the next node holds the field name.
        if node.tag == "key" and node.text == "T":
            lastnode = node.tag + node.text
        elif lastnode == "keyT":
            for child in node.iter():
                entry["ID"] = child.text
            lastnode = ""

        # A <key>V</key> node announces that the next node holds the field value.
        if node.tag == "key" and node.text == "V":
            lastnode2 = node.tag + node.text
        elif lastnode2 == "keyV":
            for child in node.iter():
                if child.tag == "string":
                    if "ID" in entry:
                        entry["Value"] = child.text
                        fields[entry["ID"]] = entry["Value"]
                        entry = {}
            lastnode2 = ""

    pprint(fields)

if __name__ == '__main__':
  main()

It isn't pretty, just a simple proof of concept. I need to implement it for a system I'm working on so I will be cleaning it up, but I thought I would post it in case anyone finds it useful.
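
One possible cleanup is to pair each key with the value element that follows it inside the same dict element, instead of tracking the last node seen. The sketch below works on an inline sample that only approximates what dumppdf.py emits (alternating key and value children inside dict); the sample and the form_values helper are assumptions for illustration, so check them against your real output.

```python
from xml.etree import ElementTree

# Hand-written sample shaped like dumppdf.py output (an assumption).
SAMPLE = """<pdf>
<object id="7"><dict size="3">
<key>FT</key><value><literal>Tx</literal></value>
<key>T</key><value><string size="4">name</string></value>
<key>V</key><value><string size="5">Alice</string></value>
</dict></object>
<object id="8"><dict size="3">
<key>FT</key><value><literal>Btn</literal></value>
<key>T</key><value><string size="5">agree</string></value>
<key>V</key><value><literal>Yes</literal></value>
</dict></object>
<object id="9"><dict size="2">
<key>T</key><value><string size="4">city</string></value>
<key>V</key><value><string size="4">Oslo</string></value>
</dict></object>
</pdf>"""

def form_values(root):
    """Collect {field name: value} for dicts that have string T and V entries."""
    result = {}
    for d in root.iter("dict"):
        children = list(d)
        # Children alternate: <key>, <value>, <key>, <value>, ...
        pairs = {k.text: v for k, v in zip(children[0::2], children[1::2])}
        if "T" in pairs and "V" in pairs:
            name = pairs["T"].findtext("string")
            value = pairs["V"].findtext("string")
            # Like the script above, keep only values stored as strings.
            if name is not None and value is not None:
                result[name] = value
    return result

values = form_values(ElementTree.fromstring(SAMPLE))
print(values)
```

Non-string values such as checkbox states are skipped here, as in the original script; handling the literal element as well would be a small extension.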

Answer 3

Answered by vossman77

Update for the latest version of pdfminer (changed the imports and the parser/doc setup in the first function):

from argparse import ArgumentParser
import pickle
import pprint
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
from pdfminer.pdftypes import PDFObjRef

def load_form(filename):
    """Load pdf form contents into a nested list of name/value tuples"""
    with open(filename, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        parser.set_document(doc)
        #doc.set_parser(parser)
        doc.initialize()
        return [load_fields(resolve1(f)) for f in
            resolve1(doc.catalog['AcroForm'])['Fields']]

def load_fields(field):
    """Recursively load form fields"""
    form = field.get('Kids', None)
    if form:
        return [load_fields(resolve1(f)) for f in form]
    else:
        # Some field types, like signatures, need extra resolving
        return (field.get('T').decode('utf-8'), resolve1(field.get('V')))

def parse_cli():
    """Load command line arguments"""
    parser = ArgumentParser(description='Dump the form contents of a PDF.')
    parser.add_argument('file', metavar='pdf_form',
        help='PDF Form to dump the contents of')
    parser.add_argument('-o', '--out', help='Write output to file',
        default=None, metavar='FILE')
    parser.add_argument('-p', '--pickle', action='store_true', default=False,
        help='Format output for python consumption')
    return parser.parse_args()

def main():
    args = parse_cli()
    form = load_form(args.file)
    if args.out:
        with open(args.out, 'w') as outfile:
            if args.pickle:
                pickle.dump(form, outfile)
            else:
                pp = pprint.PrettyPrinter(indent=2)
                outfile.write(pp.pformat(form))
    else:
        if args.pickle:
            print pickle.dumps(form)
        else:
            pp = pprint.PrettyPrinter(indent=2)
            pp.pprint(form)

if __name__ == '__main__':
    main()
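
Note that the question's code decodes field names as 'utf-16' while this version uses 'utf-8'; which one works depends on how the PDF stored the name. PDF text strings are normally either UTF-16BE with a leading byte order mark or PDFDocEncoding, so one option is to check for the mark first. This is a minimal sketch: decode_pdf_text is a made-up helper, and Latin-1 is used only as an approximation of PDFDocEncoding.

```python
import codecs

def decode_pdf_text(raw):
    """Decode the raw bytes of a PDF text string into str.

    Assumption: Latin-1 approximates PDFDocEncoding. The two agree for
    ASCII and most accented letters but differ for some code points.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Drop the 2-byte byte order mark, then decode as big-endian UTF-16.
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    return raw.decode("latin-1")

utf16_name = decode_pdf_text(b"\xfe\xff\x00N\x00a\x00m\x00e")
plain_name = decode_pdf_text(b"Name")
print(utf16_name, plain_name)
```

In load_fields, the name could then be read with decode_pdf_text(field.get('T')) instead of a fixed codec.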

Answer 4

Answered by Shane

There is a typo in this line of main():

file.write(pp.pformat(form))

Should be:

outfile.write(pp.pformat(form))
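
For context, the output step with that fix applied might look like the sketch below. It also opens the file in binary mode for pickling, which Python 3 requires; write_form is a hypothetical helper, not part of the original script.

```python
import pickle
import pprint

def write_form(form, path, as_pickle=False):
    """Write the extracted form data to `path`."""
    if as_pickle:
        # pickle produces bytes, so the file must be opened in binary mode.
        with open(path, "wb") as outfile:
            pickle.dump(form, outfile)
    else:
        with open(path, "w") as outfile:
            # Write to `outfile`, the handle opened just above.
            outfile.write(pprint.pformat(form, indent=2))
```

Calling write_form(form, args.out, as_pickle=args.pickle) would then replace the nested if/else inside main().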

Answer 5

Answered by dvska

Python 3.6+:

pip install PyPDF2

# -*- coding: utf-8 -*-

from collections import OrderedDict
from PyPDF2 import PdfFileWriter, PdfFileReader


def _getFields(obj, tree=None, retval=None, fileobj=None):
    """
    Extracts field data if this PDF contains interactive form fields.
    The *tree* and *retval* parameters are for recursive use.

    :param fileobj: A file object (usually a text file) to write
        a report to on all interactive form fields found.
    :return: A dictionary where each key is a field name, and each
        value is a :class:`Field<PyPDF2.generic.Field>` object. By
        default, the mapping name is used for keys.
    :rtype: dict, or ``None`` if form data could not be located.
    """
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                       '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
        catalog = obj.trailer["/Root"]
        # get the AcroForm tree
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval

    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            # Tree is a field
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break

    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)

    return retval


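def get_form_fields(infile):
    # Keep the file open while PyPDF2 reads from it, then close it.
    with open(infile, 'rb') as f:
        reader = PdfFileReader(f)
        # _getFields returns None when the PDF has no AcroForm.
        fields = _getFields(reader) or OrderedDict()
        return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())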



if __name__ == '__main__':
    from pprint import pprint

    pdf_file_name = 'FormExample.pdf'

    pprint(get_form_fields(pdf_file_name))

Answer 6

Answered by equaeghe

The Python PyPDF2 package (successor to pyPdf) is very convenient:

import PyPDF2
f = PyPDF2.PdfFileReader('form.pdf')
ff = f.getFields()

Then ff is a dict that contains all the relevant form information.
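
If only the filled-in values are needed, that dict can be reduced to a name-to-value mapping. The sketch below uses a hand-written stand-in for ff, so it runs without a PDF; the '/V' key is the one used in the answer above, while field_values and the sample entries are assumptions for illustration.

```python
def field_values(fields):
    """Map each field name to its '/V' value ('' when the field is empty).

    `fields` may be None, which stands for a PDF without form fields.
    """
    return {name: info.get("/V", "") for name, info in (fields or {}).items()}

# Stand-in for the dict described above (hypothetical contents).
sample = {
    "name": {"/FT": "/Tx", "/T": "name", "/V": "Alice"},
    "email": {"/FT": "/Tx", "/T": "email"},
}
values = field_values(sample)
print(values)
```

With a real file, field_values(ff) would be the equivalent call.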
