How to extract PDF fields from a filled out form in Python?
Source: http://stackoverflow.com/questions/3984003/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors: Stack Overflow
Asked by Olson
I'm trying to use Python to process some PDF forms that were filled out and signed using Adobe Acrobat Reader.
I've tried:
- The pdfminer demo: it didn't dump any of the filled-out data.
- pyPdf: it maxed out a core for 2 minutes when I tried to load the file with PdfFileReader(f), so I gave up and killed it.
- Jython and PDFBox: I got that working well, but the startup time is excessive. I'll just write an external utility in straight Java if that's my only option.
I can keep hunting for libraries and trying them but I'm hoping someone already has an efficient solution for this.
Update: Based on Steven's answer I looked into pdfminer and it did the trick nicely.
    from argparse import ArgumentParser
    import pickle
    import pprint

    # These imports and the setup in load_form() match the older pdfminer API.
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdftypes import resolve1


    def load_form(filename):
        """Load pdf form contents into a nested list of name/value tuples"""
        with open(filename, 'rb') as fp:
            parser = PDFParser(fp)
            doc = PDFDocument()
            parser.set_document(doc)
            doc.set_parser(parser)
            doc.initialize()
            return [load_fields(resolve1(f)) for f in
                    resolve1(doc.catalog['AcroForm'])['Fields']]


    def load_fields(field):
        """Recursively load form fields"""
        form = field.get('Kids', None)
        if form:
            return [load_fields(resolve1(f)) for f in form]
        else:
            # Some field types, like signatures, need extra resolving
            return (field.get('T').decode('utf-16'), resolve1(field.get('V')))


    def parse_cli():
        """Load command line arguments"""
        parser = ArgumentParser(description='Dump the form contents of a PDF.')
        parser.add_argument('file', metavar='pdf_form',
                            help='PDF Form to dump the contents of')
        parser.add_argument('-o', '--out', help='Write output to file',
                            default=None, metavar='FILE')
        parser.add_argument('-p', '--pickle', action='store_true', default=False,
                            help='Format output for python consumption')
        return parser.parse_args()


    def main():
        args = parse_cli()
        form = load_form(args.file)
        if args.out:
            # Pickle output is binary; the pretty-printed output is text.
            with open(args.out, 'wb' if args.pickle else 'w') as outfile:
                if args.pickle:
                    pickle.dump(form, outfile)
                else:
                    pp = pprint.PrettyPrinter(indent=2)
                    # Write to outfile (the original post wrote to `file` here by mistake)
                    outfile.write(pp.pformat(form))
        else:
            if args.pickle:
                print(pickle.dumps(form))
            else:
                pp = pprint.PrettyPrinter(indent=2)
                pp.pprint(form)


    if __name__ == '__main__':
        main()
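Because load_form() returns a nested list (one sub-list for every field that has 'Kids'), a flat name-to-value mapping is often easier to work with. The following is a minimal sketch of one way to flatten that structure. The helper name flatten_fields and the sample data are hypothetical and not part of the original script, and the sketch assumes leaf field names are unique.

```python
def flatten_fields(form):
    """Flatten a nested list of (name, value) tuples into one dict.

    Hypothetical helper: it assumes the structure produced by a
    load_form()-style function, where tuples are leaf fields and
    lists hold the children of a field with 'Kids'.
    """
    flat = {}
    for item in form:
        if isinstance(item, tuple):
            name, value = item
            flat[name] = value
        else:
            # A list holds the children of a parent field; recurse into it.
            flat.update(flatten_fields(item))
    return flat


# Made-up sample data in the same shape as the nested output.
sample = [("name", "Alice"), [("street", "1 Main St"), ("city", "Springfield")]]
flat = flatten_fields(sample)
```

Note that fields sharing the same leaf name under different parents would overwrite each other in this sketch.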
Accepted answer by Steven
You should be able to do it with pdfminer, but it will require some delving into pdfminer's internals and some knowledge of the PDF format: forms, of course, but also PDF's internal structures such as "dictionaries" and "indirect objects".
This example might help you on your way (I think it will work only in simple cases, with no nested fields etc.):
    import sys

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    filename = sys.argv[1]
    with open(filename, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        # The AcroForm entry and each field may be indirect references,
        # so resolve them before reading their keys.
        fields = resolve1(doc.catalog['AcroForm'])['Fields']
        for i in fields:
            field = resolve1(i)
            name, value = field.get('T'), field.get('V')
            print('{0}: {1}'.format(name, value))
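EDIT: I forgot to mention that if you need to provide a password, you pass it to doc.initialize() in the older pdfminer API. With the newer API used above, where PDFDocument takes the parser directly, the password is passed as the second argument to PDFDocument instead.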
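To make the "dictionaries" and "indirect objects" mentioned in this answer more concrete, here is a toy model in plain Python. It is only an illustration of the idea behind resolve1, not pdfminer's implementation: the Ref class, the resolve function, and the object table are all hypothetical stand-ins.

```python
class Ref:
    """Toy stand-in for an indirect reference such as '12 0 R' in a PDF."""

    def __init__(self, objid):
        self.objid = objid


def resolve(obj, objects):
    """Follow references until a direct (non-reference) object is reached."""
    while isinstance(obj, Ref):
        obj = objects[obj.objid]
    return obj


# Toy object table: object number -> object. Dictionaries may point to
# other objects through references instead of containing them directly.
objects = {
    1: {"AcroForm": Ref(2)},               # document catalog
    2: {"Fields": [Ref(3), Ref(4)]},       # form dictionary
    3: {"T": "name", "V": "Alice"},        # text field with a direct value
    4: {"T": "signature", "V": Ref(5)},    # value stored as another object
    5: {"Type": "Sig"},
}

catalog = objects[1]
fields = resolve(catalog["AcroForm"], objects)["Fields"]
result = {}
for ref in fields:
    field = resolve(ref, objects)
    result[field["T"]] = resolve(field["V"], objects)
```

The pattern is the same as in the pdfminer code: every value read from a dictionary may be a reference, so it is resolved before use.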
Answer by Philip
Quick and dirty 2-minute job: just use PDFMiner to convert the PDF to XML and then grab all of the fields.
    from xml.etree import ElementTree
    from pprint import pprint
    import os


    def main():
        print("Calling dumppdf.py")
        os.system("dumppdf.py -a FILE.pdf > out.xml")

        # Preprocess the file to eliminate bad XML.
        print("Screening the file")
        with open("out.xml") as src, open("output.xml", "w") as dst:
            for line in src:
                # Some formatting info in the dump contains bad character references.
                dst.write(line.replace("&#", "Invalid_XML"))

        print("Opening XML output")
        tree = ElementTree.parse('output.xml')
        lastnode = ""
        lastnode2 = ""
        fields = {}  # field name -> value
        entry = {}

        for node in tree.iter():  # Run through the tree
            # A <key>T</key> node means the next node holds the field name
            if node.tag == "key" and node.text == "T":
                lastnode = node.tag + node.text
            elif lastnode == "keyT":
                for child in node.iter():
                    entry["ID"] = child.text
                lastnode = ""

            # A <key>V</key> node means the next node holds the field value
            if node.tag == "key" and node.text == "V":
                lastnode2 = node.tag + node.text
            elif lastnode2 == "keyV":
                for child in node.iter():
                    if child.tag == "string":
                        if "ID" in entry:
                            entry["Value"] = child.text
                            fields[entry["ID"]] = entry["Value"]
                            entry = {}
                lastnode2 = ""

        pprint(fields)


    if __name__ == '__main__':
        main()
It isn't pretty, just a simple proof of concept. I need to implement it for a system I'm working on so I will be cleaning it up, but I thought I would post it in case anyone finds it useful.
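As one possible direction for that cleanup, the state tracking with lastnode can be replaced by pairing each key element with the value element that follows it. This is only a sketch: the sample XML is made up and assumes the dump nests key, value, and string tags inside dict elements, as the script above implies, and the function name fields_from_xml is hypothetical.

```python
from xml.etree import ElementTree

# Made-up sample, assumed to resemble the dumped XML layout.
SAMPLE = """
<pdf>
<object id="3"><dict size="2">
<key>T</key><value><string size="4">name</string></value>
<key>V</key><value><string size="5">Alice</string></value>
</dict></object>
<object id="4"><dict size="1">
<key>T</key><value><string size="5">empty</string></value>
</dict></object>
</pdf>
"""


def fields_from_xml(root):
    """Collect {field name: value} from every dict that has both T and V."""
    fields = {}
    for d in root.iter("dict"):
        children = list(d)
        entries = {}
        # Children are assumed to alternate: key, value, key, value, ...
        for key, value in zip(children[0::2], children[1::2]):
            if key.tag == "key" and value.tag == "value":
                string = value.find("string")
                if string is not None:
                    entries[key.text] = string.text
        if "T" in entries and "V" in entries:
            fields[entries["T"]] = entries["V"]
    return fields


fields = fields_from_xml(ElementTree.fromstring(SAMPLE))
```

Like the original, this only picks up values stored as strings; fields without a V entry are skipped.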
Answer by vossman77
Update for the latest version of pdfminer (the imports and the parser/doc setup in the first function have changed):
    from argparse import ArgumentParser
    import pickle
    import pprint

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1


    def load_form(filename):
        """Load pdf form contents into a nested list of name/value tuples"""
        with open(filename, 'rb') as fp:
            parser = PDFParser(fp)
            doc = PDFDocument(parser)
            parser.set_document(doc)
            # doc.set_parser(parser)
            # doc.initialize()  # not available in recent releases such as pdfminer.six
            return [load_fields(resolve1(f)) for f in
                    resolve1(doc.catalog['AcroForm'])['Fields']]


    def load_fields(field):
        """Recursively load form fields"""
        form = field.get('Kids', None)
        if form:
            return [load_fields(resolve1(f)) for f in form]
        else:
            # Some field types, like signatures, need extra resolving
            return (field.get('T').decode('utf-8'), resolve1(field.get('V')))


    def parse_cli():
        """Load command line arguments"""
        parser = ArgumentParser(description='Dump the form contents of a PDF.')
        parser.add_argument('file', metavar='pdf_form',
                            help='PDF Form to dump the contents of')
        parser.add_argument('-o', '--out', help='Write output to file',
                            default=None, metavar='FILE')
        parser.add_argument('-p', '--pickle', action='store_true', default=False,
                            help='Format output for python consumption')
        return parser.parse_args()


    def main():
        args = parse_cli()
        form = load_form(args.file)
        if args.out:
            # Pickle output is binary; the pretty-printed output is text.
            with open(args.out, 'wb' if args.pickle else 'w') as outfile:
                if args.pickle:
                    pickle.dump(form, outfile)
                else:
                    pp = pprint.PrettyPrinter(indent=2)
                    # Write to outfile (the original post wrote to `file` here by mistake)
                    outfile.write(pp.pformat(form))
        else:
            if args.pickle:
                print(pickle.dumps(form))
            else:
                pp = pprint.PrettyPrinter(indent=2)
                pp.pprint(form)


    if __name__ == '__main__':
        main()
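The two versions of this script decode the field name differently ('utf-16' in the question, 'utf-8' in this update), and neither choice fits every file. In the PDF format, a text string is either UTF-16BE with a leading byte-order mark or PDFDocEncoding, and PDF 2.0 also allows UTF-8 with a byte-order mark. The sketch below picks the codec from the leading bytes. The function name decode_pdf_text is hypothetical, and Latin-1 is used only as an approximation of PDFDocEncoding.

```python
import codecs


def decode_pdf_text(raw):
    """Decode a PDF text string based on its byte-order mark, if any.

    Hypothetical helper; the Latin-1 fallback is an approximation of
    PDFDocEncoding, which differs from Latin-1 for some code points.
    """
    if isinstance(raw, str):
        return raw  # already decoded
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    return raw.decode("latin-1")


utf16_name = decode_pdf_text(b"\xfe\xff\x00N\x00a\x00m\x00e")
plain_name = decode_pdf_text(b"Name")
```

In load_fields(), a helper like this could stand in for the hard-coded .decode(...) call on the 'T' entry.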
Answer by Shane
There is a typo in the main() function of the scripts posted above. The line:
    file.write(pp.pformat(form))
should be:
    outfile.write(pp.pformat(form))
Answer by dvska
Python 3.6+:
pip install PyPDF2
    # -*- coding: utf-8 -*-
    # Note: this relies on the PdfFileReader API and on private helpers
    # (_checkKids, _buildField) as named in older PyPDF2 releases.
    from collections import OrderedDict

    from PyPDF2 import PdfFileReader


    def _getFields(obj, tree=None, retval=None, fileobj=None):
        """
        Extracts field data if this PDF contains interactive form fields.
        The *tree* and *retval* parameters are for recursive use.

        :param fileobj: A file object (usually a text file) to write
            a report to on all interactive form fields found.
        :return: A dictionary where each key is a field name, and each
            value is a :class:`Field<PyPDF2.generic.Field>` object. By
            default, the mapping name is used for keys.
        :rtype: dict, or ``None`` if form data could not be located.
        """
        fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name',
                           '/TU': 'Alternate Field Name', '/TM': 'Mapping Name',
                           '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
        if retval is None:
            retval = OrderedDict()
            catalog = obj.trailer["/Root"]
            # get the AcroForm tree
            if "/AcroForm" in catalog:
                tree = catalog["/AcroForm"]
            else:
                return None
        if tree is None:
            return retval

        obj._checkKids(tree, retval, fileobj)
        for attr in fieldAttributes:
            if attr in tree:
                # Tree is a field
                obj._buildField(tree, retval, fileobj, fieldAttributes)
                break

        if "/Fields" in tree:
            fields = tree["/Fields"]
            for f in fields:
                field = f.getObject()
                obj._buildField(field, retval, fileobj, fieldAttributes)

        return retval


    def get_form_fields(infile):
        # Keep the file open while the fields are read, then close it.
        with open(infile, 'rb') as fh:
            fields = _getFields(PdfFileReader(fh))
            if fields is None:
                # The PDF has no AcroForm, so there is nothing to extract.
                return OrderedDict()
            return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())


    if __name__ == '__main__':
        from pprint import pprint

        pdf_file_name = 'FormExample.pdf'
        pprint(get_form_fields(pdf_file_name))

