Python: parse annotations from a PDF
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/1106098/
Parse annotations from a pdf
Asked by davidb
I want a python function that takes a pdf and returns a list of the text of the note annotations in the document. I have looked at python-poppler (https://code.launchpad.net/~poppler-python/poppler-python/trunk) but I cannot figure out how to get it to give me anything useful.
I found the get_annot_mapping method and modified the demo program provided to call it via self.current_page.get_annot_mapping(), but I have no idea what to do with an AnnotMapping object. It seems to not be fully implemented, providing only the copy method.
If there are any other libraries that provide this function, that's fine as well.
Accepted answer by davidb
Turns out the bindings were incomplete. It is now fixed. https://bugs.launchpad.net/poppler-python/+bug/397850
Answer by Enno Gröper
Just in case somebody is looking for some working code. Here is a script I use.
# Python 2 script for the python-poppler bindings
import poppler
import sys
import urllib
import os


def main():
    input_filename = sys.argv[1]
    # poppler wants a file:// URI, not a plain path
    # http://blog.hartwork.org/?p=612
    document = poppler.document_new_from_file('file://%s' % \
        urllib.pathname2url(os.path.abspath(input_filename)), None)
    n_pages = document.get_n_pages()
    all_annots = 0

    for i in range(n_pages):
        page = document.get_page(i)
        annot_mappings = page.get_annot_mapping()
        num_annots = len(annot_mappings)
        if num_annots > 0:
            for annot_mapping in annot_mappings:
                # skip hyperlinks, report everything else
                if annot_mapping.annot.get_annot_type().value_name != 'POPPLER_ANNOT_LINK':
                    all_annots += 1
                    print 'page: {0:3}, {1:10}, type: {2:10}, content: {3}'.format(i+1, annot_mapping.annot.get_modified(), annot_mapping.annot.get_annot_type().value_nick, annot_mapping.annot.get_contents())

    if all_annots > 0:
        print str(all_annots) + " annotation(s) found"
    else:
        print "no annotations found"


if __name__ == "__main__":
    main()
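The script above hands poppler a file:// URI rather than a plain path. On Python 3, where urllib.pathname2url has moved to urllib.request, the same kind of URI can probably be built with pathlib instead. This is only a sketch: the helper name to_file_uri is made up for illustration and is not part of any poppler binding.

```python
import os
from pathlib import Path


def to_file_uri(path):
    """Return an absolute file:// URI for a local path (hypothetical helper)."""
    # as_uri() needs an absolute path and percent-encodes special characters.
    return Path(os.path.abspath(path)).as_uri()


if __name__ == "__main__":
    # A relative name is resolved against the current directory first.
    print(to_file_uri("my file.pdf"))
```

The space in the file name comes out as %20, which is the reason for not simply prefixing the path with 'file://'.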
Answer by mxl
You should DEFINITELY have a look at PyPDF2. This amazing library has incredible potential: you can extract whatever you want from a PDF, including images or comments. Start by examining what Acrobat Reader DC (Reader) can give you on a PDF's comments. Take a simple PDF, annotate it (add some comments) with Reader, and in the comments tab in the upper right corner, click the horizontal three dots, click Export All To Data File... and select the format with the extension xfdf. This creates a wonderful xml file which you can parse. The format is very transparent and self-evident.
If, however, you cannot rely on a user clicking this and instead need to extract the same data from a PDF programmatically using python, do not despair, there is a solution. (Inspired by Extract images from PDF without resampling, in python?)
Prerequisites:
PyPDF2 (pip install PyPDF2)
What Reader gives you in the above-mentioned xfdf file looks like this:
<?xml version="1.0" ?>
<xfdf xml:space="preserve" xmlns="http://ns.adobe.com/xfdf/">
<annots>
<caret IT="Replace" color="#0000FF" creationdate="D:20190221151519+01'00'" date="D:20190221151526+01'00'" flags="print" fringe="1.069520,1.069520,1.069520,1.069520" name="72f8d1b7-d878-4281-bd33-3a6fb4578673" page="0" rect="636.942000,476.891000,652.693000,489.725000" subject="Inserted Text" title="Admin">
<contents-richtext>
<body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
<p dir="ltr">
<span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"> comment1</span>
</p>
</body>
</contents-richtext>
<popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,374.656000,941.008000,488.656000"/>
</caret>
<highlight color="#FFD100" coords="183.867000,402.332000,220.968000,402.332000,183.867000,387.587000,220.968000,387.587000" creationdate="D:20190221151441+01'00'" date="D:20190221151448+01'00'" flags="print" name="a18c7fb0-0af3-435e-8c32-1af2af3c46ea" opacity="0.399994" page="0" rect="179.930000,387.126000,224.904000,402.793000" subject="Highlight" title="Admin">
<contents-richtext>
<body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
<p dir="ltr">
<span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span>
</p>
</body>
</contents-richtext>
<popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,288.332000,941.008000,402.332000"/>
</highlight>
<caret color="#0000FF" creationdate="D:20190221151452+01'00'" date="D:20190221151452+01'00'" flags="print" fringe="0.828156,0.828156,0.828156,0.828156" name="6bf0226e-a3fb-49bf-bc89-05bb671e1627" page="0" rect="285.877000,372.978000,298.073000,382.916000" subject="Inserted Text" title="Admin">
<popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,268.088000,941.008000,382.088000"/>
</caret>
<strikeout IT="StrikeOutTextEdit" color="#0000FF" coords="588.088000,497.406000,644.818000,497.406000,588.088000,477.960000,644.818000,477.960000" creationdate="D:20190221151519+01'00'" date="D:20190221151519+01'00'" flags="print" inreplyto="72f8d1b7-d878-4281-bd33-3a6fb4578673" name="6686b852-3924-4252-af21-c1b10390841f" page="0" rect="582.290000,476.745000,650.616000,498.621000" replyType="group" subject="Cross-Out" title="Admin">
<popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,383.406000,941.008000,497.406000"/>
</strikeout>
</annots>
<f href="p1.pdf"/>
<ids modified="ABB10FA107DAAA47822FB5D311112349" original="474F087D87E7E544F6DEB9E0A93ADFB2"/>
</xfdf>
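An export like this can be read with Python's standard library alone. The following is a minimal sketch, assuming the file follows the layout shown above; the function name read_xfdf_annotations, the keys of the returned dicts and the trimmed-down SAMPLE_XFDF string are all made up for illustration.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "{http://ns.adobe.com/xfdf/}"

# Trimmed-down stand-in for an exported file, modelled on the sample above.
SAMPLE_XFDF = """<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<annots>
<highlight page="0" subject="Highlight" title="Admin" name="a18c7fb0">
<contents-richtext>
<body xmlns="http://www.w3.org/1999/xhtml"><p dir="ltr"><span dir="ltr">comment2</span></p></body>
</contents-richtext>
<popup open="no" page="0"/>
</highlight>
<caret page="0" subject="Inserted Text" title="Admin" name="6bf0226e">
<popup open="no" page="0"/>
</caret>
</annots>
</xfdf>"""


def read_xfdf_annotations(xfdf_text):
    """Return one dict per annotation element found inside <annots>."""
    root = ET.fromstring(xfdf_text)
    annots = root.find(XFDF_NS + "annots")
    if annots is None:
        return []
    results = []
    for element in annots:
        rich = element.find(XFDF_NS + "contents-richtext")
        # itertext() walks the nested XHTML and yields only its text nodes.
        text = "".join(rich.itertext()).strip() if rich is not None else ""
        results.append({
            "type": element.tag.split("}", 1)[-1],  # drop the namespace part
            "page": int(element.get("page", "0")),
            "title": element.get("title"),
            "text": text,
        })
    return results


if __name__ == "__main__":
    for item in read_xfdf_annotations(SAMPLE_XFDF):
        print(item)
```

For a real export, read the .xfdf file and pass its text to the function. Annotations without a contents-richtext child, like the second caret in the sample, come back with an empty text.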
The various types of comments appear in the xfdf file as tags within an <annots> block. Python can give you almost the same data. To obtain it, have a look at the output of the following script:
import sys
import PyPDF2

# PdfFileReader, getNumPages and getPage are the legacy PyPDF2 names
# used by the original answer; newer releases renamed them.
try:
    src = sys.argv[1]
except IndexError:
    src = r'/path/to/my/file.pdf'

with open(src, "rb") as f:
    input1 = PyPDF2.PdfFileReader(f)
    nPages = input1.getNumPages()

    for i in range(nPages):
        page0 = input1.getPage(i)
        try:
            for annot in page0['/Annots']:
                print(annot.getObject())  # (1)
                print('')
        except KeyError:
            # there are no annotations on this page
            pass
The output for the same file as in the xfdf file above will look like this:
{'/Popup': IndirectObject(192, 0), '/M': u"D:20190221151448+01'00'", '/CreationDate': u"D:20190221151441+01'00'", '/NM': u'a18c7fb0-0af3-435e-8c32-1af2af3c46ea', '/F': 4, '/C': [1, 0.81961, 0], '/Rect': [179.93, 387.126, 224.904, 402.793], '/Type': '/Annot', '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u'otrasneho', '/QuadPoints': [183.867, 402.332, 220.968, 402.332, 183.867, 387.587, 220.968, 387.587], '/Subj': u'Highlight', '/CA': 0.39999, '/AP': {'/N': IndirectObject(202, 0)}, '/Subtype': '/Highlight'}
{'/Parent': IndirectObject(191, 0), '/Rect': [737.008, 288.332, 941.008, 402.332], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A425D0>, '/Subtype': '/Popup'}
{'/Popup': IndirectObject(194, 0), '/M': u"D:20190221151452+01'00'", '/CreationDate': u"D:20190221151452+01'00'", '/NM': u'6bf0226e-a3fb-49bf-bc89-05bb671e1627', '/F': 4, '/C': [0, 0, 1], '/Subj': u'Inserted Text', '/Rect': [285.877, 372.978, 298.073, 382.916], '/Type': '/Annot', '/P': IndirectObject(5, 0), '/AP': {'/N': IndirectObject(201, 0)}, '/RD': [0.82816, 0.82816, 0.82816, 0.82816], '/T': u'Admin', '/Subtype': '/Caret'}
{'/Parent': IndirectObject(193, 0), '/Rect': [737.008, 268.088, 941.008, 382.088], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42830>, '/Subtype': '/Popup'}
{'/Popup': IndirectObject(196, 0), '/M': u"D:20190221151519+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'6686b852-3924-4252-af21-c1b10390841f', '/F': 4, '/IRT': IndirectObject(197, 0), '/C': [0, 0, 1], '/Rect': [582.29, 476.745, 650.616, 498.621], '/Type': '/Annot', '/T': u'Admin', '/P': IndirectObject(5, 0), '/QuadPoints': [588.088, 497.406, 644.818, 497.406, 588.088, 477.96, 644.818, 477.96], '/Subj': u'Cross-Out', '/IT': '/StrikeOutTextEdit', '/AP': {'/N': IndirectObject(200, 0)}, '/RT': '/Group', '/Subtype': '/StrikeOut'}
{'/Parent': IndirectObject(195, 0), '/Rect': [737.008, 383.406, 941.008, 497.406], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AF0>, '/Subtype': '/Popup'}
{'/Popup': IndirectObject(198, 0), '/M': u"D:20190221151526+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'72f8d1b7-d878-4281-bd33-3a6fb4578673', '/F': 4, '/C': [0, 0, 1], '/Rect': [636.942, 476.891, 652.693, 489.725], '/Type': '/Annot', '/RD': [1.06952, 1.06952, 1.06952, 1.06952], '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment1</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u' pica', '/Subj': u'Inserted Text', '/IT': '/Replace', '/AP': {'/N': IndirectObject(212, 0)}, '/Subtype': '/Caret'}
{'/Parent': IndirectObject(197, 0), '/Rect': [737.008, 374.656, 941.008, 488.656], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AB0>, '/Subtype': '/Popup'}
If you examine the output, you will see that it carries more or less the same information as the xfdf file. Every comment in the xfdf file has two counterparts in PyPDF2's output: the annotation itself and its /Popup. The /C attribute is the color of the highlight, in RGB, as floats in the range [0, 1]. /Rect defines the bounding box of the comment on the page/spread, in points (1/72 of an inch) relative to the lower-left corner of the page, with values increasing to the right and upward. /M and /CreationDate are the modification and creation times. /QuadPoints is an array of [x1, y1, x2, y2, ..., xn, yn] coordinates of a line around the comment. /Subj, /Type, /Subtype and /IT identify the type of the comment, /T is probably the creator, and /RC is an XHTML representation of the comment's text, if there is one. An ink-drawn comment will have an /InkList attribute with data in the form [[L1x1, L1y1, L1x2, L1y2, ..., L1xn, L1yn], [L2x1, L2y1, ..., L2xn, L2yn], ..., [Lmx1, Lmy1, ..., Lmxn, Lmyn]] for line 1, line 2, ..., line m.
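To make these fields easier to work with, they can be converted into ordinary Python values. The sketch below does this for /C, /M and /RC, using values copied from the highlight annotation in the output above. The helper names are made up for illustration, and the date parser only handles the full D:YYYYMMDDHHMMSS form with an optional offset, which is an assumption about the input rather than a complete implementation of the PDF date format.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone


def color_to_hex(rgb):
    """Convert a /C array of numbers in [0, 1] to a #RRGGBB string."""
    return "#" + "".join("{:02X}".format(round(c * 255)) for c in rgb)


def parse_pdf_date(value):
    """Parse a date such as D:20190221151448+01'00' (full form only)."""
    match = re.match(r"D:(\d{14})(Z|[+-]\d{2}'\d{2}'?)?$", value)
    if match is None:
        raise ValueError("unsupported PDF date: %r" % value)
    stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    zone = match.group(2)
    if zone is None:
        return stamp  # no offset given: leave the datetime naive
    if zone == "Z":
        return stamp.replace(tzinfo=timezone.utc)
    # zone looks like +01'00': sign, two hour digits, quote, two minute digits
    offset = timedelta(hours=int(zone[1:3]), minutes=int(zone[4:6]))
    if zone[0] == "-":
        offset = -offset
    return stamp.replace(tzinfo=timezone(offset))


def rich_text_to_plain(rc):
    """Strip the XHTML markup from an /RC value, keeping only the text."""
    return "".join(ET.fromstring(rc).itertext()).strip()


# Values copied from the highlight annotation printed above (stand-in dict).
sample = {
    "/C": [1, 0.81961, 0],
    "/M": "D:20190221151448+01'00'",
    "/RC": '<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" '
           'xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" '
           'xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" >'
           '<p dir="ltr"><span dir="ltr">comment2</span></p></body>',
}

if __name__ == "__main__":
    print(color_to_hex(sample["/C"]))
    print(parse_pdf_date(sample["/M"]).isoformat())
    print(rich_text_to_plain(sample["/RC"]))
```

The color conversion gives back the same #FFD100 that Reader wrote into the color attribute of the highlight in the xfdf file, which is a quick way to check that both sources describe the same annotation.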
For a more thorough explanation of the various fields you get from getObject() on the line labeled (1) in the Python code above, please consult https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, especially section 12.5 Annotations on pages 381–413.
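Putting this together, the function the question asks for could look roughly like the sketch below. It assumes each annotation has already been resolved to a dict-like object (with PyPDF2, by calling getObject() on every entry of a page's /Annots, as in the script above). The name note_texts and the plain dicts used as sample data are stand-ins for illustration, not PyPDF2 objects.

```python
def note_texts(annotations):
    """Return the /Contents text of every annotation that has one."""
    texts = []
    for annot in annotations:
        if annot.get("/Subtype") == "/Popup":
            continue  # popups only describe the pop-up window, not the note
        contents = annot.get("/Contents")
        if contents:
            texts.append(contents)
    return texts


# Stand-in data: plain dicts shaped like the resolved objects printed above.
sample_annotations = [
    {"/Subtype": "/Highlight", "/Subj": "Highlight", "/Contents": "otrasneho"},
    {"/Subtype": "/Popup"},
    {"/Subtype": "/Caret", "/Subj": "Inserted Text"},
    {"/Subtype": "/Caret", "/Subj": "Inserted Text", "/Contents": " pica"},
]

if __name__ == "__main__":
    print(note_texts(sample_annotations))
```

Annotations without /Contents, such as the second caret in the output above, are skipped. Whether /Contents or the text inside /RC is the better source probably depends on the tool that wrote the PDF.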
Answer by neok
Here is a working example (ported from previous answer) extracting annotations with the python module popplerqt5: python3 extract.py sample.pdf
import popplerqt5
import argparse


def extract(fn):
    doc = popplerqt5.Poppler.Document.load(fn)
    annotations = []
    for i in range(doc.numPages()):
        page = doc.page(i)
        for annot in page.annotations():
            contents = annot.contents()
            if contents:
                annotations.append(contents)
                print(f'page={i + 1} {contents}')

    print(f'{len(annotations)} annotation(s) found')
    return annotations


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('fn')
    args = parser.parse_args()
    extract(args.fn)
Answer by joelostblom
The pdf-annots script can extract annotations from PDFs. It is built upon PDFMiner.six and produces output in markdown both for the highlighted text and any annotations made on it, such as comments on highlighted areas or popup boxes. The output would look similar to this:
* Page 2 Highlight:
> Underlying text that was highlighted
Comment made on highlighted text.
* Page 3 Highlight: "Short highlighted text" -- Short comment.
* Page 4 Text: A note on the page.
The full command options can be seen below.
usage: pdfannots.py [-h] [-p] [-o OUTFILE] [-n COLS] [-s [SEC [SEC ...]]] [--no-group]
[--print-filename] [-w COLS]
INFILE [INFILE ...]
Extracts annotations from a PDF file in markdown format for use in reviewing.
positional arguments:
INFILE PDF files to process
optional arguments:
-h, --help show this help message and exit
Basic options:
-p, --progress emit progress information
-o OUTFILE output file (default is stdout)
-n COLS, --cols COLS number of columns per page in the document (default: 2)
Options controlling output format:
-s [SEC [SEC ...]], --sections [SEC [SEC ...]]
sections to emit (default: highlights, comments, nits)
--no-group emit annotations in order, don't group into sections
--print-filename print the filename when it has annotations
-w COLS, --wrap COLS wrap text at this many output columns
I haven't tried this out extensively, but it has been working well so far!
Answer by creativecoding
Somebody asked a similar question. I tried the code sample there and it did not work for me until I made a few functional and cosmetic changes.
#!/usr/bin/ruby
require 'pdf-reader'

ARGV.each do |filename|
  PDF::Reader.open(filename) do |reader|
    puts "file: #{filename}"
    puts "page\tcomment"
    reader.pages.each do |page|
      annots_ref = page.attributes[:Annots]
      if annots_ref
        actual_annots = annots_ref.map { |a| reader.objects[a] }
        actual_annots.each do |actual_annot|
          unless actual_annot[:Contents].nil?
            puts "#{page.number}\t#{actual_annot[:Contents]}"
          end
        end
      end
    end
  end
end
If saved as pdfannot.rb, chmod +x'ed and placed into your favourite PATH directory, usage is:
./pdfannot.rb <path>
First time writing/editing/remixing Ruby code, so very open for suggestions. HTH.
On a side note, finding this question earlier could have saved me from double work. Hopefully this question gets more attention in the future such that it is easier to find.