Parse annotations from a PDF with Python

Question

Asked by davidb

I want a Python function that takes a PDF and returns a list of the text of the note annotations in the document. I have looked at python-poppler (https://code.launchpad.net/~poppler-python/poppler-python/trunk), but I cannot figure out how to get anything useful out of it.

I found the get_annot_mapping method and modified the provided demo program to call it via self.current_page.get_annot_mapping(), but I have no idea what to do with an AnnotMapping object. It does not seem to be fully implemented, providing only the copy method.

If there are any other libraries that provide this function, that's fine as well.

Answer 1

Accepted answer by davidb

Turns out the bindings were incomplete. It is now fixed. https://bugs.launchpad.net/poppler-python/+bug/397850

Answer 2

Answered by Enno Gröper

In case somebody is looking for working code, here is a script I use.

# Python 2 script (print statements, urllib.pathname2url).
import os
import sys
import urllib

import poppler


def main():
    input_filename = sys.argv[1]
    # The document is opened from a file:// URI, see http://blog.hartwork.org/?p=612
    uri = 'file://%s' % urllib.pathname2url(os.path.abspath(input_filename))
    document = poppler.document_new_from_file(uri, None)
    n_pages = document.get_n_pages()
    all_annots = 0

    for i in range(n_pages):
        page = document.get_page(i)
        for annot_mapping in page.get_annot_mapping():
            annot = annot_mapping.annot
            # Skip hyperlinks; report every other annotation type.
            if annot.get_annot_type().value_name != 'POPPLER_ANNOT_LINK':
                all_annots += 1
                print 'page: {0:3}, {1:10}, type: {2:10}, content: {3}'.format(
                    i + 1, annot.get_modified(),
                    annot.get_annot_type().value_nick, annot.get_contents())

    if all_annots > 0:
        print str(all_annots) + " annotation(s) found"
    else:
        print "no annotations found"


if __name__ == "__main__":
    main()

Answer 3

Answered by mxl

You should definitely have a look at PyPDF2. This library lets you extract almost anything from a PDF, including images and comments. Start by examining what Acrobat Reader DC (Reader) can give you for a PDF's comments. Take a simple PDF, annotate it (add some comments) with Reader, and in the comments tab in the upper right corner click the three horizontal dots, then click Export All To Data File... and select the format with the extension xfdf. This creates an XML file which you can parse. The format is transparent and largely self-explanatory.

If, however, you cannot rely on a user clicking this and instead need to extract the same data from a PDF programmatically using Python, do not despair: there is a solution. (Inspired by the question "Extract images from PDF without resampling, in python?")

Prerequisites:

PyPDF2 (pip install PyPDF2)

The xfdf file that Reader exports, as described above, looks like this:

<?xml version="1.0" ?>
<xfdf xml:space="preserve" xmlns="http://ns.adobe.com/xfdf/">
    <annots>
        <caret IT="Replace" color="#0000FF" creationdate="D:20190221151519+01'00'" date="D:20190221151526+01'00'" flags="print" fringe="1.069520,1.069520,1.069520,1.069520" name="72f8d1b7-d878-4281-bd33-3a6fb4578673" page="0" rect="636.942000,476.891000,652.693000,489.725000" subject="Inserted Text" title="Admin">
            <contents-richtext>
                <body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
                    <p dir="ltr">
                        <span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"> comment1</span>
                    </p>
                </body>
            </contents-richtext>
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,374.656000,941.008000,488.656000"/>
        </caret>
        <highlight color="#FFD100" coords="183.867000,402.332000,220.968000,402.332000,183.867000,387.587000,220.968000,387.587000" creationdate="D:20190221151441+01'00'" date="D:20190221151448+01'00'" flags="print" name="a18c7fb0-0af3-435e-8c32-1af2af3c46ea" opacity="0.399994" page="0" rect="179.930000,387.126000,224.904000,402.793000" subject="Highlight" title="Admin">
            <contents-richtext>
                <body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
                    <p dir="ltr">
                        <span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span>
                    </p>
                </body>
            </contents-richtext>
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,288.332000,941.008000,402.332000"/>
        </highlight>
        <caret color="#0000FF" creationdate="D:20190221151452+01'00'" date="D:20190221151452+01'00'" flags="print" fringe="0.828156,0.828156,0.828156,0.828156" name="6bf0226e-a3fb-49bf-bc89-05bb671e1627" page="0" rect="285.877000,372.978000,298.073000,382.916000" subject="Inserted Text" title="Admin">
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,268.088000,941.008000,382.088000"/>
        </caret>
        <strikeout IT="StrikeOutTextEdit" color="#0000FF" coords="588.088000,497.406000,644.818000,497.406000,588.088000,477.960000,644.818000,477.960000" creationdate="D:20190221151519+01'00'" date="D:20190221151519+01'00'" flags="print" inreplyto="72f8d1b7-d878-4281-bd33-3a6fb4578673" name="6686b852-3924-4252-af21-c1b10390841f" page="0" rect="582.290000,476.745000,650.616000,498.621000" replyType="group" subject="Cross-Out" title="Admin">
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,383.406000,941.008000,497.406000"/>
        </strikeout>
    </annots>
    <f href="p1.pdf"/>
    <ids modified="ABB10FA107DAAA47822FB5D311112349" original="474F087D87E7E544F6DEB9E0A93ADFB2"/>
</xfdf>
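
Since the exported xfdf file is plain XML, it can be read with Python's standard library. The following is a minimal sketch and not part of the original answer: the function name xfdf_comments is made up for illustration, and it assumes the comment text sits in a contents-richtext element (as in the file above) or, failing that, in a plain contents element.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <xfdf> root element in the export above.
XFDF_NS = "{http://ns.adobe.com/xfdf/}"


def xfdf_comments(xfdf_text):
    """Return (page, annotation_tag, text) tuples for annotations that carry text.

    Hypothetical helper: assumes the text is in <contents-richtext> or <contents>.
    """
    root = ET.fromstring(xfdf_text)
    annots = root.find(XFDF_NS + "annots")
    results = []
    if annots is None:
        return results
    for annot in annots:
        rich = annot.find(XFDF_NS + "contents-richtext")
        plain = annot.find(XFDF_NS + "contents")
        node = rich if rich is not None else plain
        if node is None:
            continue  # e.g. a caret or strikeout without a comment
        # Join all nested text nodes and drop the surrounding indentation.
        text = "".join(node.itertext()).strip()
        if text:
            tag = annot.tag.replace(XFDF_NS, "")
            results.append((int(annot.get("page", 0)), tag, text))
    return results
```

It could be called as xfdf_comments(open('comments.xfdf').read()), where comments.xfdf is a placeholder for the exported file. Annotations without any text, such as the bare caret and the strikeout above, are skipped.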

In the xfdf file, the various types of comments appear as tags within an <annots> block. Python can give you almost the same data. To obtain it, look at the output of the following script:

import sys
import PyPDF2

# Uses the PyPDF2 1.x method names from the original answer; later releases
# renamed them (PdfReader, reader.pages, get_object()).
src = sys.argv[1] if len(sys.argv) > 1 else r'/path/to/my/file.pdf'

with open(src, "rb") as f:
    input1 = PyPDF2.PdfFileReader(f)
    nPages = input1.getNumPages()

    for i in range(nPages):
        page0 = input1.getPage(i)
        # Pages without annotations have no /Annots entry.
        if '/Annots' not in page0:
            continue
        for annot in page0['/Annots']:
            print(annot.getObject())       # (1)
            print('')

For the same PDF that produced the xfdf file above, the output will look like this:

{'/Popup': IndirectObject(192, 0), '/M': u"D:20190221151448+01'00'", '/CreationDate': u"D:20190221151441+01'00'", '/NM': u'a18c7fb0-0af3-435e-8c32-1af2af3c46ea', '/F': 4, '/C': [1, 0.81961, 0], '/Rect': [179.93, 387.126, 224.904, 402.793], '/Type': '/Annot', '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u'otrasneho', '/QuadPoints': [183.867, 402.332, 220.968, 402.332, 183.867, 387.587, 220.968, 387.587], '/Subj': u'Highlight', '/CA': 0.39999, '/AP': {'/N': IndirectObject(202, 0)}, '/Subtype': '/Highlight'}

{'/Parent': IndirectObject(191, 0), '/Rect': [737.008, 288.332, 941.008, 402.332], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A425D0>, '/Subtype': '/Popup'}

{'/Popup': IndirectObject(194, 0), '/M': u"D:20190221151452+01'00'", '/CreationDate': u"D:20190221151452+01'00'", '/NM': u'6bf0226e-a3fb-49bf-bc89-05bb671e1627', '/F': 4, '/C': [0, 0, 1], '/Subj': u'Inserted Text', '/Rect': [285.877, 372.978, 298.073, 382.916], '/Type': '/Annot', '/P': IndirectObject(5, 0), '/AP': {'/N': IndirectObject(201, 0)}, '/RD': [0.82816, 0.82816, 0.82816, 0.82816], '/T': u'Admin', '/Subtype': '/Caret'}

{'/Parent': IndirectObject(193, 0), '/Rect': [737.008, 268.088, 941.008, 382.088], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42830>, '/Subtype': '/Popup'}

{'/Popup': IndirectObject(196, 0), '/M': u"D:20190221151519+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'6686b852-3924-4252-af21-c1b10390841f', '/F': 4, '/IRT': IndirectObject(197, 0), '/C': [0, 0, 1], '/Rect': [582.29, 476.745, 650.616, 498.621], '/Type': '/Annot', '/T': u'Admin', '/P': IndirectObject(5, 0), '/QuadPoints': [588.088, 497.406, 644.818, 497.406, 588.088, 477.96, 644.818, 477.96], '/Subj': u'Cross-Out', '/IT': '/StrikeOutTextEdit', '/AP': {'/N': IndirectObject(200, 0)}, '/RT': '/Group', '/Subtype': '/StrikeOut'}

{'/Parent': IndirectObject(195, 0), '/Rect': [737.008, 383.406, 941.008, 497.406], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AF0>, '/Subtype': '/Popup'}

{'/Popup': IndirectObject(198, 0), '/M': u"D:20190221151526+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'72f8d1b7-d878-4281-bd33-3a6fb4578673', '/F': 4, '/C': [0, 0, 1], '/Rect': [636.942, 476.891, 652.693, 489.725], '/Type': '/Annot', '/RD': [1.06952, 1.06952, 1.06952, 1.06952], '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment1</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u' pica', '/Subj': u'Inserted Text', '/IT': '/Replace', '/AP': {'/N': IndirectObject(212, 0)}, '/Subtype': '/Caret'}

{'/Parent': IndirectObject(197, 0), '/Rect': [737.008, 374.656, 941.008, 488.656], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AB0>, '/Subtype': '/Popup'}
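
The /M and /CreationDate values in the output above are PDF date strings of the form D:YYYYMMDDHHmmSS+HH'mm'. As a minimal sketch (the helper name parse_pdf_date is made up for illustration, and only the layout seen in this output plus a trailing Z is handled), they can be turned into datetime objects like this:

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm' offset.
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def parse_pdf_date(value):
    """Convert a PDF date string such as D:20190221151448+01'00' to a datetime."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError("not a PDF date string: %r" % (value,))
    defaults = (0, 1, 1, 0, 0, 0)  # missing month/day default to 1, time parts to 0
    parts = [int(g) if g else d for g, d in zip(match.groups()[:6], defaults)]
    sign, off_h, off_m = match.group(7), match.group(8), match.group(9)
    if sign is None:
        tz = None  # no offset given: return a naive datetime
    elif sign in "Zz":
        tz = timezone.utc
    else:
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(-delta if sign == "-" else delta)
    return datetime(*parts, tzinfo=tz)
```

For example, parse_pdf_date("D:20190221151448+01'00'").isoformat() gives '2019-02-21T15:14:48+01:00'.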

If you examine the output, you will see that both outputs carry more or less the same information. Every comment in the xfdf file has two counterparts in PyPDF2's output: the annotation itself and its /Popup.

The /C attribute is the color of the highlight, in RGB, scaled to floats in the range [0, 1]. /Rect defines the bounding box of the comment on the page/spread, in points (1/72 of an inch) relative to the lower-left corner of the page, with values increasing to the right and upwards. /M and /CreationDate are the modification and creation times. /QuadPoints is an array of [x1, y1, x2, y2, ..., xn, yn] coordinates of a line around the comment. /Subj, /Type, /Subtype and /IT identify the type of the comment, /T is probably the creator, and /RC is an XHTML representation of the comment's text, if there is one.

If there is an ink-drawn comment, it will have an /InkList attribute with data in the form [[L1x1, L1y1, L1x2, L1y2, ..., L1xn, L1yn], [L2x1, L2y1, ..., L2xn, L2yn], ..., [Lmx1, Lmy1, ..., Lmxn, Lmyn]] for line 1, line 2, ..., line m.
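
To compare /C with the color attribute in the xfdf file, the floats can be scaled back to 0-255 and written as hex. This is a small sketch; the helper name pdf_color_to_hex is made up for illustration, and it only handles the three-component RGB case seen in the output above.

```python
def pdf_color_to_hex(components):
    """Convert a /C array of three floats in [0, 1] to an #RRGGBB string."""
    if len(components) != 3:
        raise ValueError("expected three RGB components")
    # Scale to 0-255, round, and clamp in case of slightly out-of-range values.
    channels = [min(255, max(0, int(round(float(c) * 255)))) for c in components]
    return "#{:02X}{:02X}{:02X}".format(*channels)
```

For the highlight above, pdf_color_to_hex([1, 0.81961, 0]) returns '#FFD100', which matches the color attribute in the xfdf file.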

For a more thorough explanation of the various fields you get from getObject() in the Python code on the line labeled (1), please consult https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, especially section 12.5 Annotations on pages 381–413.
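
To get back to the original question (a list of the note texts), the dictionaries returned by getObject() can be filtered for their /Contents entry. The following is a minimal sketch: note_texts is a made-up name, and it assumes each annotation behaves like a dict, which is how they print in the output above.

```python
def note_texts(annotations):
    """Return the /Contents text of every non-popup annotation that has one."""
    texts = []
    for annot in annotations:
        # Popups only describe the pop-up window; the text lives on the parent.
        if annot.get('/Subtype') == '/Popup':
            continue
        contents = annot.get('/Contents')
        if contents:
            texts.append(str(contents))
    return texts
```

Inside the page loop of the script above, this could be called as note_texts(a.getObject() for a in page0['/Annots']).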

Answer 4

Answered by neok

Here is a working example (ported from a previous answer) that extracts annotations with the Python module popplerqt5. Run it as: python3 extract.py sample.pdf

import popplerqt5
import argparse


def extract(fn):
    doc = popplerqt5.Poppler.Document.load(fn)
    annotations = []
    for i in range(doc.numPages()):
        page = doc.page(i)
        for annot in page.annotations():
            contents = annot.contents()
            if contents:
                annotations.append(contents)
                print(f'page={i + 1} {contents}')

    print(f'{len(annotations)} annotation(s) found')
    return annotations


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('fn')
    args = parser.parse_args()
    extract(args.fn)

Answer 5

Answered by joelostblom

The pdf-annots script can extract annotations from PDFs. It is built upon PDFMiner.six and produces markdown output both for the highlighted text and for any annotations made on it, such as comments on highlighted areas or popup boxes. The output looks similar to this:

 * Page 2 Highlight:
 > Underlying text that was highlighted

 Comment made on highlighted text.

 * Page 3 Highlight: "Short highlighted text" -- Short comment.

 * Page 4 Text: A note on the page.

The full command options can be seen below.

usage: pdfannots.py [-h] [-p] [-o OUTFILE] [-n COLS] [-s [SEC [SEC ...]]] [--no-group]
                    [--print-filename] [-w COLS]
                    INFILE [INFILE ...]

Extracts annotations from a PDF file in markdown format for use in reviewing.

positional arguments:
  INFILE                PDF files to process

optional arguments:
  -h, --help            show this help message and exit

Basic options:
  -p, --progress        emit progress information
  -o OUTFILE            output file (default is stdout)
  -n COLS, --cols COLS  number of columns per page in the document (default: 2)

Options controlling output format:
  -s [SEC [SEC ...]], --sections [SEC [SEC ...]]
                        sections to emit (default: highlights, comments, nits)
  --no-group            emit annotations in order, don't group into sections
  --print-filename      print the filename when it has annotations
  -w COLS, --wrap COLS  wrap text at this many output columns

I haven't tried this out extensively, but it has been working well so far!

Answer 6

Answered by creativecoding

Somebody asked a similar question. I tried the code sample there, and it did not work for me until I made a few functional and cosmetic changes.

#!/usr/bin/ruby

require 'pdf-reader'

ARGV.each do |filename|
  PDF::Reader.open(filename) do |reader|
    puts "file: #{filename}"
    puts "page\tcomment"
    reader.pages.each do |page|
      annots_ref = page.attributes[:Annots]
      if annots_ref
        actual_annots = annots_ref.map { |a| reader.objects[a] }
        actual_annots.each do |actual_annot|
          unless actual_annot[:Contents].nil?
            puts "#{page.number}\t#{actual_annot[:Contents]}"
          end
        end
      end
    end       
  end
end

If saved as pdfannot.rb, made executable with chmod +x and placed into your favourite PATH directory, usage is:

./pdfannot.rb <path>

This is my first time writing/editing/remixing Ruby code, so I am very open to suggestions. HTH.

On a side note, finding this question earlier could have saved me from doing the work twice. Hopefully this question gets more attention in the future so that it is easier to find.

Answer 7

Answered by zeroDivisible

I have never used this, nor have I needed this kind of feature, but I found PDFMiner. This link has information about basic usage; maybe this is what you are looking for?