Python: using pyPdf for indirect object extraction

Question

Asked by Giancarlo

Following this example, I can list all the elements in a PDF file:

import pyPdf
pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))  # PDFs are binary, so open in "rb" mode
list(pdf.pages)  # Process all the objects.
print(pdf.resolvedObjects)

Now I need to extract a non-standard object from the PDF file.

My object is the one named MYOBJECT and it is a string.

The part of the script's output that concerns me is:

{'/MYOBJECT': IndirectObject(584, 0)}

The relevant part of the PDF file is:

558 0 obj
<</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources
  <</ColorSpace <</CS0 563 0 R>>
    /ExtGState <</GS0 568 0 R>>
    /Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>
    /ProcSet[/PDF/Text/ImageC]
    /Properties<</MC0<</MYOBJECT 584 0 R>>/MC1<</SubKey 582 0 R>> >>
    /XObject<</Im0 578 0 R>>>>
  /Rotate 0/StructParents 0/Type/Page>>
endobj
...
...
...
584 0 obj
<</Length 8>>stream

1_22_4_1     --->>>>  this is the string I need to extract from the object

endstream
endobj

How can I follow the reference to object 584 to get at my string (using pyPdf, of course)?

Answer 1

Accepted answer by Jehiah

Each element in pdf.pages is a dictionary, so assuming it's on page 1, pdf.pages[0]['/MYOBJECT'] should be the element you want.

You can try printing that on its own, or poke at it with help and dir at a Python prompt to learn more about how to get the string you want.

Edit:

After receiving a copy of the PDF, I found the object at pdf.resolvedObjects[0][558]['/Resources']['/Properties']['/MC0']['/MYOBJECT'], and the value can be retrieved via getData().

The following function gives a more generic way to solve this by recursively looking for the key in question:

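import pyPdf

pdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))  # PDFs are binary, so open in 'rb' mode
pages = list(pdf.pages)  # read every page so that resolvedObjects gets populated

def findInDict(needle, haystack):
    """Recursively search nested dictionaries for the key `needle`.

    Returns the value stored under the key, or None if it is not found.
    """
    for key in haystack.keys():
        try:
            value = haystack[key]  # the lookup itself can fail, so skip bad entries
        except Exception:
            continue
        if key == needle:
            return value
        # descend into plain dicts and into pyPdf dictionary objects
        if isinstance(value, (dict, pyPdf.generic.DictionaryObject)):
            x = findInDict(needle, value)
            if x is not None:
                return x
    return None

answer = findInDict('/MYOBJECT', pdf.resolvedObjects).getData()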
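
The recursive search above only descends into dictionaries. As a rough sketch of the same idea, the variant below also walks lists (such as a /Kids array) and remembers which containers it has already visited, so a back-reference like /Parent cannot send it into an endless loop. It runs on plain Python dicts that merely imitate the shape of the output shown in the question; find_key, page, pages_node and resolved are illustrative names, not part of pyPdf, and the data is made up to match the question.

```python
def find_key(needle, haystack, _seen=None):
    """Depth-first search for `needle` in nested dicts and lists.

    Returns the first matching value, or None if the key is absent.
    """
    if _seen is None:
        _seen = set()
    if id(haystack) in _seen:      # already visited: avoid looping forever
        return None
    _seen.add(id(haystack))

    if isinstance(haystack, dict):
        if needle in haystack:
            return haystack[needle]
        children = haystack.values()
    elif isinstance(haystack, (list, tuple)):
        children = haystack
    else:
        return None                # scalars have nothing to search

    for child in children:
        found = find_key(needle, child, _seen)
        if found is not None:
            return found
    return None


# Plain-dict stand-in (assumption): shaped like the dump in the question,
# but not produced by pyPdf.
page = {
    "/Type": "/Page",
    "/Resources": {
        "/Properties": {
            "/MC0": {"/MYOBJECT": "1_22_4_1"},
            "/MC1": {"/SubKey": "other"},
        },
    },
}
pages_node = {"/Type": "/Pages", "/Kids": [page]}
page["/Parent"] = pages_node       # back-reference creates a cycle
resolved = {0: {29: pages_node, 558: page}}

print(find_key("/MYOBJECT", resolved))
print(find_key("/Missing", resolved))
```

One limitation of this sketch: a key whose stored value is None is indistinguishable from a missing key.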

Answer 2

Answer by Tony Meyer

An IndirectObject refers to an actual object (it's like a link or alias so that the total size of the PDF can be reduced when the same content appears in multiple places). The getObject method will give you the actual object.

If the object is a text object, then just doing a str() or unicode() on the object should get you the data inside of it.

Alternatively, pyPdf stores the objects in the resolvedObjects attribute. For example, a PDF that contains this object:

13 0 obj
<< /Type /Catalog /Pages 3 0 R >>
endobj

can be read like this:

>>> import pyPdf
>>> pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
>>> pages = list(pdf.pages)
>>> pdf.resolvedObjects
{0: {2: {'/Parent': IndirectObject(3, 0), '/Contents': IndirectObject(4, 0), '/Type': '/Page', '/Resources': IndirectObject(6, 0), '/MediaBox': [0, 0, 595.2756, 841.8898]}, 3: {'/Kids': [IndirectObject(2, 0)], '/Count': 1, '/Type': '/Pages', '/MediaBox': [0, 0, 595.2756, 841.8898]}, 4: {'/Filter': '/FlateDecode'}, 5: 147, 6: {'/ColorSpace': {'/Cs1': IndirectObject(7, 0)}, '/ExtGState': {'/Gs2': IndirectObject(9, 0), '/Gs1': IndirectObject(10, 0)}, '/ProcSet': ['/PDF', '/Text'], '/Font': {'/F1.0': IndirectObject(8, 0)}}, 13: {'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}}}
>>> pdf.resolvedObjects[0][13]
{'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}
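
To make the "link or alias" idea concrete, here is a toy model of an indirect reference. It is a sketch under stated assumptions and not pyPdf's implementation: Ref, get_object and table are hypothetical names, and the table just copies the generation-then-object-number layout visible in the resolvedObjects dump above.

```python
class Ref(object):
    """Hypothetical stand-in for an indirect reference such as `584 0 R`."""

    def __init__(self, idnum, generation, table):
        self.idnum = idnum
        self.generation = generation
        self._table = table        # shared {generation: {idnum: object}} store

    def get_object(self):
        # Look the target up by generation first, then by object number.
        return self._table[self.generation][self.idnum]

    def __repr__(self):
        return "Ref(%d, %d)" % (self.idnum, self.generation)


# Object table laid out like the resolvedObjects dump: generation, then id.
# The contents are made up to mirror the question (assumption).
table = {0: {}}
table[0][584] = "1_22_4_1"
table[0][558] = {
    "/Type": "/Page",
    "/Resources": {"/Properties": {"/MC0": {"/MYOBJECT": Ref(584, 0, table)}}},
}

ref = table[0][558]["/Resources"]["/Properties"]["/MC0"]["/MYOBJECT"]
print(ref)                # the reference itself
print(ref.get_object())   # the object it points to
```

The dictionary for object 558 stores only the small reference; the content lives once in the table under its own number, which is why the same object can be shared from several places.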

Answer 3

Answer by Tony Meyer

Jehiah's method is good if you need to look everywhere for the object. My guess (looking at the PDF) is that it is always in the same place (the first page, in the 'MC0' property), and so a much simpler method of finding the string would be:

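import pyPdf
pdf = pyPdf.PdfFileReader(open("file.pdf", "rb"))  # PDFs are binary, so open in "rb" mode
# walk the fixed path from the first page down to the stream, then read its data
data = pdf.getPage(0)['/Resources']['/Properties']['/MC0']['/MYOBJECT'].getData()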
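
If the object is usually, but not always, at that fixed location, a chained lookup raises KeyError on the first missing key. A minimal sketch of a more forgiving lookup follows. It runs on a plain dict standing in for the page dictionary, and get_path, page and path are hypothetical names rather than pyPdf API. With pyPdf, the value found at the end of the path would still need getData() as in the answer above.

```python
def get_path(root, keys, default=None):
    """Follow `keys` one after another through nested dicts.

    Returns `default` as soon as a key is missing or a non-dict is reached.
    """
    node = root
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Plain-dict stand-in for the page dictionary (assumption, not a pyPdf page).
page = {"/Resources": {"/Properties": {"/MC0": {"/MYOBJECT": "1_22_4_1"}}}}
path = ["/Resources", "/Properties", "/MC0", "/MYOBJECT"]

print(get_path(page, path))
print(get_path(page, ["/Resources", "/Properties", "/MC9", "/MYOBJECT"]))
```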
