用于将PDF转换为文本的Python模块

时间:2020-03-05 18:42:44  来源:igfitidea点击:

哪些是将PDF文件转换为文本的最佳Python模块?

解决方案

回答

试试PDFMiner。它可以从PDF文件中以HTML,SGML或者"标记的PDF"格式提取文本。

http://www.unixuser.org/~euske/python/pdfminer/index.html

Tagged PDF格式似乎是最干净的格式,而去掉XML标签只会留下纯文本。

Python 3版本可在以下位置获得:

  • https://github.com/pdfminer/pdfminer.six

回答

PDFminer在我尝试使用的pdf文件的每一页上给了我一行[第7页,共1页...]。

到目前为止,我最好的答案是pdftoipe或者基于Xpdf的c ++代码。

请参阅我的问题,了解pdftoipe的输出是什么样的。

回答

Pdftotext一个开放源代码程序(Xpdf的一部分),我们可以从python调用它(不是我们想要的,但是可能有用)。我使用它没有问题。我认为Google在Google桌面中使用它。

回答

pyPDF可以正常工作(假设我们使用的是格式良好的PDF)。如果我们想要的只是文本(带有空格),则可以执行以下操作:

import pyPdf
pdf = pyPdf.PdfFileReader(open(filename, "rb"))
for page in pdf.pages:
    print page.extractText()

我们还可以轻松访问元数据,图像数据等。

extractText代码中的注释说明:

Locate all text drawing commands, in
  the order they are provided in the
  content stream, and extract the text. 
  This works well for some PDF files,
  but poorly for others, depending on
  the generator used.  This will be
  refined in the future.  Do not rely on
  the order of text coming out of this
  function, as it will change if this
  function is made more sophisticated.

这是否是一个问题取决于我们对文本的处理方式(例如,顺序无关紧要,就可以了,或者如果生成器按显示顺序将文本添加到流中,就可以了) 。我在日常使用中有pyPdf提取代码,没有任何问题。

回答

另外,还有PDFTextStream,这是一个商业Java库,也可以从Python使用。

回答

我们也可以很容易地将pdfminer用作库。我们可以访问pdf的内容模型,并且可以创建自己的文本提取。我使用以下代码将pdf内容转换为以分号分隔的文本。

该函数仅根据TextItem内容对象的y和x坐标对它们进行排序,并输出与一条文本行具有相同y坐标的项目,并用';'将同一行上的对象分隔开人物。

使用这种方法,我能够从pdf提取文本,而其他工具无法提取适合进一步解析的内容。我尝试过的其他工具包括pdftotext,ps2ascii和在线工具pdftextonline.com。

pdfminer是pdf抓取的宝贵工具。

def pdf_to_csv(filename):
    from pdflib.page import TextItem, TextConverter
    from pdflib.pdfparser import PDFDocument, PDFParser
    from pdflib.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda : {})
            for child in self.cur_item.objs:
                if isinstance(child, TextItem):
                    (_,_,x,y) = child.bbox
                    line = lines[int(-y)]
                    line[x] = child.text

            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, "ascii")

    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(doc, fp)
    doc.initialize('')

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %d\n" % i)
        interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()

更新:

上面的代码是针对API的旧版本编写的,请参见下面的评论。

回答

自Codeape发布以来,PDFMiner软件包已更改。

编辑(再次):

PDFMiner已在20100213版本中再次更新。

我们可以使用以下方法检查已安装的版本:

>>> import pdfminer
>>> pdfminer.__version__
'20100213'

这是更新的版本(带有有关我更改/添加的内容的注释):

def pdf_to_csv(filename):
    from cStringIO import StringIO  #<-- added so you can copy/paste this to try it
    from pdfminer.converter import LTTextItem, TextConverter
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda : {})
            for child in self.cur_item.objs:
                if isinstance(child, LTTextItem):
                    (_,_,x,y) = child.bbox                   #<-- changed
                    line = lines[int(-y)]
                    line[x] = child.text.encode(self.codec)  #<-- changed

            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8")  #<-- changed 
        # becuase my test documents are utf-8 (note: utf-8 is the default codec)

    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)       #<-- changed
    parser.set_document(doc)     #<-- added
    doc.set_parser(parser)       #<-- added
    doc.initialize('')

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %d\n" % i)
        interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()

编辑(再次):

这是pypi最新版本的更新,20100619p1。简而言之,我用LTChar替换了LTTextItem,并将LAParams的实例传递给CsvConverter构造函数。

def pdf_to_csv(filename):
    from cStringIO import StringIO  
    from pdfminer.converter import LTChar, TextConverter    #<-- changed
    from pdfminer.layout import LAParams
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda : {})
            for child in self.cur_item.objs:
                if isinstance(child, LTChar):               #<-- changed
                    (_,_,x,y) = child.bbox                   
                    line = lines[int(-y)]
                    line[x] = child.text.encode(self.codec)

            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())  #<-- changed
        # becuase my test documents are utf-8 (note: utf-8 is the default codec)

    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)       
    parser.set_document(doc)     
    doc.set_parser(parser)       
    doc.initialize('')

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %d\n" % i)
        if page is not None:
            interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()

编辑(再过一次):

已针对版本20110515更新(感谢Oeufcoque Penteano!):

def pdf_to_csv(filename):
    from cStringIO import StringIO  
    from pdfminer.converter import LTChar, TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda : {})
            for child in self.cur_item._objs:                #<-- changed
                if isinstance(child, LTChar):
                    (_,_,x,y) = child.bbox                   
                    line = lines[int(-y)]
                    line[x] = child._text.encode(self.codec) #<-- changed

            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())
        # becuase my test documents are utf-8 (note: utf-8 is the default codec)

    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)       
    parser.set_document(doc)     
    doc.set_parser(parser)       
    doc.initialize('')

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %d\n" % i)
        if page is not None:
            interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()

回答

重新利用pdfminer随附的pdf2txt.py代码;我们可以使函数采用pdf路径; (可选)为输出类型(txt | html | xml | tag),并选择类似命令行pdf2txt {'-o':'/path/to/outfile.txt'...}。默认情况下,我们可以调用:

convert_pdf(path)

将创建一个文本文件,该文件在文件系统上为原始pdf的同级文件。

def convert_pdf(path, outtype='txt', opts={}):
    import sys
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, TagExtractor
    from pdfminer.layout import LAParams
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.cmapdb import CMapDB

    outfile = path[:-3] + outtype
    outdir = '/'.join(path.split('/')[:-1])

    debug = 0
    # input option
    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    codec = 'utf-8'
    pageno = 1
    scale = 1
    showpageno = True
    laparams = LAParams()
    for (k, v) in opts:
        if k == '-d': debug += 1
        elif k == '-p': pagenos.update( int(x)-1 for x in v.split(',') )
        elif k == '-m': maxpages = int(v)
        elif k == '-P': password = v
        elif k == '-o': outfile = v
        elif k == '-n': laparams = None
        elif k == '-A': laparams.all_texts = True
        elif k == '-D': laparams.writing_mode = v
        elif k == '-M': laparams.char_margin = float(v)
        elif k == '-L': laparams.line_margin = float(v)
        elif k == '-W': laparams.word_margin = float(v)
        elif k == '-O': outdir = v
        elif k == '-t': outtype = v
        elif k == '-c': codec = v
        elif k == '-s': scale = float(v)
    #
    CMapDB.debug = debug
    PDFResourceManager.debug = debug
    PDFDocument.debug = debug
    PDFParser.debug = debug
    PDFPageInterpreter.debug = debug
    PDFDevice.debug = debug
    #
    rsrcmgr = PDFResourceManager()
    if not outtype:
        outtype = 'txt'
        if outfile:
            if outfile.endswith('.htm') or outfile.endswith('.html'):
                outtype = 'html'
            elif outfile.endswith('.xml'):
                outtype = 'xml'
            elif outfile.endswith('.tag'):
                outtype = 'tag'
    if outfile:
        outfp = file(outfile, 'w')
    else:
        outfp = sys.stdout
    if outtype == 'txt':
        device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
    elif outtype == 'xml':
        device = XMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, outdir=outdir)
    elif outtype == 'html':
        device = HTMLConverter(rsrcmgr, outfp, codec=codec, scale=scale, laparams=laparams, outdir=outdir)
    elif outtype == 'tag':
        device = TagExtractor(rsrcmgr, outfp, codec=codec)
    else:
        return usage()

    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password)
    fp.close()
    device.close()

    outfp.close()
    return

回答

我已经将pdftohtml与'-xml'参数一起使用,并通过subprocess.Popen()读取了结果,这将为我们提供pdf中每个文本"片段"的x坐标,y坐标,宽度,高度和字体。我认为这也是"证据"可能会使用的原因,因为出现了相同的错误消息。

如果我们需要处理柱状数据,由于必须发明一种适合pdf文件的算法,它会变得稍微复杂一些。问题在于制作PDF文件的程序实际上不一定以任何逻辑格式对文本进行布局。我们可以尝试使用简单的排序算法,并且有时可以使用,但是可能很少出现"散乱"和"杂散"的情况,这些文字没有按照我们认为的顺序排列...因此我们必须发挥创造力。

我花了大约5个小时才弄清一份我正在研究的pdf文件。但现在效果很好。祝你好运。