Python: extract specific data from a .pdf and save it in an Excel file

Notice: this page reproduces a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/33917637/

Time: 2020-08-19 14:11:36  Source: igfitidea

Extract specific data from .pdf and save in Excel file

python, extract

Asked by Xavier Villafaina

Every month I need to extract some data from .pdf files to create an Excel table.

I'm able to convert the .pdf file to text, but I'm not sure how to extract and save the specific information I want. This is the code I have now:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    fstr = ''
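    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    # retstr accumulates the text of every page, so read it once after the
    # loop; reading it inside the loop would duplicate earlier pages
    fstr = retstr.getvalue()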

    fp.close()
    device.close()
    retstr.close()
    return fstr

print convert_pdf_to_txt("FA20150518.pdf")

And this is the result:

    >>> 
AVILA?72,?VALLDOREIX
08197?SANT?CUGAT?DEL?VALLES
(BARCELONA)
TELF:?935441851
NIF:?B65512725
EMAIL:[email protected]

JOSE?LUIS?MARTINEZ?LOPEZ

AVDA.?DEL?ESLA,?33-D
24240?SANTA?MARIA?DEL?PARAMO
LEON
TELF:?600871170

FECHA
17/06/15

FACTURA
??20150518

CLIENTE
43000335

N.I.F.

71548163?B

PáG.

1

No?VIAJE

RUTA

DESTINATARIO?/?REFERENCIA

KG

BULTOS

IMPORTE

2015064210-08/06/15

CERDANYOLA?DEL?VALLES?->?VINAROS

FERRER?ALIMENTACION?-?VINAROZ

2,000.0

1

?????????150,00

TOTAL?IMP.

%

IMPORTE

BASE

?????????150,00

?????????150,00

%
?21,00

IVA

%

REC.

TOTAL?FRA.

()

??????????31,50

?????????181,50

Eur

Forma?Pago:
Banco:

CONTADO

Vencimientos:
17/06/15
181,50

OK, now I have the text returned by the function convert_pdf_to_txt.

I want to extract this information: customer, bill number, price, expiration date and payment method.

The customer name is always below "EMAIL:[email protected]"

The bill number is always below "FACTURA"

The price is always two lines below "Vencimientos:"

The expiration date is always below "Vencimientos:"

The payment method is always below "Banco:"

I'm thinking of doing something like the following. If I can convert this text into a list, I could search it like this:

Searching for the customer:

 i = 0
 while i < lengthlist:
     if listitem[i] == "EMAIL:[email protected]":
         i += 1                    # move to the line after the label
         Customer = listitem[i]
         i = lengthlist            # stop the loop
     else:
         i += 1

Searching for the bill number:

 i = 0
 while i < lengthlist:
     if listitem[i] == "FACTURA":
         i += 1                    # move to the line after the label
         BillNumber = listitem[i]
         i = lengthlist            # stop the loop
     else:
         i += 1

After that, I don't know how to save the data in Excel. I'm sure I can find examples in the forum, but first I need to extract only this data.

Answered by Mel

Let's take a simpler example, which I hope represents your issue.

You have a string stringPDF like this:

name1 \n
\n
value1 \n
name2 \n
value2 \n
\n
name3 \n
otherValue \n
value3 \n

A value is X lines after a name (in your example X is often 1, sometimes 2, but let's just say it can be any number). \n represents the line breaks (when you print the string, it prints on multiple lines).

First, we convert the string to a list of lines, by splitting where there are line breaks:

>>> stringList=stringPDF.split("\n")
>>> print(stringList)
['name1 ', '', 'value1 ', 'name2 ', 'value2 ', '', 'name3 ', 'otherValue ', 'value3 ', '']

Depending on your string, you may need to clean it. Here I have some extra whitespace at the end ('name1 ' instead of 'name1'). I use a list comprehension and strip() to remove it:

stringList=[line.strip() for line in stringList]

Once we have a proper list, we can define a simple function that returns a value, given the name and X (the number of lines between name and value):

def get_value(l,name,Xline):
    indexName=l.index(name)  #find the index of the name in the list
    indexValue=indexName+Xline # add X to this index
    return l[indexValue]  #get the value

>>> print(get_value(stringList, "name2", 1))
value2
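
Applied to the invoice text from the question, the same lookup might look like the sketch below. The sample lines are copied from the output in the question, written with plain spaces. The helper name get_value_or_default and its default argument are hypothetical additions; they avoid an exception when a label is missing or when the offset runs past the end of the list.

```python
def get_value_or_default(lines, name, offset, default=""):
    """Return the line `offset` positions after `name`, or `default`."""
    try:
        index = lines.index(name)      # raises ValueError if name is absent
    except ValueError:
        return default
    target = index + offset
    if target >= len(lines):           # the value would fall past the end
        return default
    return lines[target]


# Sample lines taken from the invoice text in the question
invoice_lines = [line.strip() for line in
                 "FACTURA\n  20150518\n\nCLIENTE\n43000335\n".split("\n")]

print(get_value_or_default(invoice_lines, "FACTURA", 1))
print(get_value_or_default(invoice_lines, "CLIENTE", 1))
print(get_value_or_default(invoice_lines, "IMPORTE", 1, "not found"))
```

Returning a default instead of raising keeps a monthly batch running when one invoice has a slightly different layout; whether that is preferable to failing loudly depends on how the results are checked afterwards.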

Answered by Michel

Try something like this:

txtList = convert_pdf_to_txt("FA20150518.pdf").splitlines()
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1

for idx, line in enumerate(txtList):
    if "EMAIL: [email protected]" in line:
        nameIdx = idx + 1 # in your example it should be +2...

    if "FACTURA" in line:
        billNumIdx = idx + 1

    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1

    if "Banco:" in line:
        paymentIdx = idx + 1

name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''

If you are sure that the key lines contain only what you are looking for ("FACTURA" and so on), you can replace the conditions with an exact comparison:

if line == "FACTURA":
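
The same idea can also be written table-driven, so that adding a field means adding one entry rather than another if block. This is only a sketch: the FIELDS table, the extract_fields name and the simplified sample text (blank lines removed) are illustrative assumptions, with offsets mirroring the ones used in the loop above.

```python
# field name -> (label to look for, number of lines between label and value)
FIELDS = {
    "billNum": ("FACTURA", 1),
    "expirDate": ("Vencimientos:", 1),
    "price": ("Vencimientos:", 2),
    "payment": ("Banco:", 1),
}


def extract_fields(lines, fields=FIELDS):
    """Return {field: value}; fields whose label is missing map to ''."""
    result = {}
    for field, (label, offset) in fields.items():
        value = ""
        for idx, line in enumerate(lines):
            if label in line and idx + offset < len(lines):
                value = lines[idx + offset]
                # no break: as in the loop above, the last match wins
        result[field] = value
    return result


# Simplified sample: the real text has blank lines that shift some offsets
sample = "FACTURA\n20150518\nBanco:\nCONTADO\nVencimientos:\n17/06/15\n181,50"
print(extract_fields(sample.splitlines()))
```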

Answered by Riet

You had the right idea:

string = convert_pdf_to_txt("FA20150518.pdf")
lines = list(filter(bool,string.split('\n')))
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]
print(custData)
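
Filtering out empty lines changes how far a value sits from its label, which is why the offsets here differ from those needed on the raw text. The short sketch below illustrates this on a hypothetical excerpt; "EMAIL:address" is a stand-in for the real label line, and stripping before filtering is an added assumption so that whitespace-only lines are dropped as well.

```python
# Hypothetical excerpt; "EMAIL:address" stands in for the real label line
text = "EMAIL:address\n\nJOSE LUIS MARTINEZ LOPEZ\n\nFACTURA\n  20150518\n"

raw_lines = text.split("\n")
# Strip each line and keep only those that still contain something
non_empty = [line.strip() for line in raw_lines if line.strip()]

raw_idx = raw_lines.index("EMAIL:address")
clean_idx = non_empty.index("EMAIL:address")

print(raw_lines[raw_idx + 2])      # name is two lines down in the raw text
print(non_empty[clean_idx + 1])    # and one line down once blanks are gone
```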

Answered by Xavier Villafaina

Thanks for your help. I took code from two of the examples you gave me, and now I can extract all the info I want.

# -*- coding: cp1252 -*-
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    fstr = ''
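    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    # retstr accumulates the text of every page, so read it once after the
    # loop; reading it inside the loop would duplicate earlier pages
    fstr = retstr.getvalue()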

    fp.close()
    device.close()
    retstr.close()
    return fstr


factura = "FA20150483.pdf"
# example 1

string = convert_pdf_to_txt(factura)
lines = list(filter(bool,string.split('\n')))
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]



# example 2
txtList = convert_pdf_to_txt(factura).splitlines()
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1

for idx, line in enumerate(txtList):
    if line == "EMAIL: [email protected]":
        nameIdx = idx +2 # in your example it should be +2...

    if line == "FACTURA":
        billNumIdx = idx + 1

    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1

    if "Banco:" in line:
        paymentIdx = idx + 1

name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''


print expirDate

billNum = billNum.replace("????", "")


print billNum


custData['Name'] = custData['Name'].replace("?", "")

print custData['Name']


custData['paymentType'] = custData['paymentType'].replace("?", "")

print custData['paymentType']

print price

A few examples:

    >>> 
25/06/15
20150480
BABY?RACE?S.L.
REMESA?DIA?25?FECHA?FACTURA
15,23
>>> ================================ RESTART ================================
>>> 
05/06/15
20150481
LOFT?CUINA,?S.L.
DIA?5?FECHA?FACTURA
91,79
>>> ================================ RESTART ================================
>>> 
05/06/15
20150482
GRAFIQUES?MOGENT?S.L.
DIA?5?FECHA?FACTURA
128,42
>>> ================================ RESTART ================================
>>> 
30/06/15
20150483
CHIEMIVALL?SL
30?DIAS?FECHA?FACTURA
1.138,58
>>>
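
The question also asks how to save the result for Excel, which none of the answers cover. One possible approach, sketched here, is to write a CSV file with the standard csv module, which Excel can open; a native .xlsx file would need a third-party library instead. The rows reuse values from the output above, written with plain spaces, and write_invoices_csv and the column names are hypothetical.

```python
import csv
import io

COLUMNS = ["Name", "BillNumber", "ExpirationDate", "PaymentType", "Price"]


def write_invoices_csv(rows, out):
    """Write a header plus one row per invoice to the file-like object `out`."""
    # ';' is a common delimiter where ',' is the decimal separator;
    # adjust it to whatever your Excel locale expects
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter=";")
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"Name": "BABY RACE S.L.", "BillNumber": "20150480",
     "ExpirationDate": "25/06/15", "PaymentType": "REMESA DIA 25 FECHA FACTURA",
     "Price": "15,23"},
    {"Name": "CHIEMIVALL SL", "BillNumber": "20150483",
     "ExpirationDate": "30/06/15", "PaymentType": "30 DIAS FECHA FACTURA",
     "Price": "1.138,58"},
]

buffer = io.StringIO()   # stands in for open("invoices.csv", "w", newline="")
write_invoices_csv(rows, buffer)
print(buffer.getvalue())
```

Keeping the prices as text preserves the original "1.138,58" formatting; converting them to numbers would be a separate step.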