Python: extract specific data from a .pdf and save it in an Excel file
Note: this page is based on a popular Stack Overflow question. The content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow (not to this page).
Original question: http://stackoverflow.com/questions/33917637/
Extract specific data from .pdf and save in Excel file
Asked by Xavier Villafaina
Every month I need to extract some data from .pdf files to create an Excel table.
I'm able to convert the .pdf file to text, but I'm not sure how to extract and save the specific information I want. This is the code I have now:
# Python 2 code (uses cStringIO, file() and the print statement)
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    fstr = ''
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    # read the accumulated text once, after all pages have been processed
    str = retstr.getvalue()
    fstr += str
    fp.close()
    device.close()
    retstr.close()
    return fstr

print convert_pdf_to_txt("FA20150518.pdf")
And this is the result:
>>>
AVILA?72,?VALLDOREIX
08197?SANT?CUGAT?DEL?VALLES
(BARCELONA)
TELF:?935441851
NIF:?B65512725
EMAIL:[email protected]
JOSE?LUIS?MARTINEZ?LOPEZ
AVDA.?DEL?ESLA,?33-D
24240?SANTA?MARIA?DEL?PARAMO
LEON
TELF:?600871170
FECHA
17/06/15
FACTURA
??20150518
CLIENTE
43000335
N.I.F.
71548163?B
PáG.
1
No?VIAJE
RUTA
DESTINATARIO?/?REFERENCIA
KG
BULTOS
IMPORTE
2015064210-08/06/15
CERDANYOLA?DEL?VALLES?->?VINAROS
FERRER?ALIMENTACION?-?VINAROZ
2,000.0
1
?????????150,00
TOTAL?IMP.
%
IMPORTE
BASE
?????????150,00
?????????150,00
%
?21,00
IVA
%
REC.
TOTAL?FRA.
()
??????????31,50
?????????181,50
Eur
Forma?Pago:
Banco:
CONTADO
Vencimientos:
17/06/15
181,50
OK, now I have the text returned by convert_pdf_to_txt.
I want to extract this information: customer, bill number, price, expiration date, and payment method.
The customer name is always on the line below "EMAIL:[email protected]".
The bill number is always on the line below "FACTURA".
The price is always two lines below "Vencimientos:".
The expiration date is always on the line below "Vencimientos:".
The payment method is always on the line below "Banco:".
I am thinking of doing something like the following, if I can convert this text into a list:
Searching for the customer:
i = 0
while i < lengthlist:
    if listitem[i] == "EMAIL:[email protected]":
        Customer = listitem[i + 1]
        break
    i += 1
Searching for the bill number:
i = 0
while i < lengthlist:
    if listitem[i] == "FACTURA":
        BillNumber = listitem[i + 1]
        break
    i += 1
After that, I don't know how to save the data in Excel, but I'm sure I can find examples in the forum. First I need to extract only this data.
Answered by Mel
Let's take a simpler example that I hope represents your issue.
You have a string stringPDF like this:
name1 \n
\n
value1 \n
name2 \n
value2 \n
\n
name3 \n
otherValue \n
value3 \n
A value is X lines after a name (in your example X is often 1, sometimes 2, but let's just say it can be any number). The \n represents a line break (when you print the string, it prints on multiple lines).
First, we convert the string to a list of lines by splitting at the line breaks:
>>> stringList=stringPDF.split("\n")
>>> print(stringList)
['name1 ', '', 'value1 ', 'name2 ', 'value2 ', '', 'name3 ', 'otherValue ', 'value3 ', '']
Depending on your string, you may need to clean it. Here I have some extra whitespace at the end of each line ('name1 ' instead of 'name1'). I use a list comprehension and strip() to remove it:
stringList=[line.strip() for line in stringList]
Once we have a proper list, we can define a simple function that returns a value, given the name and X (the value is X lines after the name):
def get_value(l, name, Xline):
    indexName = l.index(name)  # find the index of the name in the list
    indexValue = indexName + Xline  # add X to this index
    return l[indexValue]  # get the value
>>> print(get_value(stringList, "name2", 1))
value2
Answered by Michel
Try something like this:
txtList = convert_pdf_to_txt("FA20150518.pdf").splitlines()
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1
for idx, line in enumerate(txtList):
    if "EMAIL: [email protected]" in line:
        nameIdx = idx + 1  # in your example it should be +2...
    if "FACTURA" in line:
        billNumIdx = idx + 1
    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1
    if "Banco:" in line:
        paymentIdx = idx + 1
name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''
If you are sure that the key lines contain only what you are looking for ("FACTURA" and so on), you can replace the conditions with an exact comparison:
if line == "FACTURA":
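The choice between the two conditions matters for these invoices. In the asker's later output, the payment line itself can contain the word FACTURA (for example DIA?5?FECHA?FACTURA), so a substring test matches twice and the last match wins. The sketch below shows the difference on a shortened list. The line order is assumed from the first invoice, and find_after is a hypothetical helper, not code from the answer.

```python
# Shortened invoice lines; the order is an assumption based on the first
# invoice in the question, the values come from the asker's later output.
lines = [
    "FACTURA",
    "20150481",
    "Banco:",
    "DIA?5?FECHA?FACTURA",
    "Vencimientos:",
    "05/06/15",
    "91,79",
]


def find_after(lines, label, exact):
    """Return the line after the last line matching `label`, or ''."""
    found = ""
    for idx, line in enumerate(lines):
        matches = (line == label) if exact else (label in line)
        if matches and idx + 1 < len(lines):
            found = lines[idx + 1]  # a later match overwrites an earlier one
    return found


substring_result = find_after(lines, "FACTURA", exact=False)
exact_result = find_after(lines, "FACTURA", exact=True)

print(substring_result)  # Vencimientos:
print(exact_result)      # 20150481
```

With the substring test, the payment line is treated as a second FACTURA label and the bill number is lost. The exact comparison avoids this, at the cost of missing labels that carry extra characters on the same line.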
Answered by Riet
You had the right idea:
string = convert_pdf_to_txt("FA20150518.pdf")
lines = list(filter(bool, string.split('\n')))
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]
print(custData)
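One thing to watch with this loop is that lines[i+1] and lines[i+2] raise IndexError if a label sits at the very end of the text. A possible variation, sketched below under stated assumptions, keeps the label rules in a table and checks the bounds before indexing. RULES, extract_fields, the use of startswith instead of in, and the extra expirationDate rule (taken from the question's description) are all illustrative choices, not part of the answer.

```python
# (label, key, offset) rules. Four of them mirror the loop above; the
# "expirationDate" rule is an addition based on the question's description.
RULES = [
    ("EMAIL:", "Name", 1),
    ("FACTURA", "BillNumber", 1),
    ("Vencimientos:", "price", 2),
    ("Vencimientos:", "expirationDate", 1),
    ("Banco:", "paymentType", 1),
]


def extract_fields(text):
    """Apply RULES to the non-empty lines of `text` and return a dict."""
    lines = [line for line in text.split("\n") if line.strip()]
    data = {}
    for i, line in enumerate(lines):
        for label, key, offset in RULES:
            target = i + offset
            # Keep the first match per key and never index past the end.
            if line.startswith(label) and key not in data and target < len(lines):
                data[key] = lines[target]
    return data


# Lines copied from the invoice text in the question.
sample = (
    "EMAIL:[email protected]\n"
    "JOSE?LUIS?MARTINEZ?LOPEZ\n"
    "\n"
    "FACTURA\n"
    "??20150518\n"
    "Banco:\n"
    "CONTADO\n"
    "Vencimientos:\n"
    "17/06/15\n"
    "181,50\n"
)
fields = extract_fields(sample)
print(fields)

# Text cut off before the price: the missing field is skipped, not an error.
truncated = extract_fields("FACTURA\n??20150518\nVencimientos:\n17/06/15")
print(truncated)
```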
Answered by Xavier Villafaina
Thanks for your help. I took code from two of the examples you gave me, and now I can extract all the info I want.
# -*- coding: cp1252 -*-
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    fstr = ''
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    # read the accumulated text once, after all pages have been processed
    str = retstr.getvalue()
    fstr += str
    fp.close()
    device.close()
    retstr.close()
    return fstr
factura = "FA20150483.pdf"

# example 1
string = convert_pdf_to_txt(factura)
lines = list(filter(bool, string.split('\n')))
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]

# example 2
txtList = convert_pdf_to_txt(factura).splitlines()
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1
for idx, line in enumerate(txtList):
    if line == "EMAIL: [email protected]":
        nameIdx = idx + 2  # in your example it should be +2...
    if line == "FACTURA":
        billNumIdx = idx + 1
    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1
    if "Banco:" in line:
        paymentIdx = idx + 1
name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''

print expirDate
billNum = billNum.replace("????", "")
print billNum
custData['Name'] = custData['Name'].replace("?", "")
print custData['Name']
custData['paymentType'] = custData['paymentType'].replace("?", "")
print custData['paymentType']
print price
A few examples:
>>>
25/06/15
20150480
BABY?RACE?S.L.
REMESA?DIA?25?FECHA?FACTURA
15,23
>>> ================================ RESTART ================================
>>>
05/06/15
20150481
LOFT?CUINA,?S.L.
DIA?5?FECHA?FACTURA
91,79
>>> ================================ RESTART ================================
>>>
05/06/15
20150482
GRAFIQUES?MOGENT?S.L.
DIA?5?FECHA?FACTURA
128,42
>>> ================================ RESTART ================================
>>>
30/06/15
20150483
CHIEMIVALL?SL
30?DIAS?FECHA?FACTURA
1.138,58
>>>
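The remaining step from the question, saving the values for Excel, was not covered in the answers. One minimal sketch, using only the standard library, is to write a CSV file, which Excel can open. Producing a native .xlsx file would need a third-party library and is not shown here. The rows are copied from the output above. The helper names (parse_amount, parse_date, clean, write_csv), the column names, and the reading of ? as a lost space are assumptions for illustration.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# Values copied from the example output above, in printed order:
# (due date, bill number, customer, payment terms, amount).
rows = [
    ("25/06/15", "20150480", "BABY?RACE?S.L.", "REMESA?DIA?25?FECHA?FACTURA", "15,23"),
    ("05/06/15", "20150481", "LOFT?CUINA,?S.L.", "DIA?5?FECHA?FACTURA", "91,79"),
    ("05/06/15", "20150482", "GRAFIQUES?MOGENT?S.L.", "DIA?5?FECHA?FACTURA", "128,42"),
    ("30/06/15", "20150483", "CHIEMIVALL?SL", "30?DIAS?FECHA?FACTURA", "1.138,58"),
]


def parse_amount(text):
    """Convert '1.138,58' (dot for thousands, comma for decimals) to Decimal."""
    return Decimal(text.replace(".", "").replace(",", "."))


def parse_date(text):
    """Convert a 'dd/mm/yy' string to a date object."""
    return datetime.strptime(text, "%d/%m/%y").date()


def clean(text):
    # Assumption: each "?" stands in for a space lost during conversion.
    return text.replace("?", " ").strip()


def write_csv(rows, handle):
    """Write one header row plus one CSV row per invoice to `handle`."""
    writer = csv.writer(handle)
    writer.writerow(["bill_number", "customer", "payment", "due_date", "amount"])
    for date, bill, customer, payment, amount in rows:
        writer.writerow([
            bill,
            clean(customer),
            clean(payment),
            parse_date(date).isoformat(),
            str(parse_amount(amount)),
        ])


buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To produce a real file, pass a file object opened with open(path, "w", newline="") instead of the StringIO buffer. Converting the amounts and dates before writing keeps them from being stored as plain text that has to be fixed by hand later.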