Parse a PDF using Python
Notice: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not the translator): Stack Overflow
Original: http://stackoverflow.com/questions/18755412/
Asked by IcyFlame
I have a PDF file. It consists of four columns, and none of the pages have grid lines. It holds the marks of students.
I would like to run some analysis on this distribution (histograms, line graphs, etc.).
I want to parse this PDF file into a spreadsheet or an HTML file (which I can then parse very easily).
The link to the PDF is:
This is a public document and is openly available to anyone on this domain.
Note: I know that this can be done by exporting the file to text from Adobe Reader and then importing it into LibreOffice Calc or Excel, but I want to do this using a Python script.
Kindly help me with this issue.
Specs: Windows 7, Python 2.7
Accepted answer by Burhan Khalid
Use PyPDF2:
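from PyPDF2 import PdfFileReader

# PyPDF2 1.x/2.x API (3.0 renamed these to PdfReader, .pages and extract_text).
# Open the PDF in binary mode and read the text of the first page.
with open('CT1-All.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    # extractText() returns one string; split it into a list of lines
    contents = reader.getPage(0).extractText().split('\n')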
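The strings in contents come back with the four columns run together, so getting to a spreadsheet takes one more step. Below is a minimal sketch of that step, not part of the original answer. It assumes each string holds exactly one record laid out as serial number, roll number (two digits, two capital letters, five digits, as in 12MI31004), name in capitals, and marks. That layout is inferred from the sample output, and split_record is a hypothetical helper name. Strings that fuse several records together, or the header line, do not fit the pattern and come back as None, so they would need separate handling.

```python
import re

# Assumed layout (inferred from the sample output, not confirmed by the answer):
# serial number, roll number (2 digits + 2 letters + 5 digits), name, marks.
RECORD = re.compile(
    r'^(\d+?)'                # serial number, as short as possible
    r'(\d{2}[A-Z]{2}\d{5})'   # roll number such as 12MI31004
    r'([A-Z][A-Z .]*?)'       # name in capitals
    r'(\d+(?:\.\d+)?)?$'      # marks, possibly missing
)


def split_record(line):
    """Return (serial, roll, name, marks), or None if the line does not fit."""
    match = RECORD.match(line.strip())
    if match is None:
        return None
    serial, roll, name, marks = match.groups()
    return (int(serial), roll, name.strip(), float(marks) if marks else None)


print(split_record('312MI31004DEEPAK KUMAR7'))
# (3, '12MI31004', 'DEEPAK KUMAR', 7.0)
```

Rows produced this way can be passed to csv.writer to get a file that Excel or LibreOffice Calc opens directly.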
When you print contents, it will look like this (I have trimmed it here):
[u'Serial NoRoll NoNameCT1 Marks (50)111MA20026KARADI KALYANI212AR10029MUKESH K
MAR5', u'312MI31004DEEPAK KUMAR7', u'413AE10008FADKE PRASAD DIPAK27', u'513AE10
22RAHUL DUHAN37', u'613AE30005HIMANSHU PRABHAT26.5', u'713AE30019VISHAL KUMAR39
, u'813AG10014HEMANT17', u'913AG10028SHRESTH KR KRISHNA37.51013AG30009HITESH ME
RA33.5', u'1113AG30023RACHIT MADHUKAR40.5', u'1213AR10002ACHARY SUDHEER11', u'1
13AR10004AMAN ASHISH20.5', u'1413AR10008ANKUR44', u'1513AR10010CHUKKA SHALEM RA
U11.5', u'1613AR10012DIKKALA VIJAYA RAGHAVA20.5', u'1713AR10014HRISHABH AMRODIA
1', u'1813AR10016JAPNEET SINGH CHAHAL19.5', u'1913AR10018K VIGNESH42.5', u'2013
R10020KAARTIKEY DWIVEDI49.5', u'2113AR10024LAKSHMISRI KEERTI MANNEY49', u'2213A
10026MAJJI DINESH9.5', u'2313AR10028MOUNIKA BHUKYA17.5', u'2413AR10030PARAS PRA