Parsing PDF files (especially with tables) with PDFBox
Note: this page is taken from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address and author information, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/3203790/
Asked by Matheus Moreira
I need to parse a PDF file which contains tabular data. I'm using PDFBox to extract the file text as a String, which I then parse. The problem is that the text extraction doesn't work as I expected for tabular data. For example, I have a file which contains a table like this (7 columns: the first two always have data, only one Complexity column has data, only one Financing column has data):
+-----+-------+--------+------+----------------+-----------+------+
| AIH | Value | Complexity                     | Financing        |
|     |       | Medium | High | Not applicable | MAC/Other | FAE  |
+-----+-------+--------+------+----------------+-----------+------+
| xyz | 12.43 | 12.34  |      |                | 12.34     |      |
+-----+-------+--------+------+----------------+-----------+------+
| abc | 1.56  |        | 1.56 |                |           | 1.56 |
+-----+-------+--------+------+----------------+-----------+------+
Then I use PDFBox:
PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
Those two lines of data would be extracted like this:
xyz 12.43 12.4312.43
abc 1.56 1.561.56
There are no white spaces between the last two numbers, but this is not the biggest problem. The problem is that I don't know what the last two numbers mean: Medium, High, Not applicable? MAC/Other, FAE? I don't have the relation between the numbers and their columns.
It is not required for me to use the PDFBox library, so a solution that uses another library is fine. What I want is to be able to parse the file and know what each parsed number means.
Answer by Paul Sanwald
Answer by Todd Owen
Extracting data from PDF is bound to be fraught with problems. Are the documents created through some kind of automatic process? If so, you might consider converting the PDFs to uncompressed PostScript (try pdf2ps) and seeing if the PostScript contains some sort of regular pattern which you can exploit.
Answer by Carl Smotricz
How about printing to image and doing OCR on that?
Sounds terribly inefficient, but it's practically the very purpose of PDF to make text inaccessible, so you gotta do what you gotta do.
Answer by purecharger
You will need to devise an algorithm to extract the data in a usable format. Regardless of which PDF library you use, you will need to do this. Characters and graphics are drawn by a series of stateful drawing operations, i.e. move to this position on the screen and draw the glyph for character 'c'.
I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions for your table. Then it's a simple matter of setting up text regions and determining which numbers/letters/characters are drawn in which region. Since you know the layout of the regions, you'll be able to tell which column the extracted text belongs to.
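The region lookup described above can be sketched without any PDF library. This is only an illustration: it assumes you have already collected the x-coordinates of the table's vertical rules (for example from an overridden strokePath), and the class name ColumnLocator, its method, and the sample coordinates are hypothetical, not part of PDFBox.

```java
import java.util.Arrays;

/** Hypothetical helper: maps an x-coordinate to a table column index. */
final class ColumnLocator {
    private final double[] edges; // sorted x-positions of the vertical rules

    ColumnLocator(double... verticalRuleXs) {
        edges = verticalRuleXs.clone();
        Arrays.sort(edges);
    }

    /** Returns the 0-based column containing x, or -1 if x lies outside the table. */
    int columnOf(double x) {
        if (edges.length < 2 || x < edges[0] || x >= edges[edges.length - 1]) {
            return -1;
        }
        int pos = Arrays.binarySearch(edges, x);
        // When x is not exactly on a rule, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        // Made-up rule positions for a 7-column table (8 vertical lines).
        ColumnLocator locator = new ColumnLocator(50, 90, 140, 200, 250, 340, 420, 470);
        System.out.println("x=150 falls in column " + locator.columnOf(150));
    }
}
```

The same lookup applied to the horizontal rules and a glyph's y-coordinate would give the row.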
Also, the reason you may not have spaces between text that is visually separated is that very often a space character is not drawn by the PDF. Instead, the text matrix is updated and a 'move' drawing command is issued so that the next character is drawn a "space width" apart from the last one.
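To make the missing-space effect concrete, here is a small sketch that rebuilds a line from positioned glyphs and inserts a space wherever the horizontal gap exceeds a threshold. The GapSpacer class, its Glyph type, the threshold, and the sample numbers are all assumptions for illustration; a real text stripper applies its own heuristics.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: re-inserts spaces based on horizontal gaps between glyphs. */
final class GapSpacer {
    /** A piece of drawn text with its left edge and width, in points. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Joins glyphs (assumed sorted left to right), adding a space when the gap exceeds minGap. */
    static String join(List<Glyph> glyphs, double minGap) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && g.x - (prev.x + prev.width) > minGap) {
                sb.append(' ');
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two numbers drawn far apart with no space character between them.
        List<Glyph> line = Arrays.asList(new Glyph("12.43", 100, 25), new Glyph("12.43", 200, 25));
        System.out.println(join(line, 2.0));
    }
}
```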
Good luck.
Answer by kaushalc
http://swftools.org/ These guys have a pdf2swf component, and it is able to show tables. They also publish the source, so you could check it out.
Answer by impeto
It may be too late for my answer, but I think this is not that hard. You can extend the PDFTextStripper class and override the writePage() and processTextPosition(...) methods. In your case I assume that the column headers are always the same. That means that you know the x-coordinate of each column heading and you can compare the x-coordinate of the numbers to those of the column headings. If they are close enough (you have to test to decide how close) then you can say that that number belongs to that column.
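The "close enough" comparison can be sketched as a nearest-heading lookup with a tolerance. The HeaderMatcher class, the heading positions, and the tolerance are hypothetical; in practice the x values would come from the TextPosition objects seen in processTextPosition.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: assigns a value to the column heading with the nearest x-coordinate. */
final class HeaderMatcher {
    /** Returns the heading closest to x within the tolerance, or null if none is close enough. */
    static String columnFor(double x, Map<String, Double> headerXs, double tolerance) {
        String best = null;
        double bestDistance = tolerance;
        for (Map.Entry<String, Double> entry : headerXs.entrySet()) {
            double distance = Math.abs(entry.getValue() - x);
            if (distance <= bestDistance) {
                best = entry.getKey();
                bestDistance = distance;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up x-coordinates of three column headings.
        Map<String, Double> headers = new LinkedHashMap<>();
        headers.put("Medium", 200.0);
        headers.put("High", 260.0);
        headers.put("FAE", 480.0);
        System.out.println(columnFor(203.0, headers, 10.0));
    }
}
```

Whether to compare left edges, right edges, or centers depends on how the table aligns its numbers, which is something you would have to check against your own files.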
Another approach would be to intercept the "charactersByArticle" Vector after each page is written:
@Override
public void writePage() throws IOException {
    super.writePage();
    final Vector<List<TextPosition>> pageText = getCharactersByArticle();
    // now you have all the characters on that page
    // to do what you want with them
}
Knowing your columns, you can do your comparison of the x-coordinates to decide what column every number belongs to.
The reason you don't have any spaces between numbers is that you have to set the word separator string yourself (see setWordSeparator on PDFTextStripper).
I hope this is useful to you or to others who might be trying similar things.
Answer by scott
I've had decent success with parsing text files generated by the pdftotext utility (sudo apt-get install poppler-utils).
File convertPdf() throws Exception {
    File pdf = new File("mypdf.pdf");
    String outfile = "mytxt.txt";
    String proc = "/usr/bin/pdftotext";
    // -layout asks pdftotext to keep the physical layout of the page
    ProcessBuilder pb = new ProcessBuilder(proc, "-layout", pdf.getAbsolutePath(), outfile);
    Process p = pb.start();
    p.waitFor();
    return new File(outfile);
}
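Once the text file keeps the page layout, one way to recover the columns is by character offset. The sketch below is an assumption about how such a file might be parsed, not part of pdftotext: it takes the offsets where known header labels start and puts each token of a data line into the column whose range contains the token's midpoint. The class name LayoutColumns and the sample lines are made up.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: splits layout-preserved text lines into columns by character offset. */
final class LayoutColumns {
    /** Finds the character offset at which each label starts in the header line, left to right. */
    static int[] headerStarts(String headerLine, String... labels) {
        int[] starts = new int[labels.length];
        int from = 0;
        for (int i = 0; i < labels.length; i++) {
            starts[i] = headerLine.indexOf(labels[i], from);
            if (starts[i] < 0) {
                throw new IllegalArgumentException("Missing header: " + labels[i]);
            }
            from = starts[i] + labels[i].length();
        }
        return starts;
    }

    /** Puts each whitespace-separated token into the last column starting at or before its midpoint. */
    static String[] split(String dataLine, int[] starts) {
        String[] cells = new String[starts.length];
        Arrays.fill(cells, "");
        Matcher m = Pattern.compile("\\S+").matcher(dataLine);
        while (m.find()) {
            int mid = (m.start() + m.end() - 1) / 2;
            int col = 0;
            for (int i = 0; i < starts.length; i++) {
                if (mid >= starts[i]) {
                    col = i;
                }
            }
            cells[col] = cells[col].isEmpty() ? m.group() : cells[col] + " " + m.group();
        }
        return cells;
    }

    public static void main(String[] args) {
        String format = "%-6s%-8s%-9s%-7s%s";
        String header = String.format(format, "AIH", "Value", "Medium", "High", "FAE");
        String data = String.format(format, "abc", "1.56", "", "1.56", "1.56");
        int[] starts = headerStarts(header, "AIH", "Value", "Medium", "High", "FAE");
        System.out.println(Arrays.toString(split(data, starts)));
    }
}
```

This only works when values sit roughly under their headers; right-aligned or centered columns may need the ranges shifted.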
Answer by Emerson Farrugia
You can extract text by area in PDFBox. See the ExtractByArea.java example file, in the pdfbox-examples artifact if you're using Maven. A snippet looks like
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
Rectangle rect = new Rectangle( 464, 59, 55, 5);
stripper.addRegion( "class1", rect );
stripper.extractRegions( page );
String string = stripper.getTextForRegion( "class1" );
The problem is getting the coordinates in the first place. I've had success extending the normal PDFTextStripper, overriding processTextPosition(TextPosition text), printing out the coordinates for each character, and figuring out where in the document they are.
But there's a much simpler way, at least if you're on a Mac. Open the PDF in Preview, press ⌘I to show the Inspector, choose the Crop tab and make sure the units are in Points, then choose Rectangular Selection from the Tools menu and select the area of interest. Once you select an area, the Inspector shows its coordinates, which you can round and feed into the Rectangle constructor arguments. You just need to confirm where the origin is, using the first method.
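If it turns out that your measuring tool reports y from the bottom of the page while the region you pass to the stripper is measured from the top, the conversion is a single subtraction. The sketch below assumes exactly that mismatch; the class name OriginFlip and the sample numbers are hypothetical, and 792 points is simply the height of a US Letter page.

```java
import java.awt.Rectangle;

/** Hypothetical sketch: converts a bottom-left-origin selection to a top-left-origin Rectangle. */
final class OriginFlip {
    /** yFromBottom is the distance from the page bottom to the lower edge of the selection. */
    static Rectangle toTopLeft(double x, double yFromBottom, double width, double height,
                               double pageHeight) {
        double yFromTop = pageHeight - yFromBottom - height;
        return new Rectangle((int) Math.round(x), (int) Math.round(yFromTop),
                (int) Math.round(width), (int) Math.round(height));
    }

    public static void main(String[] args) {
        // A 55 x 5 point selection whose lower edge is 728 points above the page bottom.
        System.out.println(toTopLeft(464, 728, 55, 5, 792));
    }
}
```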
Answer by Abhishek Yadav
To read the content of a table from a PDF file, you only have to convert the PDF into text using any API (I used PdfTextExtractor.getTextFromPage() from iText) and then read that text in your Java program. Once you have the text, the major task is done; you just have to filter out the data you need. You can do that by repeatedly using the split method of the String class until you find the record you are interested in. Here is my code, which extracts part of the records from a PDF file and writes them into a .CSV file. The URL of the PDF file is http://www.cea.nic.in/reports/monthly/generation_rep/actual/jan13/opm_02.pdf
Code:
public static void genrateCsvMonth_Region(String pdfpath, String csvpath) {
    try {
        String line = null;
        // Opening in append mode creates the CSV file if it does not exist yet..
        BufferedWriter writer1 = new BufferedWriter(new FileWriter(csvpath,
                true));
        writer1.close();
        // Checking whether file is empty or not; write the header only if it is..
        BufferedReader br = new BufferedReader(new FileReader(csvpath));
        if ((line = br.readLine()) == null) {
            BufferedWriter writer = new BufferedWriter(new FileWriter(
                    csvpath, true));
            writer.append("REGION,");
            writer.append("YEAR,");
            writer.append("MONTH,");
            writer.append("THERMAL,");
            writer.append("NUCLEAR,");
            writer.append("HYDRO,");
            writer.append("TOTAL\n");
            writer.close();
        }
        br.close();
        // Reading the pdf file..
        PdfReader reader = new PdfReader(pdfpath);
        BufferedWriter writer = new BufferedWriter(new FileWriter(csvpath,
                true));
        // Extracting records from page into String..
        String page = PdfTextExtractor.getTextFromPage(reader, 1);
        // Extracting month and Year from String..
        String period1[] = page.split("PEROID");
        String period2[] = period1[0].split(":");
        String month[] = period2[1].split("-");
        String period3[] = month[1].split("ENERGY");
        String year[] = period3[0].split("VIS");
        // Extracting Northen region
        String northen[] = page.split("NORTHEN REGION");
        String nthermal1[] = northen[0].split("THERMAL");
        String nthermal2[] = nthermal1[1].split(" ");
        String nnuclear1[] = northen[0].split("NUCLEAR");
        String nnuclear2[] = nnuclear1[1].split(" ");
        String nhydro1[] = northen[0].split("HYDRO");
        String nhydro2[] = nhydro1[1].split(" ");
        String ntotal1[] = northen[0].split("TOTAL");
        String ntotal2[] = ntotal1[1].split(" ");
        // Appending filtered data into CSV file..
        writer.append("NORTHEN" + ",");
        writer.append(year[0] + ",");
        writer.append(month[0] + ",");
        writer.append(nthermal2[4] + ",");
        writer.append(nnuclear2[4] + ",");
        writer.append(nhydro2[4] + ",");
        writer.append(ntotal2[4] + "\n");
        // Extracting Western region
        String western[] = page.split("WESTERN");
        String wthermal1[] = western[1].split("THERMAL");
        String wthermal2[] = wthermal1[1].split(" ");
        String wnuclear1[] = western[1].split("NUCLEAR");
        String wnuclear2[] = wnuclear1[1].split(" ");
        String whydro1[] = western[1].split("HYDRO");
        String whydro2[] = whydro1[1].split(" ");
        String wtotal1[] = western[1].split("TOTAL");
        String wtotal2[] = wtotal1[1].split(" ");
        // Appending filtered data into CSV file..
        writer.append("WESTERN" + ",");
        writer.append(year[0] + ",");
        writer.append(month[0] + ",");
        writer.append(wthermal2[4] + ",");
        writer.append(wnuclear2[4] + ",");
        writer.append(whydro2[4] + ",");
        writer.append(wtotal2[4] + "\n");
        // Extracting Southern Region
        String southern[] = page.split("SOUTHERN");
        String sthermal1[] = southern[1].split("THERMAL");
        String sthermal2[] = sthermal1[1].split(" ");
        String snuclear1[] = southern[1].split("NUCLEAR");
        String snuclear2[] = snuclear1[1].split(" ");
        String shydro1[] = southern[1].split("HYDRO");
        String shydro2[] = shydro1[1].split(" ");
        String stotal1[] = southern[1].split("TOTAL");
        String stotal2[] = stotal1[1].split(" ");
        // Appending filtered data into CSV file..
        writer.append("SOUTHERN" + ",");
        writer.append(year[0] + ",");
        writer.append(month[0] + ",");
        writer.append(sthermal2[4] + ",");
        writer.append(snuclear2[4] + ",");
        writer.append(shydro2[4] + ",");
        writer.append(stotal2[4] + "\n");
        // Extracting eastern region (no NUCLEAR value, so that cell is left blank)
        String eastern[] = page.split("EASTERN");
        String ethermal1[] = eastern[1].split("THERMAL");
        String ethermal2[] = ethermal1[1].split(" ");
        String ehydro1[] = eastern[1].split("HYDRO");
        String ehydro2[] = ehydro1[1].split(" ");
        String etotal1[] = eastern[1].split("TOTAL");
        String etotal2[] = etotal1[1].split(" ");
        // Appending filtered data into CSV file..
        writer.append("EASTERN" + ",");
        writer.append(year[0] + ",");
        writer.append(month[0] + ",");
        writer.append(ethermal2[4] + ",");
        writer.append(" " + ",");
        writer.append(ehydro2[4] + ",");
        writer.append(etotal2[4] + "\n");
        // Extracting northernEastern region (no NUCLEAR value here either)
        String neestern[] = page.split("NORTH");
        String nethermal1[] = neestern[2].split("THERMAL");
        String nethermal2[] = nethermal1[1].split(" ");
        String nehydro1[] = neestern[2].split("HYDRO");
        String nehydro2[] = nehydro1[1].split(" ");
        String netotal1[] = neestern[2].split("TOTAL");
        String netotal2[] = netotal1[1].split(" ");
        writer.append("NORTH EASTERN" + ",");
        writer.append(year[0] + ",");
        writer.append(month[0] + ",");
        writer.append(nethermal2[4] + ",");
        writer.append(" " + ",");
        writer.append(nehydro2[4] + ",");
        writer.append(netotal2[4] + "\n");
        writer.close();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
}
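The code above repeats the same "split on a label, then split on spaces, then take a fixed index" step many times. As a hedged sketch only, that step could be pulled into one helper; the class LabelValue and its method are hypothetical, and the helper takes everything after the first occurrence of the label, which matches the split-based version only for simple text where the label is a plain string.

```java
/** Hypothetical sketch: picks the n-th space-separated token that follows a label. */
final class LabelValue {
    /** Returns the token at tokenIndex after the first occurrence of label, or null if absent. */
    static String tokenAfter(String text, String label, int tokenIndex) {
        int at = text.indexOf(label);
        if (at < 0 || tokenIndex < 0) {
            return null;
        }
        String[] tokens = text.substring(at + label.length()).split(" ");
        return tokenIndex < tokens.length ? tokens[tokenIndex] : null;
    }

    public static void main(String[] args) {
        // Made-up line: the text right after the label starts with a space,
        // so token 0 is empty and the fourth number sits at index 4.
        String line = "THERMAL 10.5 20.5 30.5 40.5 50.5";
        System.out.println(tokenAfter(line, "THERMAL", 4));
    }
}
```

Returning null instead of throwing makes it easier to leave a cell blank when a region has no value for a given label.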
Answer by manu
I had the same problem reading a PDF file in which the data is in tabular format. After a regular parse using PDFBox, each row was extracted with a comma as the separator, losing the columnar position. To resolve this I used PDFTextStripperByArea and, using coordinates, extracted the data column by column for each row. This assumes that you have a fixed-format PDF.
File file = new File("fileName.pdf");
PDDocument document = PDDocument.load(file);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
// One rectangle per table cell: x, y, width, height
Rectangle rect1 = new Rectangle( 50, 140, 60, 20 );
Rectangle rect2 = new Rectangle( 110, 140, 20, 20 );
stripper.addRegion( "row1column1", rect1 );
stripper.addRegion( "row1column2", rect2 );
List allPages = document.getDocumentCatalog().getAllPages();
// Note: index 2 is the third page of the document, despite the variable name
PDPage firstPage = (PDPage)allPages.get( 2 );
stripper.extractRegions( firstPage );
System.out.println(stripper.getTextForRegion( "row1column1" ));
System.out.println(stripper.getTextForRegion( "row1column2" ));
document.close();
Then row 2 and so on...
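For a fixed-format table, the rectangles for the remaining rows can be generated rather than typed by hand. The sketch below assumes every row has the same height and every column keeps the same x and width; the class RegionGrid and its parameters are hypothetical, and the sample numbers just reuse the two rectangles from the snippet above.

```java
import java.awt.Rectangle;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: builds named cell rectangles for a table with a fixed layout. */
final class RegionGrid {
    /** Names follow the "row1column1" pattern; rows are stacked downwards from top. */
    static Map<String, Rectangle> build(int top, int rowHeight, int rows,
                                        int[] columnX, int[] columnWidth) {
        Map<String, Rectangle> regions = new LinkedHashMap<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < columnX.length; c++) {
                String name = "row" + (r + 1) + "column" + (c + 1);
                regions.put(name,
                        new Rectangle(columnX[c], top + r * rowHeight, columnWidth[c], rowHeight));
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        Map<String, Rectangle> regions =
                build(140, 20, 2, new int[] {50, 110}, new int[] {60, 20});
        for (Map.Entry<String, Rectangle> e : regions.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Each entry could then be registered with stripper.addRegion(name, rectangle) before calling extractRegions on the page.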