Extract columns of text from a pdf file using iText (Java)
Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/4028240/
Asked by Rim
I need to extract text from pdf files using iText.
The problem is: some pdf files contain 2 columns, and when I extract the text I get a text file in which the columns are merged (i.e. text from both columns ends up on the same line).
This is the code:
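import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;

// Imports assume iText 5.x package names
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class pdf
{
    private static String INPUTFILE = "http://www.revuemedecinetropicale.com/TAP_519-522_-_AO_07151GT_Rasoamananjara__ao.pdf" ;
    private static String OUTPUTFILE = "c:/new3.pdf";

    public static void main(String[] args) throws DocumentException, IOException {
        // Copy every page of the source PDF into a new document
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(OUTPUTFILE));
        document.open();
        PdfReader reader = new PdfReader(INPUTFILE);
        int n = reader.getNumberOfPages();
        PdfImportedPage page;
        // Go through all pages
        for (int i = 1; i <= n; i++) {
            page = writer.getImportedPage(reader, i);
            Image instance = Image.getInstance(page);
            document.add(instance);
        }
        document.close();

        // Extract the text of each copied page and append it to a text file
        PdfReader readerN = new PdfReader(OUTPUTFILE);
        for (int i = 1; i <= n; i++) {
            String myLine = PdfTextExtractor.getTextFromPage(readerN, i);
            System.out.println(myLine);
            try {
                FileWriter fw = new FileWriter("c:/yo.txt", true);
                fw.write(myLine);
                fw.close();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
    }
}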
Could you please help me with the task?
Accepted answer by Kevin Day
I am the author of the iText text extraction sub-system. What you need to do is develop your own text extraction strategy (if you look at how PdfTextExtractor.getTextFromPage is implemented, you will see that you can provide a pluggable strategy).
How you determine where columns start and stop is entirely up to you. This is a difficult problem: PDF doesn't have any concept of columns (heck, it doesn't even have a concept of words; just putting together the text extraction that the default strategy provides is quite tricky). If you know in advance where the columns are, then you can use a region filter on the text render listener callback (there is code in the iText library for doing this, and the latest version of the iText In Action book gives a detailed example).
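To make the region-filter idea concrete, here is a minimal, library-independent sketch. It assumes the extraction step has already produced text chunks with bounding boxes; the `Chunk` class, the `textInRegion` method, and all coordinates are hypothetical and are not part of iText. In iText 5 the corresponding filtering is provided by `RegionTextRenderFilter` wrapped in a `FilteredTextRenderListener`.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

class RegionFilterSketch {
    /** Hypothetical text chunk: a string plus its bounding box in page units. */
    static final class Chunk {
        final String text;
        final Rectangle2D box;

        Chunk(String text, double x, double y, double w, double h) {
            this.text = text;
            this.box = new Rectangle2D.Double(x, y, w, h);
        }
    }

    /** Joins, in input order, the chunks whose box intersects the region. */
    static String textInRegion(List<Chunk> chunks, Rectangle2D region) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (region.intersects(c.box)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two columns side by side; chunks arrive line by line, as in merged output
        List<Chunk> chunks = Arrays.asList(
            new Chunk("Left", 50, 700, 40, 10),
            new Chunk("Right", 320, 700, 45, 10),
            new Chunk("column", 50, 685, 60, 10),
            new Chunk("column", 320, 685, 60, 10));
        // Assumed bounds of the left column, known in advance
        Rectangle2D leftColumn = new Rectangle2D.Double(40, 0, 250, 800);
        System.out.println(textInRegion(chunks, leftColumn)); // prints "Left column"
    }
}
```

Running the same filter once per known column region gives one text stream per column instead of merged lines.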
If you need to obtain columns from arbitrary data, you've got some algorithm work ahead of you (if you get something working, I'd love to take a look). Some ideas on how to approach this:
- Use an algorithm similar to that used in the default text extraction strategy (LocationAware...) to obtain a list of words and X/Y locations (be sure to account for rotation angle as well)
- For each word, draw an imaginary line running the full height of the page. Scan for all other words that start at the same X position.
- While scanning, also look for words that intersect the X position (but do not start on the X position). This will give you potential location for column start/stop Y positions on the page.
- Once you have column X and Y, you can resort to a region filtered approach
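The scanning steps above can be sketched roughly as follows. This is only one possible reading of the idea, not iText code: the `Word` class, the `findColumnEdges` method, the tolerance, and the minimum number of aligned words are all assumptions, rotation is ignored, and the word boxes are taken as already extracted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class ColumnEdgeSketch {
    /** Hypothetical word box: left edge x, baseline y, width, and text. */
    static final class Word {
        final double x, y, width;
        final String text;

        Word(double x, double y, double width, String text) {
            this.x = x; this.y = y; this.width = width; this.text = text;
        }
    }

    /**
     * Returns the X positions where at least minStarts words start and where no
     * word lying between the topmost and bottommost of those starts crosses X.
     */
    static List<Double> findColumnEdges(List<Word> words, double tolerance, int minStarts) {
        TreeSet<Double> candidates = new TreeSet<Double>();
        for (Word w : words) {
            candidates.add(w.x);
        }
        List<Double> edges = new ArrayList<Double>();
        for (double x : candidates) {
            // Skip candidates that fall within tolerance of an accepted edge
            if (!edges.isEmpty() && x - edges.get(edges.size() - 1) <= tolerance) {
                continue;
            }
            // Imaginary vertical line at x: count the words that start on it
            int starts = 0;
            double minY = Double.MAX_VALUE;
            double maxY = -Double.MAX_VALUE;
            for (Word w : words) {
                if (Math.abs(w.x - x) <= tolerance) {
                    starts++;
                    minY = Math.min(minY, w.y);
                    maxY = Math.max(maxY, w.y);
                }
            }
            if (starts < minStarts) {
                continue;
            }
            // Reject the line if a word crosses it inside that vertical range
            boolean crossed = false;
            for (Word w : words) {
                boolean spansX = w.x < x - tolerance && w.x + w.width > x + tolerance;
                if (spansX && w.y >= minY && w.y <= maxY) {
                    crossed = true;
                }
            }
            if (!crossed) {
                edges.add(x);
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        List<Word> words = Arrays.asList(
            new Word(50, 730, 400, "Title"),   // spans both columns, above the body text
            new Word(50, 700, 40, "Malaria"), new Word(95, 700, 10, "is"),
            new Word(50, 685, 6, "a"), new Word(61, 685, 40, "disease"),
            new Word(50, 670, 12, "of"), new Word(67, 670, 38, "humans"),
            new Word(320, 700, 45, "Samples"), new Word(370, 700, 25, "were"),
            new Word(320, 685, 30, "taken"), new Word(355, 685, 10, "in"),
            new Word(320, 670, 28, "three"), new Word(353, 670, 26, "sites"));
        System.out.println(findColumnEdges(words, 1.0, 3)); // prints [50.0, 320.0]
    }
}
```

The title crosses X = 320 but lies above the words that start there, so it marks where the column begins rather than ruling the column out. A real page would need a more forgiving rule, for example tolerating a few crossings.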
Another approach that may be equally feasible would be to analyze draw operations and look for long horizontal and vertical lines (assuming the columns are demarcated in a table-like format). Right now, the iText content parser doesn't have callbacks for these operations, but it would be possible to add them without major difficulty.
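As a rough illustration of the ruling-line idea, the sketch below picks out long vertical lines from a list of straight segments. It assumes the segments have already been collected from the page's draw operations by some other means; the `Segment` class, the `verticalRuleXs` method, and the thresholds are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RulingLineSketch {
    /** Hypothetical straight line segment taken from a page's draw operations. */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }

        boolean isVertical(double eps) {
            return Math.abs(x1 - x2) <= eps;
        }

        double length() {
            return Math.hypot(x2 - x1, y2 - y1);
        }
    }

    /** Returns the sorted X positions of vertical segments at least minLength long. */
    static List<Double> verticalRuleXs(List<Segment> segments, double minLength, double eps) {
        List<Double> xs = new ArrayList<Double>();
        for (Segment s : segments) {
            if (s.isVertical(eps) && s.length() >= minLength) {
                xs.add((s.x1 + s.x2) / 2);
            }
        }
        Collections.sort(xs);
        return xs;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
            new Segment(300, 100, 300, 700),  // long vertical rule between two columns
            new Segment(120, 400, 120, 410),  // short tick, ignored
            new Segment(50, 720, 550, 720));  // horizontal rule, ignored
        System.out.println(verticalRuleXs(segments, 200, 0.5)); // prints [300.0]
    }
}
```

Each X position found this way is a candidate column boundary that could feed a region-filtered extraction.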
Answer by Andrew Cash
The file you are extracting from is pretty complex for data extraction purposes. There are tables, images, and multiple columns. You will need special algorithms to determine the reading order and also to process the table data.
What are you trying to achieve here? You could use a commercial OCR engine instead, let it do all the hard work, and then process the data from there.
Answer by mark stephens
Tables do not exist as structures in PDF unless the file uses Structured content. Do you understand what a PDF file is? I wrote a blog article explaining the issues of text extraction at http://www.jpedal.org/PDFblog/?p=228
Answer by mark stephens
You could also try PdfBox, but it all goes back to the lack of structure in PDF: it's primarily an end-file output format for display.
Answer by PhDeveloper
I know my answer is a bit late, but I'm using the following code to read certain pages from PDF files. I didn't have any problem reading columns: there is no merged text, and each column is printed separately from the other.
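import java.io.IOException;

// Imports assume iText 5.x package names
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;

/**
 * Get plain text from a specific page in a pdf file.
 * @param pdfPath path to the pdf file
 * @param pageNumber 1-based number of the page to read
 * @return the text of that page
 * @throws IOException if the file cannot be read
 */
public static String getPageContent(String pdfPath, int pageNumber) throws IOException
{
    PdfReader reader = new PdfReader(pdfPath);
    try {
        return PdfTextExtractor.getTextFromPage(reader, pageNumber, new SimpleTextExtractionStrategy());
    } finally {
        // Release the file even if extraction fails
        reader.close();
    }
}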
If you want to extract only part of a page, say a single column, you need the dimensions of that column. It's still a bit tricky, but you might be able to work them out if you already know the beginning text of the column (as a way to estimate its width and height). This can be done with a rectangular region. See the code below, which uses PDFBox, and sorry if I got the point measurement wrong. In the code below I try to get the whole page dimension.
import java.awt.Rectangle;
import java.io.IOException;
import java.util.List;

// Imports assume PDFBox 1.x package names
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public static String getPageContent(String pdfPath, int pageNumber) throws IOException
{
    PDDocument pdDoc = PDDocument.load(pdfPath);
    try {
        List allPages = pdDoc.getDocumentCatalog().getAllPages();
        PDPage page = (PDPage) allPages.get(pageNumber - 1);
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        // The media box is taken to be in points already, so no unit conversion is applied
        float width = page.getMediaBox().getWidth();
        float height = page.getMediaBox().getHeight();
        // Region covering the whole page; shrink it to a column's bounds to read one column
        Rectangle rect = new Rectangle(0, 0, Math.round(width), Math.round(height));
        stripper.addRegion("class1", rect);
        stripper.extractRegions(page);
        return stripper.getTextForRegion("class1");
    } finally {
        pdDoc.close();
    }
}
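To narrow a whole-page rectangle down to one column, a small helper along these lines may be enough. It is a sketch under stated assumptions: the page is split into equal-width columns, the margin and gutter widths are guesses you would tune per document, the origin is at the top left as in java.awt, and `columnRegion` and `pointsToMillimetres` are hypothetical names. Page sizes are in points, where 1 point is 1/72 inch.

```java
import java.awt.Rectangle;

class ColumnRegionSketch {
    /** Converts PDF points (1/72 inch) to millimetres. */
    static double pointsToMillimetres(double points) {
        return points * 25.4 / 72.0;
    }

    /**
     * Rectangle, in points, of the column with the given 0-based index when the
     * page is split into equal columns separated by a gutter.
     */
    static Rectangle columnRegion(float pageWidth, float pageHeight,
                                  int columns, int index, float margin, float gutter) {
        float usable = pageWidth - 2 * margin - (columns - 1) * gutter;
        float columnWidth = usable / columns;
        float x = margin + index * (columnWidth + gutter);
        // Full page height is used; vertical margins are left to the caller
        return new Rectangle(Math.round(x), 0, Math.round(columnWidth), Math.round(pageHeight));
    }

    public static void main(String[] args) {
        // A4 page (595 x 842 points), two columns, assumed 40 pt margins and 15 pt gutter
        Rectangle left = columnRegion(595f, 842f, 2, 0, 40f, 15f);
        Rectangle right = columnRegion(595f, 842f, 2, 1, 40f, 15f);
        System.out.println(left);
        System.out.println(right);
        System.out.println(pointsToMillimetres(595) + " mm wide");
    }
}
```

Each rectangle can then be registered as its own named region, so the two columns come back as separate strings.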
Answer by Darpan27
PDFTextStream is the one! At least I am able to identify the column values. Earlier, I was using iText and got stuck defining a strategy. It's hard.
This API separates column cells by inserting extra spaces. The spacing is fixed, so you can build your own logic on top of it (this was missing in iText).
import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class PDFText {
    public static void main(String[] args) throws java.io.IOException {
        String pdfFilePath = "xyz.pdf";
        Document pdf = PDF.open(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        // Pipe all extracted text into the StringBuilder
        pdf.pipe(new OutputTarget(text));
        pdf.close();
        System.out.println(text);
    }
}
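The "logic" mentioned above could be as simple as splitting each extracted line on runs of spaces. The sketch below assumes cells are separated by at least three consecutive spaces; that threshold, the `splitCells` name, and the sample line are assumptions for illustration, not documented PDFTextStream behaviour.

```java
import java.util.ArrayList;
import java.util.List;

class SpaceSplitSketch {
    /** Splits a line into cells wherever at least minSpaces spaces occur in a row. */
    static List<String> splitCells(String line, int minSpaces) {
        List<String> cells = new ArrayList<String>();
        // Regex such as " {3,}" matches three or more consecutive spaces
        for (String cell : line.trim().split(" {" + minSpaces + ",}")) {
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        String line = "Plasmodium falciparum     12     34.5";
        System.out.println(splitCells(line, 3)); // prints [Plasmodium falciparum, 12, 34.5]
    }
}
```

Single spaces inside a cell are preserved, so multi-word values stay together.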