Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/13632541/

Get the exact String position in PDF

java pdf

Asked by Fendrix

I tried to read a stream, hoping to get the exact position (coordinates) of each string:

    // Walk every object in the cross-reference table and dump each Flate-encoded stream.
    int size = reader.getXrefSize();

    for (int i = 0; i < size; ++i)
    {
        PdfObject pdfObject = reader.getPdfObject(i);
        if ((pdfObject == null) || !pdfObject.isStream())
            continue;

        PdfStream stream = (PdfStream) pdfObject;
        PdfObject obj = stream.get(PdfName.FILTER);

        // Only a single /FlateDecode filter is handled here, not filter arrays.
        if ((obj != null) && obj.toString().equals(PdfName.FLATEDECODE.toString()))
        {
            byte[] codedText = PdfReader.getStreamBytesRaw((PRStream) stream);
            byte[] text = PdfReader.FlateDecode(codedText);
            // try-with-resources closes the file even if write() throws
            try (FileOutputStream o = new FileOutputStream(new File("/home..../Text" + i + ".txt")))
            {
                o.write(text);
            }
        }

    }

What I actually got were positioning operators like this:

......
BT                  
70.9 800.9 Td /F1 14 Tf <01> Tj 
10.1 0 Td <02> Tj               
9.3 0 Td <03> Tj
3.9 0 Td <01> Tj
10.1 0 Td <0405> Tj
18.7 0 Td <060607> Tj
21 0 Td <08090A07> Tj
24.9 0 Td <05> Tj
10.1 0 Td <0B0C0D> Tj
28.8 0 Td <0E> Tj
3.8 0 Td <0F> Tj
8.6 0 Td <090B1007> Tj
29.5 0 Td <0B11> Tj
16.4 0 Td <12> Tj
7.8 0 Td <1307> Tj
12.4 0 Td <14> Tj
7.8 0 Td <07> Tj
3.9 0 Td <15> Tj
7.8 0 Td <16> Tj
7.8 0 Td <07> Tj
3.9 0 Td <17> Tj
10.8 0 Td <0D> Tj
7.8 0 Td <18> Tj
10.9 0 Td <19> Tj
ET
.....

But I don't know which string belongs to which position. On the other hand, in iText I could just get the plain text with

PdfReader reader = new PdfReader(new FileInputStream("/home/....xxx.pdf"));
PdfTextExtractor extract = new PdfTextExtractor(reader);

but of course that comes without any position information at all.

So how can I get the exact position of each piece of text (string, char, ...)?

Answered by mkl

As plinth and David van Driessche already pointed out in their answers, text extraction from PDF files is non-trivial. Fortunately, the classes in the parser package of iText do most of the heavy lifting for you. You have already found at least one class from that package, PdfTextExtractor, but this class is essentially a convenience utility for using the parser functionality of iText if you're only interested in the plain text of the page. In your case you have to look at the classes in that package more closely.

A starting point for information on text extraction with iText is section 15.3, Parsing PDFs, of iText in Action, 2nd Edition, especially the method extractText of the sample ParsingHelloWorld.java:

public void extractText(String src, String dest) throws IOException
{
    PrintWriter out = new PrintWriter(new FileOutputStream(dest));
    PdfReader reader = new PdfReader(src);
    RenderListener listener = new MyTextRenderListener(out);
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
    PdfDictionary pageDic = reader.getPageN(1);
    PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
    processor.processContent(ContentByteUtils.getContentBytesForPage(reader, 1), resourcesDic);
    out.flush();
    out.close();
}

which makes use of the RenderListener implementation MyTextRenderListener.java:

public class MyTextRenderListener implements RenderListener
{
    [...]

    /**
     * @see RenderListener#renderText(TextRenderInfo)
     */
    public void renderText(TextRenderInfo renderInfo) {
        out.print("<");
        out.print(renderInfo.getText());
        out.print(">");
    }
}

While this RenderListener implementation merely outputs the text, the TextRenderInfo object it inspects offers far more information:

public LineSegment getBaseline();    // the baseline for the text (i.e. the line that the text 'sits' on)
public LineSegment getAscentLine();  // the ascent line for the text (i.e. the line that represents the topmost extent that a string of the current font could have)
public LineSegment getDescentLine(); // the descent line for the text (i.e. the line that represents the bottommost extent that a string of the current font could have)
public float getRise();              // the rise, which represents how far above the nominal baseline the text should be rendered

public String getText();             // the text to render
public int getTextRenderMode();      // the text render mode
public DocumentFont getFont();       // the font
public float getSingleSpaceWidth();  // the width, in user space units, of a single space character in the current font

public List<TextRenderInfo> getCharacterRenderInfos(); // details useful if a listener needs access to the position of each individual glyph in the text render operation

Thus, if your RenderListener, in addition to inspecting the text with getText(), also considers getBaseline() or even getAscentLine() and getDescentLine(), you have all the coordinates you will likely need.
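
As a rough illustration of how those lines turn into a rectangle, here is a minimal sketch that runs without iText on the classpath. Point and Segment are hypothetical stand-ins for iText's Vector and LineSegment, and the coordinates in main are made up. With iText 5 the points would come from calls such as getDescentLine().getStartPoint() and getAscentLine().getEndPoint(), read via get(Vector.I1) and get(Vector.I2).

```java
public class TextBoxSketch {

    /** Hypothetical stand-in for a 2D point (iText 5 uses Vector). */
    public static final class Point {
        public final double x;
        public final double y;

        public Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    /** Hypothetical stand-in for iText's LineSegment: a start and an end point. */
    public static final class Segment {
        public final Point start;
        public final Point end;

        public Segment(Point start, Point end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns {left, bottom, right, top} of the axis-aligned box spanned by
     * the descent line and the ascent line of one piece of text.
     */
    public static double[] boundingBox(Segment descentLine, Segment ascentLine) {
        Point[] corners = {
            descentLine.start, descentLine.end, ascentLine.start, ascentLine.end
        };
        double left = Double.POSITIVE_INFINITY;
        double bottom = Double.POSITIVE_INFINITY;
        double right = Double.NEGATIVE_INFINITY;
        double top = Double.NEGATIVE_INFINITY;
        for (Point p : corners) {
            left = Math.min(left, p.x);
            right = Math.max(right, p.x);
            bottom = Math.min(bottom, p.y);
            top = Math.max(top, p.y);
        }
        return new double[] { left, bottom, right, top };
    }

    public static void main(String[] args) {
        // Made-up coordinates for a single glyph, not taken from a real PDF.
        Segment descent = new Segment(new Point(70.9, 797.5), new Point(81.0, 797.5));
        Segment ascent = new Segment(new Point(70.9, 811.0), new Point(81.0, 811.0));
        System.out.println(java.util.Arrays.toString(boundingBox(descent, ascent)));
    }
}
```

Taking the minimum and maximum over all four end points keeps the result an axis-aligned box even for rotated text, at the cost of being looser than the true glyph outline.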

PS: There is a wrapper class for the code in ParsingHelloWorld.extractText(), PdfReaderContentParser, which allows you to simply write the following, given a PdfReader reader, an int page, and a RenderListener renderListener:

PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(page, renderListener);

Answered by plinth

If you're trying to do text extraction, you should be aware that this is decidedly a non-trivial process. At a minimum, you will have to implement an RPN machine to run the code, accumulate transformations, and execute all the text operators. You will need to interpret the font metrics from the current set of page resources, and you will likely need to understand the text encoding.
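
To give a feel for the smallest possible version of such a machine, here is a sketch in plain Java that handles only the operators seen in the question (BT, ET, Td, Tf, Tj). The class and method names are made up for illustration. It assumes whitespace-separated tokens and hex strings without embedded spaces, and it ignores the current transformation matrix, Tm, TJ, text leading and font metrics, so the numbers it produces are text-space offsets rather than final page coordinates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

public class TdTjSketch {

    /** One shown string (still raw hex codes) and the position where it starts. */
    public static final class PositionedCodes {
        public final double x;
        public final double y;
        public final String hexCodes;

        public PositionedCodes(double x, double y, String hexCodes) {
            this.x = x;
            this.y = y;
            this.hexCodes = hexCodes;
        }
    }

    /** Runs a tiny operand-stack interpreter over a text object. */
    public static List<PositionedCodes> interpret(String content) {
        List<PositionedCodes> shown = new ArrayList<PositionedCodes>();
        Deque<String> operands = new ArrayDeque<String>();
        double x = 0;
        double y = 0;

        for (String token : content.trim().split("\\s+")) {
            if (token.equals("BT")) {
                // A new text object starts at the origin of text space.
                x = 0;
                y = 0;
                operands.clear();
            } else if (token.equals("Td")) {
                // Operands are popped in reverse order: ty first, then tx.
                double ty = Double.parseDouble(operands.pop());
                double tx = Double.parseDouble(operands.pop());
                // Td offsets are relative to the start of the current line.
                x += tx;
                y += ty;
            } else if (token.equals("Tf")) {
                operands.pop(); // font size, ignored in this sketch
                operands.pop(); // font resource name, ignored in this sketch
            } else if (token.equals("Tj")) {
                String hex = operands.pop();
                // Strip the surrounding angle brackets of the hex string.
                shown.add(new PositionedCodes(x, y, hex.substring(1, hex.length() - 1)));
            } else if (token.equals("ET")) {
                operands.clear();
            } else {
                operands.push(token); // anything else is treated as an operand
            }
        }
        return shown;
    }

    public static void main(String[] args) {
        String content = "BT 70.9 800.9 Td /F1 14 Tf <01> Tj "
                + "10.1 0 Td <02> Tj 9.3 0 Td <03> Tj ET";
        for (PositionedCodes p : interpret(content)) {
            System.out.println(String.format(Locale.ROOT, "%.1f %.1f <%s>", p.x, p.y, p.hexCodes));
        }
    }
}
```

Because every Tj in the question's stream is preceded by its own Td, summing the Td offsets is enough to place each string there; a real interpreter would also have to advance the position by the width of the glyphs just shown.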

When I worked on Acrobat 1.0, I was responsible for the "Find..." command, which included your problem as a subset. With a richer set of tools and more expertise, it took a couple of months to get it right.

Answered by David van Driessche

If you want to understand what the bytes you're seeing for the Tj operator are, have a look at the PDF specification: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf

More specifically, look at section 9.4.3. To paraphrase that section: each byte, or potentially each sequence of multiple bytes, must be looked up in the font used to paint the text (in your example the font is identified as /F1). By looking it up you'll find the actual character this code refers to.
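
The lookup itself can be sketched in a few lines of plain Java. This assumes one-byte character codes, which is not guaranteed (composite fonts may use multi-byte codes), and the code-to-text table in main is entirely made up; in a real file it would have to be built from the /F1 font dictionary, for example from its encoding or its ToUnicode CMap. The class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class CodeLookupSketch {

    /**
     * Decodes the hex digits of a Tj operand (without angle brackets),
     * assuming every character code is exactly one byte long.
     */
    public static String decode(String hexCodes, Map<Integer, String> codeToText) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 1 < hexCodes.length(); i += 2) {
            int code = Integer.parseInt(hexCodes.substring(i, i + 2), 16);
            String text = codeToText.get(code);
            // Codes missing from the table are marked instead of guessed.
            out.append(text != null ? text : "?");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical mapping; the real one depends on the font in the PDF.
        Map<Integer, String> f1 = new HashMap<Integer, String>();
        f1.put(0x06, "l");
        f1.put(0x07, "o");
        System.out.println(decode("060607", f1));
    }
}
```

Codes like <01>, <02>, <03> are typical of subset fonts whose codes are assigned in order of first use, so they carry no meaning without the font's own mapping.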

Also keep in mind that the order in which you see these text commands might not reflect natural reading order at all, so you'll have to work out the correct order of these characters from the positions you find.
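
A deliberately naive way to restore reading order from positions is to sort top to bottom, then left to right. The sketch below assumes a single column of unrotated, left-to-right text in the default PDF coordinate system (y grows upward) and chunks on the same line sharing exactly the same y value. None of that holds for generic PDFs. The Chunk class is a hypothetical holder, not an iText type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {

    /** Hypothetical holder for a piece of text and the start of its baseline. */
    public static final class Chunk {
        public final String text;
        public final double x;
        public final double y;

        public Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns a copy sorted top-to-bottom, then left-to-right. */
    public static List<Chunk> sortForReading(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<Chunk>(chunks);
        Collections.sort(sorted, new Comparator<Chunk>() {
            @Override
            public int compare(Chunk a, Chunk b) {
                // Larger y is higher on the page, so it comes first.
                int byLine = Double.compare(b.y, a.y);
                return byLine != 0 ? byLine : Double.compare(a.x, b.x);
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<Chunk>();
        chunks.add(new Chunk("world", 50, 700));
        chunks.add(new Chunk("second line", 10, 680));
        chunks.add(new Chunk("Hello", 10, 700));
        for (Chunk c : sortForReading(chunks)) {
            System.out.println(c.text);
        }
    }
}
```

Multi-column layouts, tables and slightly offset baselines all defeat this ordering, so a practical version would need some grouping into lines and columns first.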

Also keep in mind that your PDF file might not contain space characters at all. Since a space can be "faked" by simply moving the next character a bit to the right, some PDF generators omit spaces. But a gap in the coordinates is not necessarily a word break; it could also be the end of a column, for example.
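
A common heuristic for the missing-space case is to insert a space whenever the horizontal gap between two consecutive chunks on a line exceeds some fraction of the width of a space. The sketch below uses half a space width as the threshold, which is an assumption to tune, and it cannot tell a word gap from a column gap. Chunk is a hypothetical holder and the coordinates are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class GapSpaceSketch {

    /** Hypothetical holder: text plus the x range it covers on its line. */
    public static final class Chunk {
        public final String text;
        public final double startX;
        public final double endX;

        public Chunk(String text, double startX, double endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Joins chunks that are already in left-to-right order on one line,
     * adding a space wherever the gap is wider than half a space.
     */
    public static String join(List<Chunk> line, double spaceWidth) {
        StringBuilder out = new StringBuilder();
        Chunk previous = null;
        for (Chunk current : line) {
            if (previous != null && current.startX - previous.endX > spaceWidth / 2) {
                out.append(' ');
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Chunk> line = new ArrayList<Chunk>();
        line.add(new Chunk("Hello", 0, 25));
        line.add(new Chunk("World", 28, 55)); // gap of 3 units before this chunk
        line.add(new Chunk("!", 55, 58));     // no gap before this chunk
        System.out.println(join(line, 4.0));
    }
}
```

With iText, the space width for the current font would be available from getSingleSpaceWidth() on TextRenderInfo, as listed in mkl's answer.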

This is really, really hard, especially if you are trying to do this on generic PDF files (as opposed to only a few layouts that you know always come from the same source). Long ago I wrote a text editor for PDF for a product called PitStop Pro, which is still around (I'm no longer affiliated with it), and it was a really hard problem.

If that is an option, try using an existing library or tool. There are certainly commercial options for such a library or tool; I'm less familiar with open-source / free libraries so I can't comment on that.
