Attribution: this content comes from a StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/2375674/

Time: 2020-08-07 01:54:16  Source: igfitidea

iTextSharp - How to get the position of word on a page

c# pdf itextsharp

Asked by Dave

I am using iTextSharp and the reader.GetPageContent method to pull the text out of a PDF. I need to find the rectangle/position for each word found in the document. Is there any way to get the rectangle/position of a word in a PDF using iTextSharp?

Answered by Mark Storer

Yes, there is. Check out the text.pdf.parser package, specifically LocationTextExtractionStrategy. Actually, that might not do the trick either. You'll probably want to write your own TextExtractionStrategy to feed into PdfTextExtractor:

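// iText 5.x package names (an assumption; adjust for your version).
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

MyTexExStrat strat = new MyTexExStrat();
PdfTextExtractor.getTextFromPage(reader, pageNum, strat);
// get the strings-n-rects from strat.

public class MyTexExStrat implements TextExtractionStrategy {
    public void beginTextBlock() {}
    public void endTextBlock() {}
    public void renderImage(ImageRenderInfo info) {}
    public void renderText(TextRenderInfo info) {
      // track text and location here.
    }
    // The interface also requires this; return whatever text you accumulated.
    public String getResultantText() { return ""; }
}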

You'll probably want to look at the source for LocationTextExtractionStrategy to see how it combines text that shares a baseline. You might even just modify LTES to store parallel arrays of strings and rects.
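
To make the "combine text that shares a baseline" idea concrete, here is a minimal, library-free sketch. It is not the actual LocationTextExtractionStrategy code: the `Chunk` type, the `group` method, and the `maxGap` threshold are all hypothetical names chosen for illustration, and it assumes horizontal, left-to-right text whose baselines can be compared after rounding to whole units.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BaselineWordGrouper {

    /** A run of text with its horizontal extent and baseline height (hypothetical type). */
    static final class Chunk {
        final String text;
        final float x0, x1, baseline;

        Chunk(String text, float x0, float x1, float baseline) {
            this.text = text;
            this.x0 = x0;
            this.x1 = x1;
            this.baseline = baseline;
        }
    }

    /** Chunks whose baselines round to the same integer are treated as one line. */
    private static int line(Chunk c) {
        return Math.round(c.baseline);
    }

    /**
     * Merges chunks that sit on the same line and nearly touch. A new word
     * starts when the line changes or the horizontal gap exceeds maxGap.
     */
    static List<Chunk> group(List<Chunk> chunks, float maxGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Top of the page first (larger y in PDF space), then left to right.
        sorted.sort(Comparator.comparingInt((Chunk c) -> -line(c))
                .thenComparingDouble(c -> c.x0));

        List<Chunk> words = new ArrayList<>();
        Chunk current = null;
        for (Chunk c : sorted) {
            boolean sameLine = current != null && line(current) == line(c);
            if (sameLine && c.x0 - current.x1 <= maxGap) {
                // Extend the current word to cover this chunk as well.
                current = new Chunk(current.text + c.text, current.x0,
                        Math.max(current.x1, c.x1), current.baseline);
            } else {
                if (current != null) {
                    words.add(current);
                }
                current = c;
            }
        }
        if (current != null) {
            words.add(current);
        }
        return words;
    }
}
```

In a real strategy, each `Chunk` would presumably be filled in from `renderText`, using the text and the line coordinates that `TextRenderInfo` exposes. A sensible `maxGap` depends on the font size, so treat any fixed value as a starting point to tune.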

PS: to build the rects, you can just get the AscentLine and DescentLine and use the start of the descent line as the bottom-left corner and the end of the ascent line as the top-right corner:

Vector bottomLeft = info.getDescentLine().getStartPoint();
Vector topRight = info.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(bottomLeft.get(Vector.I1),
                               bottomLeft.get(Vector.I2),
                               topRight.get(Vector.I1),
                               topRight.get(Vector.I2));

Warning: The above code ass-u-mes that the text is horizontal and proceeds from left to right. Rotated text will screw it up, as will vertical text or right-to-left (Arabic, Hebrew) text. For most applications, the above should be fine, but know its limits.
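
One possible way to soften that limitation, sketched here without any iText types, is to take the minimum and maximum over all four endpoints of the descent and ascent lines instead of trusting that one particular endpoint is the bottom-left corner. The `TextBounds` class and `boundingBox` method are hypothetical names; each point is passed as a plain `{x, y}` array, which you would have to fill from the start and end points of the two lines yourself.

```java
final class TextBounds {

    /**
     * Returns {llx, lly, urx, ury}: the axis-aligned box that encloses the
     * four endpoints of the descent and ascent lines, each given as {x, y}.
     */
    static float[] boundingBox(float[] descentStart, float[] descentEnd,
                               float[] ascentStart, float[] ascentEnd) {
        float[][] points = {descentStart, descentEnd, ascentStart, ascentEnd};
        float llx = Float.POSITIVE_INFINITY;
        float lly = Float.POSITIVE_INFINITY;
        float urx = Float.NEGATIVE_INFINITY;
        float ury = Float.NEGATIVE_INFINITY;
        for (float[] p : points) {
            // Track the extremes so the result does not depend on text direction.
            llx = Math.min(llx, p[0]);
            lly = Math.min(lly, p[1]);
            urx = Math.max(urx, p[0]);
            ury = Math.max(ury, p[1]);
        }
        return new float[] {llx, lly, urx, ury};
    }
}
```

The four values can then be passed to the Rectangle constructor in the same order as above. For text rotated by an arbitrary angle the result is only an enclosing box, so it will be larger than the glyphs themselves.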

Good hunting.
