C# iTextSharp - How to get the position of a word on the page

Question

Asked by Dave

I am using iTextSharp and the reader.GetPageContent method to pull the text out of a PDF. I need to find the rectangle/position for each word found in the document. Is there any way to get the rectangle/position of a word in a PDF using iTextSharp?

Answer 1

Answered by Mark Storer

Yes, there is. Check out the text.pdf.parser package, specifically LocationTextExtractionStrategy. Actually, that might not do the trick by itself. You'll probably want to write your own TextExtractionStrategy to feed into PdfTextExtractor:

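// Java iText 5 syntax; iTextSharp exposes the same members in PascalCase
// (GetTextFromPage, RenderText, GetResultantText, ...).
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

MyTexExStrat strat = new MyTexExStrat();
PdfTextExtractor.getTextFromPage(reader, pageNum, strat);
// get the strings-n-rects from strat.

public class MyTexExStrat implements TextExtractionStrategy {
    // Interface methods must be declared public in the implementing class.
    public void beginTextBlock() {}
    public void endTextBlock() {}
    public void renderImage(ImageRenderInfo info) {}
    public void renderText(TextRenderInfo info) {
        // track text and location here.
    }
    // Also required by the interface; return whatever text was collected.
    public String getResultantText() { return ""; }
}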

You'll probably want to look at the source for LocationTextExtractionStrategy to see how it combines text that shares a baseline. You might even just modify LTES to store parallel arrays of strings and rects.
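
The baseline-merging idea can be sketched without iText at all. What follows is a minimal illustration, not LocationTextExtractionStrategy's actual code. Chunk, WordBox, and WordMerger are hypothetical names. It assumes horizontal left-to-right text, chunks that arrive in reading order, and chunks that contain no interior whitespace. It uses the bottom edge of each box as a stand-in for the baseline. In a real strategy, each Chunk would be filled in from renderText.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one rendered chunk: its text and bounding box. */
final class Chunk {
    final String text;
    final float left, bottom, right, top;

    Chunk(String text, float left, float bottom, float right, float top) {
        this.text = text;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }
}

/** Hypothetical result type: a word plus the union of its chunk boxes. */
final class WordBox {
    String text;
    float left, bottom, right, top;

    WordBox(Chunk c) {
        text = c.text;
        left = c.left;
        bottom = c.bottom;
        right = c.right;
        top = c.top;
    }

    /** Appends the chunk's text and grows the box to cover it. */
    void absorb(Chunk c) {
        text += c.text;
        left = Math.min(left, c.left);
        bottom = Math.min(bottom, c.bottom);
        right = Math.max(right, c.right);
        top = Math.max(top, c.top);
    }
}

final class WordMerger {
    /** Tolerance, in page units, for treating two bottoms as the same line (an arbitrary choice). */
    private static final float LINE_TOLERANCE = 0.5f;

    /**
     * Merges chunks into words. A new word starts after a whitespace-only
     * chunk, when the line changes, or when the horizontal gap to the
     * previous chunk exceeds maxGap.
     */
    static List<WordBox> merge(List<Chunk> chunks, float maxGap) {
        List<WordBox> words = new ArrayList<>();
        WordBox current = null;
        for (Chunk c : chunks) {
            if (c.text.trim().isEmpty()) {
                current = null; // whitespace ends the current word
                continue;
            }
            boolean sameLine = current != null
                    && Math.abs(c.bottom - current.bottom) < LINE_TOLERANCE;
            boolean adjacent = sameLine && (c.left - current.right) <= maxGap;
            if (adjacent) {
                current.absorb(c);
            } else {
                current = new WordBox(c);
                words.add(current);
            }
        }
        return words;
    }
}
```

Calling WordMerger.merge(chunks, 1.0f) returns one WordBox per word, each carrying the word's text and the rectangle that covers all of its chunks. A sensible maxGap depends on the font size, so a fixed value like 1.0f is only a starting point.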

PS: to build the rects, you can just get the AscentLine & DescentLine and use the start of the descent line and the end of the ascent line as the bottom-left and top-right corners:

Vector bottomLeft = info.getDescentLine().getStartPoint();
Vector topRight = info.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(bottomLeft.get(Vector.I1),
                               bottomLeft.get(Vector.I2),
                               topRight.get(Vector.I1),
                               topRight.get(Vector.I2));

Warning: The above code assumes that the text is horizontal and proceeds from left to right. Rotated text will break it, as will vertical text or right-to-left (Arabic, Hebrew) text. For most applications, the above should be fine, but know its limits.
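
One way to soften the horizontal-text assumption is to take all four endpoints of the descent and ascent lines and build the axis-aligned box that encloses them. Below is a minimal sketch in plain Java, with each point passed as an {x, y} array. BoundingBox.enclosing is a hypothetical helper, not an iText API. With iText, the four points would presumably come from getStartPoint() and getEndPoint() on both info.getDescentLine() and info.getAscentLine().

```java
final class BoundingBox {
    /**
     * Returns {left, bottom, right, top} of the smallest axis-aligned
     * rectangle that contains every given point. Each point is {x, y}.
     */
    static float[] enclosing(float[]... points) {
        if (points.length == 0) {
            throw new IllegalArgumentException("need at least one point");
        }
        float left = points[0][0], right = points[0][0];
        float bottom = points[0][1], top = points[0][1];
        for (float[] p : points) {
            left = Math.min(left, p[0]);
            right = Math.max(right, p[0]);
            bottom = Math.min(bottom, p[1]);
            top = Math.max(top, p[1]);
        }
        return new float[] {left, bottom, right, top};
    }
}
```

For rotated text this yields the upright box that encloses the glyphs, which is larger than the text itself and does not tell you the text's orientation. It does avoid rectangles whose left edge ends up to the right of their right edge.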

Good hunting.
