C# iTextSharp - How to get the position of a word on a page
Original question: http://stackoverflow.com/questions/2375674/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not this page): StackOverflow
Asked by Dave
I am using iTextSharp and the reader.GetPageContent method to pull the text out of a PDF. I need to find the rectangle/position for each word found in the document. Is there any way to get the rectangle/position of a word in a PDF using iTextSharp?
Answered by Mark Storer
Yes, there is. Check out the text.pdf.parser package, specifically LocationTextExtractionStrategy. Actually, that might not do the trick either. You'll probably want to write your own TextExtractionStrategy to feed into PdfTextExtractor:
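// Java (iText 5) syntax; iTextSharp uses PascalCase names such as GetTextFromPage and RenderText.
import com.itextpdf.text.pdf.parser.*;

MyTexExStrat strat = new MyTexExStrat();
PdfTextExtractor.getTextFromPage(reader, pageNum, strat);
// get the strings-n-rects from strat.

public class MyTexExStrat implements TextExtractionStrategy {
    public void beginTextBlock() {}
    public void endTextBlock() {}
    public void renderImage(ImageRenderInfo info) {}
    public void renderText(TextRenderInfo info) {
        // track text and location here.
    }
    // Required by the interface; the extracted string is not needed here.
    public String getResultantText() { return ""; }
}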
You'll probably want to look at the source for LocationTextExtractionStrategy to see how it combines text that shares a baseline. You might even just modify LTES to store parallel arrays of strings and rects.
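As a rough illustration of the "parallel arrays of strings and rects" idea, here is a minimal, library-free Java sketch. It assumes each fragment seen in renderText has already been reduced to its text plus left/bottom/right/top/baseline numbers. Chunk, WordCollector, lineKey and the maxGap threshold are hypothetical names, not iText types, and this is not how LocationTextExtractionStrategy itself is implemented. It also assumes horizontal, left-to-right text and fragments that do not contain embedded spaces.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical record of one rendered text fragment; not an iText type. */
final class Chunk {
    final String text;
    final double left, bottom, right, top, baseline;

    Chunk(String text, double left, double bottom, double right, double top, double baseline) {
        this.text = text;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
        this.baseline = baseline;
    }

    /** Baselines are rounded to whole units so near-equal values count as one line (assumed tolerance). */
    long lineKey() {
        return Math.round(baseline);
    }
}

final class WordCollector {
    // Parallel lists: words.get(i) is bounded by rects.get(i) = {left, bottom, right, top}.
    final List<String> words = new ArrayList<>();
    final List<double[]> rects = new ArrayList<>();

    private final StringBuilder current = new StringBuilder();
    private double[] currentRect;

    /** Merges fragments on the same line whose horizontal gap is at most maxGap. */
    void collect(List<Chunk> chunks, double maxGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Higher baselines first (PDF y grows upward), then left to right.
        sorted.sort(Comparator.comparingLong((Chunk c) -> -c.lineKey())
                .thenComparingDouble(c -> c.left));

        Chunk prev = null;
        for (Chunk c : sorted) {
            if (c.text.trim().isEmpty()) {
                // A whitespace-only fragment ends the current word.
                flush();
                prev = null;
                continue;
            }
            boolean joins = prev != null
                    && c.lineKey() == prev.lineKey()
                    && c.left - prev.right <= maxGap;
            if (!joins) {
                flush();
            }
            current.append(c.text);
            if (currentRect == null) {
                currentRect = new double[] {c.left, c.bottom, c.right, c.top};
            } else {
                // Grow the rectangle to cover the newly joined fragment.
                currentRect[0] = Math.min(currentRect[0], c.left);
                currentRect[1] = Math.min(currentRect[1], c.bottom);
                currentRect[2] = Math.max(currentRect[2], c.right);
                currentRect[3] = Math.max(currentRect[3], c.top);
            }
            prev = c;
        }
        flush();
    }

    /** Stores the word built so far (if any) and resets the working state. */
    private void flush() {
        if (current.length() > 0) {
            words.add(current.toString());
            rects.add(currentRect);
        }
        current.setLength(0);
        currentRect = null;
    }
}
```

After collect(chunks, 1.0), words.get(i) and rects.get(i) describe the same word. The gap threshold is in page units and would likely need tuning per document and font size.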
PS: To build the rects, you can just get the AscentLine and DescentLine and use their coordinates as the bottom-left and top-right corners:
Vector bottomLeft = info.getDescentLine().getStartPoint();
Vector topRight = info.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(bottomLeft.get(Vector.I1),
                               bottomLeft.get(Vector.I2),
                               topRight.get(Vector.I1),
                               topRight.get(Vector.I2));
Warning: The above code assumes that the text is horizontal and proceeds from left to right. Rotated text will screw it up, as will vertical text or right-to-left (Arabic, Hebrew) text. For most applications, the above should be fine, but know its limits.
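If rotated text matters, one possible workaround (a sketch, not part of the original answer) is to take all four end points of the descent and ascent lines and compute the smallest axis-aligned box that encloses them, instead of trusting two corners. The example below is plain Java with points passed as {x, y} arrays. TextBounds and boundingBox are hypothetical names, and in real code the four points would come from getStartPoint() and getEndPoint() on the descent and ascent lines.

```java
/** Hypothetical helper; works on plain {x, y} arrays rather than iText Vector objects. */
final class TextBounds {
    /**
     * Returns {left, bottom, right, top} of the smallest axis-aligned box
     * enclosing the two ends of the descent line and the two ends of the ascent line.
     */
    static double[] boundingBox(double[] descentStart, double[] descentEnd,
                                double[] ascentStart, double[] ascentEnd) {
        double[][] corners = {descentStart, descentEnd, ascentStart, ascentEnd};
        double left = Double.POSITIVE_INFINITY;
        double bottom = Double.POSITIVE_INFINITY;
        double right = Double.NEGATIVE_INFINITY;
        double top = Double.NEGATIVE_INFINITY;
        for (double[] p : corners) {
            left = Math.min(left, p[0]);
            bottom = Math.min(bottom, p[1]);
            right = Math.max(right, p[0]);
            top = Math.max(top, p[1]);
        }
        return new double[] {left, bottom, right, top};
    }
}
```

For horizontal text this gives the same rectangle as the two-corner version. For rotated text the box stays valid (left is never greater than right), although it is larger than the slanted outline of the text itself.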
Good hunting.