Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/10879336/


iText - Get Font size and family of a text segment

Tags: java, pdf, itext, text-extraction, pdf-extraction

Asked by Prine

I'm currently trying to automatically extract important keywords from a PDF file. I am able to get the text out of the PDF document, but now I need to know which font size and font family these keywords have.

This is the code I have so far:

Main

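// Imports assume iText 5 (com.itextpdf packages).
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.RegionTextRenderFilter;
import com.itextpdf.text.pdf.parser.RenderFilter;

public static void main(String[] args) throws IOException {
    String src = "SEM_081145.pdf";

    PdfReader reader = new PdfReader(src);

    SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();

    PrintWriter out = new PrintWriter(new FileOutputStream(src + ".txt"));
    // rect and filter are only needed by the commented-out filtered strategy below.
    Rectangle rect = new Rectangle(70, 80, 490, 580);
    RenderFilter filter = new RegionTextRenderFilter(rect);

    // iText page numbers start at 1.
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        out.println(PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy));
    }
    out.flush();
    out.close();
}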

I have also implemented the TextExtractionStrategy SemTextExtractionStrategy, which looks like this:

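// Imports assume iText 5 (com.itextpdf packages).
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class SemTextExtractionStrategy implements TextExtractionStrategy {

    // Holds only the most recently rendered chunk.
    private String text;

    @Override
    public void beginTextBlock() {
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        text = renderInfo.getText();

        System.out.println(renderInfo.getFont().getFontType());

        System.out.print(text);
    }

    @Override
    public void endTextBlock() {
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
    }

    @Override
    public String getResultantText() {
        return text;
    }
}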

I can get the font type, but there is no method to get the font size. How can I get the font size of the current text segment?

Or are there other libraries that can extract the font size from text segments? I have already looked into PDFBox and PDFTextStream. The PDF Shareware Library from Aspose would do the job perfectly, but it is very expensive and I need to use an open source project.

Accepted answer by Alexis Pigeon

You can adapt the code provided in this answer, in particular this code snippet:

Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;

This answer is in C#, but the API is so similar that the conversion to Java should be straightforward.

Answer by Prine

Thanks to Alexis I could convert his C# solution into Java code:

text = renderInfo.getText();

Vector curBaseline = renderInfo.getBaseline().getStartPoint();
Vector topRight = renderInfo.getAscentLine().getEndPoint();

Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1), topRight.get(0), topRight.get(1));
float curFontSize = rect.getHeight();

Answer by Wilfred Springer

I had some trouble using Alexis' and Prine's solution, since it doesn't deal with rotated text correctly. So this is what I do (sorry, in Scala):

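val x0 = info.getAscentLine.getEndPoint
val x1 = info.getBaseline.getStartPoint
val x2 = info.getBaseline.getEndPoint
// |(x2 - x1) x (x1 - x0)|^2 and |x2 - x1|^2
val length1 = x2.subtract(x1).cross(x1.subtract(x0)).lengthSquared
val length2 = x2.subtract(x1).lengthSquared
// length1 / length2 is the squared distance from x0 to the baseline,
// so take the square root to get the height itself.
if (length2 == 0f) 0f else math.sqrt(length1 / length2).toFloat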
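
For readers who want the rotation-safe idea in Java, here is a minimal sketch. It measures the perpendicular distance from the end of the ascent line to the line through the baseline, which does not depend on how the text is rotated. To keep it self-contained, plain float arrays {x, y, z} stand in for iText's Vector; the class and method names (RotatedTextHeight, height, and the helpers) are made up for illustration and are not part of iText. The sketch takes the square root of the ratio so that the result is a length rather than a squared length.

```java
// Sketch only: plain float arrays {x, y, z} stand in for iText's Vector.
final class RotatedTextHeight {

    static float[] subtract(float[] a, float[] b) {
        return new float[] {a[0] - b[0], a[1] - b[1], a[2] - b[2]};
    }

    static float[] cross(float[] a, float[] b) {
        return new float[] {
            a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]
        };
    }

    static float lengthSquared(float[] v) {
        return v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
    }

    /** Perpendicular distance from ascentEnd to the line through the two baseline points. */
    static float height(float[] ascentEnd, float[] baselineStart, float[] baselineEnd) {
        float[] direction = subtract(baselineEnd, baselineStart);
        float denominator = lengthSquared(direction);
        if (denominator == 0f) {
            return 0f; // degenerate baseline, for example an empty chunk
        }
        float numerator = lengthSquared(cross(direction, subtract(baselineStart, ascentEnd)));
        return (float) Math.sqrt(numerator / denominator);
    }

    public static void main(String[] args) {
        // Horizontal text: baseline from (0,0) to (50,0), ascent line 10 units above it.
        System.out.println(height(new float[] {50, 10, 1}, new float[] {0, 0, 1}, new float[] {50, 0, 1}));
        // The same text rotated by 90 degrees: the baseline now runs up the y axis.
        System.out.println(height(new float[] {-10, 50, 1}, new float[] {0, 0, 1}, new float[] {0, 50, 1}));
    }
}
```

Both calls in main describe the same glyph box, once horizontal and once rotated, and should report the same height. With iText, the three points would presumably come from renderInfo.getAscentLine().getEndPoint(), renderInfo.getBaseline().getStartPoint() and renderInfo.getBaseline().getEndPoint(), read out with get(0), get(1) and get(2).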

Answer by KimvdLinde

If you want the exact fontsize, use the following code in your renderText:

float fontsize = renderInfo.getAscentLine().getStartPoint().get(1)
     - renderInfo.getDescentLine().getStartPoint().get(1);

Modify this as indicated in the other answers for rotated text.
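
One possible way to make the ascent-to-descent measurement independent of rotation is to take the Euclidean distance between the two start points instead of only the difference in y. The sketch below assumes the text is not skewed, so that the start points of the ascent and descent lines lie on a line perpendicular to the baseline. The class and method names (AscentDescentDistance, distance) are hypothetical, and plain floats are used instead of iText's Vector so that the example runs on its own.

```java
// Sketch only: the four floats would come from the x and y components
// of the ascent-line and descent-line start points.
final class AscentDescentDistance {

    /** Euclidean distance between the start points of the ascent and descent lines. */
    static float distance(float ascentX, float ascentY, float descentX, float descentY) {
        return (float) Math.hypot(ascentX - descentX, ascentY - descentY);
    }

    public static void main(String[] args) {
        // Horizontal text: the two points differ only in y.
        System.out.println(distance(100f, 708f, 100f, 696f));
        // Text rotated by 90 degrees: the two points differ only in x,
        // so a plain y difference would report 0 here.
        System.out.println(distance(92f, 300f, 104f, 300f));
    }
}
```

Both calls describe points 12 units apart, so the rotated case yields the same value as the horizontal one.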
