Java: How to extract font styles of text contents using pdfbox?
Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/6939583/
How to extract font styles of text contents using pdfbox?
Asked by Master Stroke
I am using the pdfbox library to extract text content from a PDF file. I am able to extract all the text, but couldn't find a method to extract font styles.
Answered by Harpreet
This is not the right way to extract fonts. To read the fonts, iterate over the PDF pages and get the fonts from each page's resources, as below:
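// Uses the PDFBox 1.x API (getAllPages() and getFonts() were removed in 2.x)
PDDocument doc = PDDocument.load("C:/mydoc3.pdf");
List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : pages) {
    // Fonts declared in this page's resources, keyed by resource name
    Map<String, PDFont> pageFonts = page.getResources().getFonts();
}
doc.close();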
Answered by Master Stroke
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class Pdf2Box {
    public static void main(String[] args) {
        try {
            PDDocument pddDocument = PDDocument.load("table2.pdf");
            PDFTextStripper textStripper = new PDFTextStripper();
            // Print the plain text of the whole document
            System.out.println(textStripper.getText(pddDocument));
            // Note: the returned font map is not used here
            textStripper.getFonts();
            pddDocument.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Answered by Walid Bousseta
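import java.io.File;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

// Uses the PDFBox 2.x API
File file = new File("sample.pdf");
PDDocument document = PDDocument.load(file);
for (int i = 0; i < document.getNumberOfPages(); ++i) {
    PDPage page = document.getPage(i);
    PDResources res = page.getResources();
    // Look up every font registered in this page's resources
    for (COSName fontName : res.getFontNames()) {
        PDFont font = res.getFont(fontName);
        System.out.println(font.getName());
    }
}
document.close();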
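The loop above prints raw font names, which often look like "ABCDEF+Arial-BoldMT" (a six-letter subset tag, a plus sign, then the base name). As a rough sketch, the style can be guessed from such a name with plain string checks. The class FontNameStyle and its methods are hypothetical helpers, not part of pdfbox, and the keyword matching is only a heuristic: a font can be bold or italic without saying so in its name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of pdfbox): guesses style hints from a font name.
final class FontNameStyle {

    // Subset fonts are named with six uppercase letters plus '+', e.g. "ABCDEF+Arial-BoldMT".
    static String stripSubsetPrefix(String fontName) {
        if (fontName.length() > 7 && fontName.charAt(6) == '+') {
            for (int i = 0; i < 6; i++) {
                char c = fontName.charAt(i);
                if (c < 'A' || c > 'Z') {
                    return fontName; // not a subset tag, keep the name unchanged
                }
            }
            return fontName.substring(7);
        }
        return fontName;
    }

    // Heuristic: many bold fonts have "Bold" somewhere in their name.
    static boolean looksBold(String fontName) {
        return fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    // Heuristic: italic faces are usually named "Italic" or "Oblique".
    static boolean looksItalic(String fontName) {
        String lower = fontName.toLowerCase(Locale.ROOT);
        return lower.contains("italic") || lower.contains("oblique");
    }

    // Returns the base name followed by the guessed styles, e.g. "Arial-BoldMT [bold]".
    static String describe(String fontName) {
        if (fontName == null) {
            return "(unnamed font)"; // a font may have no name
        }
        String base = stripSubsetPrefix(fontName);
        List<String> styles = new ArrayList<>();
        if (looksBold(base)) {
            styles.add("bold");
        }
        if (looksItalic(base)) {
            styles.add("italic");
        }
        if (styles.isEmpty()) {
            styles.add("regular");
        }
        return base + " " + styles;
    }

    public static void main(String[] args) {
        System.out.println(describe("ABCDEF+Arial-BoldMT")); // Arial-BoldMT [bold]
        System.out.println(describe("Helvetica-Oblique"));   // Helvetica-Oblique [italic]
        System.out.println(describe("TimesNewRomanPSMT"));   // TimesNewRomanPSMT [regular]
    }
}
```

In the loop above, this would be called as FontNameStyle.describe(font.getName()) in place of printing the name directly.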