Java code for PDF to XML conversion
Original question: http://stackoverflow.com/questions/16936013/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by nikhil
I tried iText but could not get XML output.
I want to convert a PDF to XML.
I only need the text, with its location and size, in XML format. Can anyone help me convert a PDF to XML using Java?
Answered by Liping Huang
There is a library, pdf2htmlEX, which can convert a PDF to HTML without losing text or formatting.
Hope this can help you.
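As far as I know, pdf2htmlEX is a standalone command-line program rather than a Java library, so from Java it would have to be started as an external process. The following is a minimal sketch under that assumption: it presumes a `pdf2htmlEX` executable is on the `PATH` and uses only the basic `pdf2htmlEX <input.pdf> <output.html>` invocation. The class and method names (`Pdf2HtmlExRunner`, `buildCommand`, `convert`) are made up for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class Pdf2HtmlExRunner {

    // Builds the argument list for the basic "pdf2htmlEX input.pdf output.html" form.
    // Assumption: the executable is named "pdf2htmlEX" and is reachable via PATH.
    public static List<String> buildCommand(String inputPdf, String outputHtml) {
        return Arrays.asList("pdf2htmlEX", inputPdf, outputHtml);
    }

    // Starts the external tool, forwards its console output, and returns its exit code.
    public static int convert(String inputPdf, String outputHtml)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(inputPdf, outputHtml))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("Usage: java Pdf2HtmlExRunner <input.pdf> <output.html>");
            return;
        }
        int exitCode = convert(args[0], args[1]);
        System.out.println("pdf2htmlEX exited with code " + exitCode);
    }
}
```

Note that this produces HTML, not XML, so it only helps if HTML output is acceptable or is post-processed afterwards.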
Answered by Swayam
This is the code I use in my own applications. I don't remember where I got it from, but it sure works like a charm.
    import java.io.File;

    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerConfigurationException;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamResult;

    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.AttributesImpl;

    // Assumption: iText 2.x package names. Early iText 5 releases use
    // com.itextpdf.text.pdf instead; later ones changed the PRTokeniser API.
    import com.lowagie.text.pdf.PRTokeniser;
    import com.lowagie.text.pdf.PdfReader;

    public class ConvertPDFToXML {
        static StreamResult streamResult;
        static TransformerHandler handler;
        static AttributesImpl atts;

        public static void main(String[] args) {
            try {
                // Backslashes in Java string literals must be escaped.
                PdfReader reader = new PdfReader("C:\\hello.pdf");
                // Decoded content of page 1; also works when the page has several content streams.
                byte[] streamBytes = reader.getPageContent(1);
                reader.close();

                // Collect every string operand found in the content stream.
                PRTokeniser tokenizer = new PRTokeniser(streamBytes);
                StringBuilder text = new StringBuilder();
                while (tokenizer.nextToken()) {
                    if (tokenizer.getTokenType() == PRTokeniser.TK_STRING) {
                        text.append(tokenizer.getStringValue());
                    }
                }

                streamResult = new StreamResult(new File("data.xml"));
                initXML();
                process(text.toString());
                closeXML();
            } catch (Exception e) {
                // Report the failure instead of silently swallowing it.
                e.printStackTrace();
            }
        }

        // Sets up an indented SAX serializer and opens the root <data> element.
        public static void initXML() throws TransformerConfigurationException, SAXException {
            SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            handler = tf.newTransformerHandler();
            Transformer serializer = handler.getTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
            serializer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            handler.setResult(streamResult);
            handler.startDocument();
            atts = new AttributesImpl();
            handler.startElement("", "", "data", atts);
        }

        // Writes the text before the first '|' as a <Message> element.
        public static void process(String s) throws SAXException {
            // '|' is a regex metacharacter, so it needs a double backslash here.
            String[] elements = s.split("\\|");
            atts.clear();
            handler.startElement("", "", "Message", atts);
            handler.characters(elements[0].toCharArray(), 0, elements[0].length());
            handler.endElement("", "", "Message");
        }

        // Closes the root element and finishes the document.
        public static void closeXML() throws SAXException {
            handler.endElement("", "", "data");
            handler.endDocument();
        }
    }
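The code above writes only the raw strings, while the question asks for each piece of text together with its location and size. Below is a minimal sketch of how such output could be shaped with the same SAX serializer approach, using only the JDK. The `TextChunk` class, its fields (`x`, `y`, `fontSize`), and the `page`/`text` element names are hypothetical. Obtaining real coordinates and font sizes from a PDF would need a PDF library's text extraction API, which is not shown here.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;

import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class TextChunkXmlWriter {

    // Hypothetical holder for one piece of extracted text; not a type from any PDF library.
    public static final class TextChunk {
        final String text;
        final float x;
        final float y;
        final float fontSize;

        public TextChunk(String text, float x, float y, float fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    // Serializes the chunks as <page><text x=".." y=".." fontSize="..">..</text></page>.
    public static String toXml(List<TextChunk> chunks) throws Exception {
        SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = tf.newTransformerHandler();
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        handler.startDocument();
        handler.startElement("", "page", "page", new AttributesImpl());
        for (TextChunk chunk : chunks) {
            writeChunk(handler, chunk);
        }
        handler.endElement("", "page", "page");
        handler.endDocument();
        return out.toString();
    }

    // Emits one <text> element; position and size go into attributes.
    private static void writeChunk(TransformerHandler handler, TextChunk chunk)
            throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "x", "x", "CDATA", Float.toString(chunk.x));
        atts.addAttribute("", "y", "y", "CDATA", Float.toString(chunk.y));
        atts.addAttribute("", "fontSize", "fontSize", "CDATA", Float.toString(chunk.fontSize));
        handler.startElement("", "text", "text", atts);
        // The serializer escapes characters such as '&' and '<' itself.
        handler.characters(chunk.text.toCharArray(), 0, chunk.text.length());
        handler.endElement("", "text", "text");
    }

    public static void main(String[] args) throws Exception {
        List<TextChunk> chunks = Arrays.asList(
                new TextChunk("Hello", 72f, 700f, 12f),
                new TextChunk("World & more", 72f, 680f, 10f));
        System.out.println(toXml(chunks));
    }
}
```

Putting the coordinates in attributes keeps the element body as plain text, which makes the file easy to query later with XPath.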