Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/13815119/

Time: 2020-10-31 14:03:57  Source: igfitidea

Apache POI - converting *.doc to *.html with images

Tags: java, apache-poi, doc

Question

I have a DOC file that contains some images. How can I convert it to HTML and keep the images?

I tried to use the example from "Convert Word doc to HTML programmatically in Java":

public class Converter {
    ...

    private File docFile, htmlFile;

    try {
        // Load the binary .doc file; the stream is closed once the document is in memory.
        HWPFDocument doc;
        try (FileInputStream in = new FileInputStream(docFile)) {
            doc = new HWPFDocument(in);
        }
        Document newDoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        // Convert the Word document into an HTML DOM tree.
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDoc);
        wordToHtmlConverter.processDocument(doc);

        // Serialize the DOM tree to an HTML string.
        StringWriter stringWriter = new StringWriter();

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(
                new DOMSource(wordToHtmlConverter.getDocument()),
                new StreamResult(stringWriter)
        );

        String html = stringWriter.toString();

        // Write the HTML to disk as UTF-8; try-with-resources closes the writer.
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(htmlFile), "UTF-8"))) {
            out.write(html);
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Display the generated file in a Swing window.
        JEditorPane jEditorPane = new JEditorPane();
        jEditorPane.setContentType("text/html");
        jEditorPane.setEditable(false);
        jEditorPane.setPage(htmlFile.toURI().toURL());

        JScrollPane jScrollPane = new JScrollPane(jEditorPane);

        JFrame jFrame = new JFrame("display html file");
        jFrame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        jFrame.getContentPane().add(jScrollPane);
        jFrame.setSize(512, 342);
        jFrame.setVisible(true);

    } catch (Exception e) {
        e.printStackTrace();
    }
    ...
}

But the images are lost.

The documentation for the WordToHtmlConverter class says the following:

...this implementation doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method.

How can I convert a DOC file to HTML with its images?

Accepted answer by Gagravarr

Your best bet in this case is to use Apache Tika, and let it wrap Apache POI for you. Apache Tika will generate HTML for your document (or plain text, but you want the HTML for your case). Along with that, it'll put in placeholders for embedded resources, img tags for embedded images, and provide you with a way to get at the contents of the embedded resources and images.

There's a very good example of doing this in Alfresco: HTMLRenderingEngine. You'll likely want to review the code there, then write your own to do something very similar. The code there includes a custom ContentHandler that edits the img tags to rewrite their src attributes; you may or may not need that, depending on where you're going to write the images out to.
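
To give a feel for the src rewriting step, here is a minimal sketch that uses only the JDK's SAX classes rather than Tika or Alfresco code. The class name, the "embedded:" placeholder prefix, and the image base URL are all assumptions made for illustration; check what the img tags in your generated HTML actually contain before relying on the prefix.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX filter that points img src placeholders at a real location. */
final class ImgSrcRewriteSketch extends XMLFilterImpl {

    // Assumed placeholder prefix; verify it against your own output.
    private static final String EMBEDDED_PREFIX = "embedded:";

    private final String imageBaseUrl;

    ImgSrcRewriteSketch(XMLReader parent, String imageBaseUrl) {
        super(parent);
        this.imageBaseUrl = imageBaseUrl;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        Attributes result = atts;
        if ("img".equals(localName)) {
            int srcIndex = atts.getIndex("src");
            if (srcIndex >= 0 && atts.getValue(srcIndex).startsWith(EMBEDDED_PREFIX)) {
                // Copy the attributes and swap the placeholder for the real location.
                AttributesImpl copy = new AttributesImpl(atts);
                String name = atts.getValue(srcIndex).substring(EMBEDDED_PREFIX.length());
                copy.setValue(srcIndex, imageBaseUrl + name);
                result = copy;
            }
        }
        super.startElement(uri, localName, qName, result);
    }

    /** Parses well-formed XHTML, rewrites the img sources, and returns the serialized markup. */
    static String rewrite(String xhtml, String imageBaseUrl) {
        try {
            SAXParserFactory parserFactory = SAXParserFactory.newInstance();
            parserFactory.setNamespaceAware(true);
            XMLReader reader = parserFactory.newSAXParser().getXMLReader();

            // An identity TransformerHandler turns the SAX events back into text.
            SAXTransformerFactory transformerFactory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serializer = transformerFactory.newTransformerHandler();
            serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.setResult(new StreamResult(out));

            ImgSrcRewriteSketch filter = new ImgSrcRewriteSketch(reader, imageBaseUrl);
            filter.setContentHandler(serializer);
            filter.parse(new InputSource(new StringReader(xhtml)));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException
                | TransformerConfigurationException e) {
            throw new IllegalStateException("Could not rewrite the markup", e);
        }
    }
}
```

Calling ImgSrcRewriteSketch.rewrite(markup, "images/") on markup containing src="embedded:image1.png" should yield src="images/image1.png". Because XMLFilterImpl also implements ContentHandler, the same class could in principle sit between a parser and its output handler, which is the role the custom handler plays in the Alfresco example; that wiring is not shown or tested here.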

Answer by raok1997

Extend WordToHtmlConverter and override processImageWithoutPicturesManager.

import java.util.Base64;

import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Embeds each picture directly in the HTML as a base64 data URI.
public class InlineImageWordToHtmlConverter extends WordToHtmlConverter {

    public InlineImageWordToHtmlConverter(Document document) {
        super(document);
    }

    @Override
    protected void processImageWithoutPicturesManager(Element currentBlock,
            boolean inlined, Picture picture) {
        Element imgNode = currentBlock.getOwnerDocument().createElement("img");
        // getEncoder() keeps the payload on one line; getMimeEncoder() would
        // insert a CRLF after every 76 characters.
        String base64 = Base64.getEncoder().encodeToString(picture.getRawContent());
        imgNode.setAttribute("src", "data:" + picture.getMimeType() + ";base64," + base64);
        currentBlock.appendChild(imgNode);
    }

}
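
The data URI construction can be tried on its own, without POI on the classpath. The sketch below is a hypothetical helper (the class and method names are mine, not part of Apache POI) that takes the MIME type and raw bytes you would get from picture.getMimeType() and picture.getRawContent(). One detail worth checking is which Base64 encoder is used: Base64.getMimeEncoder() breaks its output into lines, while Base64.getEncoder() does not, so the latter is the safer choice for an attribute value.

```java
import java.util.Base64;

/** Hypothetical helper (not part of Apache POI) that builds a data: URI for an image. */
final class DataUriSketch {

    /** Returns "data:<mimeType>;base64,<payload>" with the payload on a single line. */
    static String toDataUri(String mimeType, byte[] content) {
        // The basic encoder never inserts line separators, unlike the MIME encoder.
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) {
        byte[] fakeImage = new byte[100]; // stand-in for picture.getRawContent()

        // No line breaks in the data URI built with the basic encoder.
        System.out.println(toDataUri("image/png", fakeImage).contains("\n")); // false

        // The MIME encoder wraps anything longer than 76 characters.
        System.out.println(Base64.getMimeEncoder().encodeToString(fakeImage).contains("\r\n")); // true
    }
}
```

Inlining keeps the HTML self-contained, at the cost of a larger file, since base64 grows the image data by about a third.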

Use the new class when converting the document, as shown below:

HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream("D:/temp/Temp.doc"));
WordToHtmlConverter wordToHtmlConverter = new InlineImageWordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
wordToHtmlConverter.processDocument(wordDocument);
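
The snippet above stops once the converter has filled its DOM tree. To get an HTML string, the tree returned by getDocument() (the call used in the question's code) still has to be serialized. Here is a minimal sketch of that last step using only the JDK; the class name is hypothetical, and the hand-built sample document is an assumption standing in for wordToHtmlConverter.getDocument().

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper that serializes a DOM tree using the HTML output method. */
final class HtmlSerializationSketch {

    /** Serializes the given DOM tree to an HTML string. */
    static String toHtml(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize the document", e);
        }
    }

    /** Builds a tiny stand-in for the DOM tree a converter would produce. */
    static Document sampleDocument() {
        try {
            Document document =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = document.createElement("html");
            Element body = document.createElement("body");
            Element img = document.createElement("img");
            img.setAttribute("src", "data:image/png;base64,AQID");
            body.appendChild(img);
            html.appendChild(body);
            document.appendChild(html);
            return document;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(sampleDocument()));
    }
}
```

With the converter in place, you would pass wordToHtmlConverter.getDocument() to toHtml instead of the sample document, then write the string to a file as in the question.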