Convert Word doc to HTML programmatically in Java

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/227236/

Time: 2020-08-11 11:44:27  Source: igfitidea

java html ms-word

Asked by kaychaks

I need to convert a Word document into HTML file(s) in Java. The function takes a Word document as input and outputs HTML file(s) based on the number of pages in the document, i.e. if the Word document has 3 pages, then 3 HTML files are generated, split at the required page breaks.

I searched for open source/non-commercial APIs that can convert doc to HTML, but found nothing. If anybody has done this kind of job before, please help.

Thanks

Accepted answer by Chase Seibert

We use tm-extractors (http://mvnrepository.com/artifact/org.textmining/tm-extractors), and fall back to the commercial Aspose (http://www.aspose.com/). Both have native Java APIs.

Answer by DavidG

You'd have to find the MS Word .doc specification (the format is basically a binary dump of whatever is in Word at that point in time) and slowly go through it element by element, converting MS Word "objects/states" to their HTML equivalents. You might be able to find a script that does it for you, since this really isn't fun work and I'd advise against it (converting file formats, or even reading commercial files on your own, is always hard and often incomplete). PS: just google doc2html.

Answer by Vincent Ramdhanie

If you are targeting Word 2007 files using the OOXML format, then this article might help. There is also the Ooxml4j project, which is implementing OOXML as a Java library.

If you are targeting the binary files though... that's another problem.
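
Since the two formats need different tools, it can help to check which one you actually have before picking a converter. The sketch below is not from the answer; the class and method names are made up for illustration. It only looks at the leading bytes: old binary .doc files are stored in an OLE2 container, while .docx files are ZIP archives. Note that other file types share these signatures (.xls and .ppt are OLE2 too, and any ZIP starts with "PK"), so treat the result as a hint, not proof.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: guesses the container format from the file signature.
final class WordFormatSniffer {
    enum Kind { BINARY_DOC, OOXML_ZIP, UNKNOWN }

    // OLE2 compound document signature (used by .doc, but also .xls and .ppt).
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local file header signature (used by .docx, but also any other ZIP).
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2)) return Kind.BINARY_DOC;
        if (startsWith(header, ZIP)) return Kind.OOXML_ZIP;
        return Kind.UNKNOWN;
    }

    // Reads only the first 8 bytes of the file (requires Java 11+ for readNBytes).
    static Kind sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(8));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A caller would route BINARY_DOC files to a binary-capable tool and OOXML_ZIP files to an XML-based one.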

Answer by Jamie Love

I've used the following approach successfully in production systems where the new MS Word XML format isn't available:

Spawn a process that does something similar to:

http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html

You'd probably want to start OpenOffice once at startup of your program, and call the Python script as many times as you need to during your program (with some sort of check to ensure the ooffice process is always there).
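
One way to do the "is ooffice still there" check from Java is sketched below. It assumes the office process was started so that it accepts connections on a TCP port (another answer in this thread uses localhost and port 8100); the class name and the timeout are made up for illustration. It only proves that something is listening on the port, not that it is really OpenOffice.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical watchdog helper: checks whether anything accepts TCP connections on host:port.
final class OfficeWatchdog {
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // connect() throws if the port is closed or the timeout expires
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Before each conversion you could call something like OfficeWatchdog.isListening("localhost", 8100, 500) and restart the office process when it returns false.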

The other option is to spawn the following sort of command every time you need to do the conversion:

ooffice -headless "macro://<path to ooffice vb macro to convert, with parameter pointing to file>"

I've used the macro approach multiple times and it works well (sorry, I don't have the macro code available).
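
Spawning that command from Java could look roughly like the sketch below. The macro URL is left as a parameter because the answer does not give the actual macro; the class name, the timeout handling, and discarding the process output are my own assumptions. Passing the arguments as a list means no shell quoting is needed.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around the "ooffice -headless <macro url>" command from the answer.
final class OfficeMacroRunner {
    // macroUrl is whatever "macro://..." URL your own conversion macro needs.
    static List<String> buildCommand(String macroUrl) {
        return List.of("ooffice", "-headless", macroUrl);
    }

    // Runs the command, waits up to timeoutSeconds, and returns the exit code.
    static int run(String macroUrl, long timeoutSeconds) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(macroUrl))
                .redirectErrorStream(true)                        // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // so the child never blocks on a full pipe
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("ooffice did not finish within " + timeoutSeconds + "s");
        }
        return process.exitValue();
    }
}
```

The timeout matters in production: a hung office process would otherwise block the calling thread forever.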

While there are mechanisms for doing it via MS Word, they're not easy from Java, and do require other support programs to drive MS Word via OLE.

I've used AbiWord before too, which works well for many documents but does get confused by more complex ones (ooffice seems to handle everything I've thrown at it). AbiWord has a slightly easier command-line interface for conversion than ooffice.

Answer by Chase Seibert

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import officetools.OfficeFile; // package available at www.dancrintea.ro/doc-to-pdf/
...
FileInputStream fis = new FileInputStream(new File("test.doc"));
FileOutputStream fos = new FileOutputStream(new File("test.html"));
OfficeFile f = new OfficeFile(fis,"localhost","8100", true);
f.convert(fos,"html");

All possible conversions:

doc --> pdf, html, txt, rtf
xls --> pdf, html, csv
ppt --> pdf, swf
html --> pdf

Answer by JasonPlutext

If it's a docx, you could use docx4j (ASL v2). This uses XSLT to create the HTML.

However, it will give you a single HTML for the whole document.

If you wanted an HTML per page, you could do something with the lastRenderedPageBreak tag that Word puts into the docx (assuming you used Word to create it).
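
As a rough illustration of the lastRenderedPageBreak idea, the sketch below uses only the JDK's DOM parser (not docx4j) on the text of word/document.xml, which you would first read out of the docx ZIP. The class and method names are made up. It makes a simplifying assumption: a paragraph that contains a lastRenderedPageBreak is moved to the next page as a whole, even though the real break may fall in the middle of the paragraph.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: groups paragraph texts into pages using w:lastRenderedPageBreak markers.
final class PageSplitter {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static List<List<String>> splitIntoPages(String documentXml) {
        Document dom;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            dom = factory.newDocumentBuilder().parse(new InputSource(new StringReader(documentXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("document.xml could not be parsed", e);
        }

        List<List<String>> pages = new ArrayList<>();
        pages.add(new ArrayList<>());
        NodeList paragraphs = dom.getElementsByTagNameNS(W, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            boolean hasBreak = p.getElementsByTagNameNS(W, "lastRenderedPageBreak").getLength() > 0;
            // Start a new page unless the current one is still empty.
            if (hasBreak && !pages.get(pages.size() - 1).isEmpty()) {
                pages.add(new ArrayList<>());
            }
            pages.get(pages.size() - 1).add(textOf(p));
        }
        return pages;
    }

    // Concatenates the text of all w:t elements inside one paragraph.
    private static String textOf(Element paragraph) {
        StringBuilder text = new StringBuilder();
        NodeList runs = paragraph.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < runs.getLength(); i++) {
            text.append(runs.item(i).getTextContent());
        }
        return text.toString();
    }
}
```

Each inner list could then be rendered to its own HTML file. Keep in mind the markers are only present if Word itself last saved the file, as the answer notes.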

Answer by JasonPlutext

It is easier to do this with the new MS Word docx format, as the format is XML. You can use XSL to transform the Word document's XML into HTML.
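
To give an idea of what the XSL route looks like, here is a deliberately tiny sketch using the JDK's built-in javax.xml.transform. The stylesheet is my own toy example, not a real Word-to-HTML stylesheet: it only turns each w:p paragraph into a p element with its plain text and ignores styles, tables, images and everything else. The input is assumed to be the content of word/document.xml, already extracted from the docx ZIP.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical minimal converter: paragraphs only, no formatting.
final class DocxXslt {
    static final String XSLT = String.join("\n",
        "<xsl:stylesheet version='1.0'",
        "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'",
        "    xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'",
        "    exclude-result-prefixes='w'>",
        "  <xsl:output method='html' indent='no'/>",
        "  <xsl:template match='/'>",
        "    <html><body><xsl:apply-templates select='//w:body/w:p'/></body></html>",
        "  </xsl:template>",
        "  <xsl:template match='w:p'>",
        "    <p><xsl:apply-templates select='.//w:t'/></p>",
        "  </xsl:template>",
        "  <xsl:template match='w:t'><xsl:value-of select='.'/></xsl:template>",
        "</xsl:stylesheet>");

    static String toHtml(String documentXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(documentXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("transformation failed", e);
        }
    }
}
```

A production stylesheet has to handle far more of the format, which is why a library that ships its own stylesheets is usually the more practical choice.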

If, however, your Word doc is in an old version, you can use the POI library (http://poi.apache.org/) to read it into a Java object, and from that point on you can easily convert it to HTML using an HTML Java library:

http://www.dom4j.org/dom4j-1.4/apidocs/org/dom4j/io/HTMLWriter.html

Answer by Paul Jowett

I see this thread turns up in external links and gets the occasional post, so I thought I'd post an update (hope no one minds). OpenOffice continues to evolve, and release 3.2 improves the Word import/export filters again. OpenOffice and Java can run on many platforms, so Java systems can use the OpenOffice UNO API directly to import/manipulate/export documents in many formats (including Word and PDF), or use a library like JODReports or Docmosis to make that easier. Both have free/open options.

Answer by Fisher

I recommend JODConverter. It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today.

JODConverter has plenty of documentation, scripts, and tutorials to help you out.

Answer by Yusuf D.Kutni Z Felemban

I tried the approach from this site and it worked for me: http://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML

This only works with docx. It converts the document to HTML, including the images inside the Word document.

    // Needs java.io.*, org.apache.poi.xwpf.usermodel.XWPFDocument and the xdocreport
    // XHTML converter classes (their package name depends on the xdocreport version).

    // 1) Load DOCX into XWPFDocument
    String docName = "document";
    InputStream in = new FileInputStream(new File("c:/" + docName + ".docx"));
    XWPFDocument document = new XWPFDocument(in);

    // 2) Prepare XHTML options
    XHTMLOptions options = XHTMLOptions.create();

    // 3) Extract images into target/images/<docName>
    //    (the original concatenated the InputStream itself into the path)
    String root = "target";
    File imageFolder = new File(root + "/images/" + docName);
    options.setExtractor(new FileImageExtractor(imageFolder));

    // 4) URI resolver, so image links in the HTML point at the extracted files
    options.URIResolver(new FileURIResolver(imageFolder));

    // 5) Convert and write the HTML, then release both streams
    OutputStream out = new FileOutputStream(new File("c:/" + docName + ".html"));
    XHTMLConverter.getInstance().convert(document, out, options);
    out.close();
    in.close();

I hope this solves your issue.
