How do I use Apache POI to read a .DOC file in Java to separate images from text?
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/597566/
Question
I need to read a Word .doc file from Java that has text and images. I need to recognize the images & text and separate them into 2 files.
I've recently heard about "Apache POI." How can I use Apache POI to read Word .doc files?
Answer
The examples and sample code on Apache's site are pretty good. I recommend you start there.
http://poi.apache.org/hwpf/quick-guide.html
To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get the text and other properties of each paragraph.
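As a rough sketch of the steps just described (not code from the quick guide), something like the following should work, assuming the poi and poi-scratchpad jars are on the classpath. The class name DocTextReader, the clean helper and the "input.doc" path are made up for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public class DocTextReader {

    /** Returns the text of each paragraph in the .doc file, in document order. */
    public static List<String> readParagraphs(String docPath) throws IOException {
        try (InputStream in = new FileInputStream(docPath)) {
            HWPFDocument doc = new HWPFDocument(in);
            Range range = doc.getRange(); // covers the whole document
            List<String> paragraphs = new ArrayList<>();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph paragraph = range.getParagraph(i);
                paragraphs.add(clean(paragraph.text()));
            }
            return paragraphs;
        }
    }

    /**
     * Hypothetical helper: drops the control characters Word keeps in the raw
     * text (paragraph marks, cell marks, picture placeholders) but keeps tabs
     * and line feeds.
     */
    public static String clean(String raw) {
        return raw.replaceAll("[\\x00-\\x08\\x0B-\\x1F]", "").trim();
    }

    public static void main(String[] args) throws IOException {
        // "input.doc" is a placeholder path, not something from the question.
        String path = args.length > 0 ? args[0] : "input.doc";
        for (String paragraph : readParagraphs(path)) {
            System.out.println(paragraph);
        }
    }
}
```

Writing the returned list to a file with java.nio.file.Files.write would give you the "text" half of the split.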
Here for an example of extracting an image. Here for the latest revision as of this writing.
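In case the linked example is not available, here is a minimal sketch of pulling the embedded pictures out of a .doc file through HWPF's pictures table. It assumes the poi and poi-scratchpad jars are on the classpath. The class name DocImageExtractor, the fileName helper and the "image-N" naming scheme are my own choices, not part of POI.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Picture;

public class DocImageExtractor {

    /** Writes every embedded picture to outDir and returns how many were written. */
    public static int extractImages(String docPath, String outDir) throws IOException {
        Path dir = Files.createDirectories(Paths.get(outDir));
        try (InputStream in = new FileInputStream(docPath)) {
            HWPFDocument doc = new HWPFDocument(in);
            List<Picture> pictures = doc.getPicturesTable().getAllPictures();
            int index = 0;
            for (Picture picture : pictures) {
                Path target = dir.resolve(fileName(index, picture.suggestFileExtension()));
                try (OutputStream out = Files.newOutputStream(target)) {
                    picture.writeImageContent(out); // raw image bytes as stored in the document
                }
                index++;
            }
            return index;
        }
    }

    /** Hypothetical naming scheme: image-0.png, image-1.jpg, ... with "bin" as a fallback. */
    public static String fileName(int index, String extension) {
        String ext = (extension == null || extension.isEmpty()) ? "bin" : extension;
        return "image-" + index + "." + ext;
    }

    public static void main(String[] args) throws IOException {
        // Both paths are placeholders for illustration.
        int count = extractImages("input.doc", "images");
        System.out.println("Wrote " + count + " image(s)");
    }
}
```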
And of course, the Javadocs.
Note that, according to the POI site,
HWPF is still in early development.
Answer by banjollity
It's not free (or even cheap!) but Aspose.Words should be able to do this. Their evaluation download will let you play with small files.
Do the destination files also have to be Docs? You could open the docs in Office and save them out as HTML. Then the separation becomes trivial. RTF is also a viable option, but I can't recommend a good RTF parser off the top of my head.
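To illustrate why the HTML route makes the split easy, here is a small sketch that uses only the Java standard library: it collects the img sources and strips the tags to leave the text. The class name HtmlSplitter and the sample markup are invented for the example. The regular expressions are deliberately crude (no entity decoding, no handling of style or script blocks), so a real HTML parser would be a safer choice for the markup Word actually produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlSplitter {

    // Captures the value of the src attribute of each <img> tag.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Matches any tag; used to strip markup and keep the text.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the src value of every img tag, in document order. */
    public static List<String> imageSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            sources.add(matcher.group(1));
        }
        return sources;
    }

    /** Replaces every tag with a space, then collapses runs of whitespace. */
    public static String plainText(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Sample markup invented for this example.
        String html = "<p>Hello <b>world</b></p>"
                + "<img src=\"doc_files/image001.png\"><p>Bye</p>";
        System.out.println(imageSources(html));
        System.out.println(plainText(html));
    }
}
```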
Edit to say: I just remembered another possible solution: Jacob, but you'll need an instance of Office running on the same machine. It's short for Java COM Bridge and it lets you make calls to the COM libraries in Office to manipulate the documents. I'm sure it's not as scary as it might sound!