Java: how to convert .doc or .docx files to .txt

Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/2709923/

Time: 2020-10-29 22:27:26  Source: igfitidea

How to convert .doc or .docx files to .txt

java ms-word

Asked by Coding District

I'm wondering how you can convert Word .doc/.docx files to text files through Java. I understand that there's an option where I can do this through Word itself but I would like to be able to do something like this:

java DocConvert somedocfile.doc converted.txt

Thanks.

Answered by stakx - no longer contributing

If you're interested in a Java library that deals with Word document files, you might want to look at e.g. Apache POI. A quote from the website:

Why should I use Apache POI?

A major use of the Apache POI api is for Text Extraction applications such as web spiders, index builders, and content management systems.

P.S.: If, on the other hand, you're simply looking for a conversion utility, Stack Overflow may not be the most appropriate place to ask for this.

Edit: If you don't want to use an existing library but do all the hard work yourself, you'll be glad to hear that Microsoft has published the required file format specifications. (The Microsoft Open Specification Promise lists the available specifications. Just google for any of them that you're interested in. In your case, you'd need e.g. the OLE2 Compound File Format, the Word 97 binary file format, and the Open XML formats.)
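
To give a feel for the do-it-yourself route on the Open XML side, here is a minimal sketch that uses only the JDK. It relies on two assumptions that hold for ordinary .docx files but are not guaranteed by this answer: the main document part is stored at word/document.xml inside the ZIP container, and the text lives in w:t elements of the transitional WordprocessingML namespace. The class name DocxPlainText and its methods are hypothetical helpers, not part of any library. Headers, footers, footnotes and the binary .doc format are not handled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper (not part of POI or Tika): pulls plain text out of a .docx.
class DocxPlainText {
    // Assumption: the file uses the transitional WordprocessingML namespace.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Opens the .docx as a ZIP archive and reads the main document part.
    static String extract(Path docx) throws IOException, XMLStreamException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            // Assumption: the main part sits at the conventional location.
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found; is this a .docx?");
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return extractFromXml(in);
            }
        }
    }

    // Streams through the XML and collects the contents of w:t elements.
    static String extractFromXml(InputStream xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs needed
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        StringBuilder out = new StringBuilder();
        boolean inText = false;     // inside <w:t>
        boolean inTabStops = false; // inside <w:tabs>, which only defines tab stops
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && W_NS.equals(reader.getNamespaceURI())) {
                String name = reader.getLocalName();
                if (name.equals("t")) inText = true;
                else if (name.equals("tabs")) inTabStops = true;
                else if (name.equals("tab") && !inTabStops) out.append('\t');
                else if (name.equals("br")) out.append('\n');
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && W_NS.equals(reader.getNamespaceURI())) {
                String name = reader.getLocalName();
                if (name.equals("t")) inText = false;
                else if (name.equals("tabs")) inTabStops = false;
                else if (name.equals("p")) out.append('\n'); // one line per paragraph
            } else if (inText && (event == XMLStreamConstants.CHARACTERS
                    || event == XMLStreamConstants.CDATA)) {
                out.append(reader.getText());
            }
        }
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extract(Paths.get(args[0])));
    }
}
```

Run it as java DocxPlainText somedocfile.docx > converted.txt. For anything beyond simple documents, the libraries mentioned in the answers are likely the safer choice.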

Answered by palhares

Use the command-line utility Apache Tika. Tika supports a wide range of formats (e.g. doc, docx, pdf, html, rtf ...)

java -jar tika-app-1.3.jar -t somedocfile.doc > converted.txt

Programmatically:

File inputFile = ...;
Tika tika = new Tika();
String extractedText = tika.parseToString(inputFile);

You can use Apache POI too. They have a tool to extract text from doc/docx files (see "Text Extraction" on the POI site). If you want to extract only the text, you can use the code below. If you want to extract Rich Text (such as formatting and styling), you can use Apache Tika.

Extract doc:

InputStream fis = new FileInputStream(...);
POITextExtractor extractor;
// if docx
if (fileName.toLowerCase().endsWith(".docx")) {
    XWPFDocument doc = new XWPFDocument(fis);
    extractor = new XWPFWordExtractor(doc);
} else {
    // if doc
    POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
    extractor = ExtractorFactory.createExtractor(fileSystem);
}
String extractedText = extractor.getText();
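
Both snippets stop at a String, while the question asks for a converted.txt file. The last step can be sketched with the JDK alone. The class name TextFileWriter and the choice of UTF-8 are assumptions made for illustration; the sample string in main merely stands in for the value returned by tika.parseToString(...) or extractor.getText().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: writes already-extracted text to a .txt file.
class TextFileWriter {
    // Creates or overwrites the target file with the text encoded as UTF-8.
    static void writeText(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the text produced by Tika or POI.
        String extractedText = "text extracted from the Word document\n";
        Path target = Paths.get(args.length > 0 ? args[0] : "converted.txt");
        writeText(extractedText, target);
    }
}
```

Combined with either extraction snippet, a call such as TextFileWriter.writeText(extractedText, Paths.get("converted.txt")) would complete the conversion.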

Answered by bragboy

You should consider using this library: Apache POI.

Excerpt from the website

In short, you can read and write MS Excel files using Java. In addition, you can read and write MS Word and MS PowerPoint files using Java. Apache POI is your Java Excel solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate.

Answered by Paul Jowett

Docmosis can read a doc and spit out the text in it. It requires some infrastructure to be installed (such as OpenOffice). You can also use JODConverter.
