Original question: http://stackoverflow.com/questions/11932163/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How to convert Doc/Docx into a single XML file automatically?
Asked by samxli
Word lets you save a document in the Word Open XML format. I've seen posts about opening the docx file as a zip archive and extracting its contents from there. But what I really want is a way to turn the docx into a single XML file, exactly like the "Save As" action in MS Office does. How can I do that?
And how can I do this for the .doc format?
Note: I would like to do this programmatically. Preferably under Linux development conditions with PHP. But if that's not available, then other languages will do. Lastly, if it comes down to it, I can consider spinning up a Windows server to do this.
Answered by Pierre François
Sorry to resuscitate a dead thread, but I just found an answer for DOCX files. A DOCX file is just a ZIP archive of XML files. So to extract the contents of one of its files, e.g. word/document.xml, under a Linux environment, you can run unzip:
unzip -q -c myfile.docx word/document.xml
To capture the output of this command in the $xml variable of a PHP script, you can use:
$xml = shell_exec ("unzip -q -c myfile.docx word/document.xml");
Hoping this answer helps for DOCX files. Better late than never.
For DOC files, this method does not work, because the legacy .doc format is a binary file rather than a ZIP archive.
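For anyone who would rather not shell out to unzip, the same extraction can be sketched with Python's standard zipfile module. This is only an illustration, not part of the original answer: the helper name read_docx_part is made up here, and it assumes the part is UTF-8 encoded. It also shows why the approach stops at .doc files, since those are not ZIP archives at all.

```python
import zipfile


def read_docx_part(docx_path, part_name="word/document.xml"):
    """Return one part of a .docx package as text.

    Hypothetical helper for illustration; assumes the part is UTF-8 encoded.
    """
    if not zipfile.is_zipfile(docx_path):
        # Legacy .doc files are binary, not ZIP archives, so they end up here.
        raise ValueError(f"{docx_path} is not a ZIP-based package")
    with zipfile.ZipFile(docx_path) as package:
        return package.read(part_name).decode("utf-8")
```

Calling read_docx_part("myfile.docx") would return the same text as the unzip command above. Note that this is still only one part of the package, not the single-file XML that Word's "Save As" produces.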
Answered by JasonPlutext
Eric White explains how to do this for docx in C# in his post "Transforming Open XML Documents to Flat OPC Format" (transforming-open-xml-documents-to-flat-opc-format).
You can also do it using docx4j (which I work on), the 'j' being Java.
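To give a feel for what such a conversion involves, here is a rough Python sketch of the general Flat OPC idea: every part of the ZIP package becomes a pkg:part element inside one pkg:package document, with XML parts embedded directly and everything else base64-encoded. It is neither Eric White's code nor the docx4j API. The function names are invented for this example, parts are assumed to be UTF-8, and optional attributes that Word itself writes (such as pkg:padding and pkg:compression) are left out, so treat it as a starting point rather than a faithful reproduction of Word's output.

```python
import base64
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def load_content_types(package):
    """Parse [Content_Types].xml into per-extension and per-part lookups."""
    root = ET.fromstring(package.read("[Content_Types].xml"))
    defaults = {el.get("Extension").lower(): el.get("ContentType")
                for el in root.findall(f"{{{CT_NS}}}Default")}
    overrides = {el.get("PartName"): el.get("ContentType")
                 for el in root.findall(f"{{{CT_NS}}}Override")}
    return defaults, overrides


def strip_xml_declaration(text):
    """Drop a leading <?xml ...?> so the part can be nested in another document."""
    text = text.lstrip()
    if text.startswith("<?xml"):
        text = text[text.index("?>") + 2:]
    return text.strip()


def docx_to_flat_opc(docx_path):
    """Return the package at docx_path as one Flat OPC XML string (a sketch)."""
    pieces = [
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>',
        '<?mso-application progid="Word.Document"?>',
        f'<pkg:package xmlns:pkg="{PKG_NS}">',
    ]
    with zipfile.ZipFile(docx_path) as package:
        defaults, overrides = load_content_types(package)
        for name in package.namelist():
            if name == "[Content_Types].xml" or name.endswith("/"):
                continue  # the content-type map is folded into each pkg:part
            part_name = "/" + name
            extension = name.rsplit(".", 1)[-1].lower()
            content_type = overrides.get(part_name) or defaults.get(
                extension, "application/octet-stream")
            data = package.read(name)
            pieces.append(f"<pkg:part pkg:name={quoteattr(part_name)} "
                          f"pkg:contentType={quoteattr(content_type)}>")
            if content_type.endswith("xml"):
                # Assumption: XML parts are UTF-8 (utf-8-sig also drops a BOM).
                xml_text = strip_xml_declaration(data.decode("utf-8-sig"))
                pieces.append(f"<pkg:xmlData>{xml_text}</pkg:xmlData>")
            else:
                # Binary parts (images, fonts) are base64-encoded in 76-char lines.
                encoded = base64.encodebytes(data).decode("ascii")
                pieces.append(f"<pkg:binaryData>{encoded}</pkg:binaryData>")
            pieces.append("</pkg:part>")
    pieces.append("</pkg:package>")
    return "\n".join(pieces)
```

The parts are spliced in as raw text instead of being re-serialized through an XML library, because re-serialization can rename namespace prefixes that attributes such as mc:Ignorable refer to by name. A typical use would be to write the string returned by docx_to_flat_opc("myfile.docx") to a file with an .xml extension.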
Answered by JohnZaj
In Word: File | Save As | Word XML Document (*.xml) gives you the Open XML format you want, as a single XML file.
In code using Interop: use the Document object's SaveAs method, with WdSaveFormat.wdFormatXMLDocument as the save format. You should also use the Document.Convert method to update the compatibility to the MS Office version installed. Note that in the WdSaveFormat enumeration, wdFormatXMLDocument is the regular .docx package format, while wdFormatFlatXML is the member documented as "Open XML file format saved as a single XML file", so the latter is probably the one to try if the output has to be a single .xml file.
So, not necessarily a complete demo, but this should give you the right idea:
// Assumes ActiveDocument is an open Microsoft.Office.Interop.Word.Document.
ActiveDocument.Convert(); // update compatibility to the installed Office version
WdSaveFormat myNewSaveFormat = WdSaveFormat.wdFormatXMLDocument;
// newFilePath specifies the new file name and extension (docx)
ActiveDocument.SaveAs(newFilePath, myNewSaveFormat);

