Python 处理 Word 文档的最佳方式
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/4262903/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Best Way to Process a Word Document
提问by Mikhail
I receive word documents with specified formating corresponding to the data that is in them. For example, all headers have the exact same formating (Times New Roman-Font 14-Bold).
我收到的 Word 文档具有与其中的数据相对应的指定格式。例如,所有标题都具有完全相同的格式(Times New Roman-Font 14-Bold)。
What is the best way to process such MS Word documents (.doc or .docx) into xml documents? Language is not an issue (I'll use Lisp/Boost.Spirit if I have to!).
将此类 MS Word 文档(.doc 或 .docx)处理为 xml 文档的最佳方法是什么?语言不是问题(如果需要,我会使用 Lisp/Boost.Spirit!)。
采纳答案by Mikhail
Used a very inefficient conditional search in VBA to literally copy the document into a second document. The second document was then saved with a .xml extension. Got the job done, but its ugly.
在 VBA 中使用非常低效的条件搜索将文档复制到第二个文档中。然后使用 .xml 扩展名保存第二个文档。完成了工作,但它很丑陋。
回答by David Claridge
So I think you're saying that the structure of the document is encoded in the formatting, and you want to produce XML files that capture that structure, whilst keeping the content in plain text?
所以我认为您是说文档的结构以格式编码,并且您想要生成捕获该结构的 XML 文件,同时将内容保留为纯文本?
If that is so you will need to parse the documents, and build a data structure that can be processed, then dumped out as XML.
如果是这样,您将需要解析文档,并构建可以处理的数据结构,然后将其转储为 XML。
For parsing, there are a few options. Microsoft have publishedthe specifications for their binary .doc format, the reading of which will be essential to write a parser for it. In the case of .docx you're a little more lucky, as it's already in XML format, so you could use any XML parsing library to read in the file, then search through the resulting tree for the data you are interested in. XML parsers are available for pretty much any language, one easy to use one that comes to mind is MiniDomfor Python.
对于解析,有几个选项。Microsoft已经发布了其二进制 .doc 格式的规范,阅读这些规范对于为其编写解析器至关重要。在 .docx 的情况下,你更幸运一点,因为它已经是 XML 格式,所以你可以使用任何 XML 解析库来读取文件,然后在结果树中搜索你感兴趣的数据。 XML解析器几乎可用于任何语言,一种易于使用的语言是MiniDomfor Python。
For generating your output XML, again an object-representation to XML library seems to be the way to go, MiniDom for example, does that too.
为了生成您的输出 XML,再一次将对象表示到 XML 库似乎是要走的路,例如 MiniDom,也可以这样做。
If you don't want to deal with writing your own .doc parser, you could run the documents through a converter that produces are more accessible format first - such as using Word itself to convert the .doc files to .docx, or a tool that produces RDFs from .docs, or you could use an existing word parser such as the one in OpenOffice.
如果您不想编写自己的 .doc 解析器,则可以通过首先生成更易于访问的格式的转换器运行文档 - 例如使用 Word 本身将 .doc 文件转换为 .docx,或使用工具从 .docs 生成 RDF,或者您可以使用现有的单词解析器,例如 OpenOffice 中的单词解析器。
回答by n002213f
You can also try Java based Apache POI - HWPF. It supports text extraction. You will then have to create you own XML doc, Caster XMLor Xstreamcan help you on that issue.
您还可以尝试基于 Java 的Apache POI-HWPF。它支持文本提取。然后您必须创建您自己的 XML 文档,Caster XML或Xstream可以帮助您解决这个问题。
回答by Etienne
Take a look at the python-docxlibrary.
查看python-docx库。
回答by JasonPlutext
It really depends on exactly what you are trying to do.
这真的取决于你想要做什么。
The simplest approach would be to save the document as Flat OPC XML (in Word, "Save as.." XML), and then apply an XSLT.
最简单的方法是将文档保存为 Flat OPC XML(在 Word 中,“另存为..”XML),然后应用 XSLT。
This approach is simplest, since it gives you the entire docx as a single XML file, so you don't have to unzip it etc.
这种方法最简单,因为它将整个 docx 作为单个 XML 文件提供给您,因此您不必解压缩它等。
If your requirements are more complex, for example, analyzing the formatting or styles, or doing something with hyperlinks, then an object model such as docx4j (Java) or Open XML SDK (C#) - and no doubt there are others - may help.
如果您的需求更复杂,例如,分析格式或样式,或使用超链接执行某些操作,那么对象模型(例如 docx4j (Java) 或 Open XML SDK (C#) - 毫无疑问还有其他)可能会有所帮助。

