Python 处理 Word 文档的最佳方式

Question

提问by Mikhail

I receive word documents with specified formating corresponding to the data that is in them. For example, all headers have the exact same formating (Times New Roman-Font 14-Bold).

我收到的 Word 文档具有与其中的数据相对应的指定格式。例如，所有标题都具有完全相同的格式（Times New Roman-Font 14-Bold）。

What is the best way to process such MS Word documents (.doc or .docx) into xml documents? Language is not an issue (I'll use Lisp/Boost.Spirit if I have to!).

将此类 MS Word 文档（.doc 或 .docx）处理为 xml 文档的最佳方法是什么？语言不是问题（如果需要，我会使用 Lisp/Boost.Spirit！）。

Answer 1

采纳答案by Mikhail

Used a very inefficient conditional search in VBA to literally copy the document into a second document. The second document was then saved with a .xml extension. Got the job done, but its ugly.

在 VBA 中使用非常低效的条件搜索将文档复制到第二个文档中。然后使用 .xml 扩展名保存第二个文档。完成了工作，但它很丑陋。

Answer 2

回答by David Claridge

So I think you're saying that the structure of the document is encoded in the formatting, and you want to produce XML files that capture that structure, whilst keeping the content in plain text?

所以我认为您是说文档的结构以格式编码，并且您想要生成捕获该结构的 XML 文件，同时将内容保留为纯文本？

If that is so you will need to parse the documents, and build a data structure that can be processed, then dumped out as XML.

如果是这样，您将需要解析文档，并构建可以处理的数据结构，然后将其转储为 XML。

For parsing, there are a few options. Microsoft have publishedthe specifications for their binary .doc format, the reading of which will be essential to write a parser for it. In the case of .docx you're a little more lucky, as it's already in XML format, so you could use any XML parsing library to read in the file, then search through the resulting tree for the data you are interested in. XML parsers are available for pretty much any language, one easy to use one that comes to mind is MiniDomfor Python.

对于解析，有几个选项。Microsoft已经发布了其二进制 .doc 格式的规范，阅读这些规范对于为其编写解析器至关重要。在 .docx 的情况下，你更幸运一点，因为它已经是 XML 格式，所以你可以使用任何 XML 解析库来读取文件，然后在结果树中搜索你感兴趣的数据。 XML解析器几乎可用于任何语言，一种易于使用的语言是MiniDomfor Python。

For generating your output XML, again an object-representation to XML library seems to be the way to go, MiniDom for example, does that too.

为了生成您的输出 XML，再一次将对象表示到 XML 库似乎是要走的路，例如 MiniDom，也可以这样做。

If you don't want to deal with writing your own .doc parser, you could run the documents through a converter that produces are more accessible format first - such as using Word itself to convert the .doc files to .docx, or a tool that produces RDFs from .docs, or you could use an existing word parser such as the one in OpenOffice.

如果您不想编写自己的 .doc 解析器，则可以通过首先生成更易于访问的格式的转换器运行文档 - 例如使用 Word 本身将 .doc 文件转换为 .docx，或使用工具从 .docs 生成 RDF，或者您可以使用现有的单词解析器，例如 OpenOffice 中的单词解析器。

Answer 3

回答by n002213f

You can also try Java based Apache POI - HWPF. It supports text extraction. You will then have to create you own XML doc, Caster XMLor Xstreamcan help you on that issue.

您还可以尝试基于 Java 的Apache POI-HWPF。它支持文本提取。然后您必须创建您自己的 XML 文档，Caster XML或Xstream可以帮助您解决这个问题。

Answer 4

回答by Etienne

Take a look at the python-docxlibrary.

查看python-docx库。

Answer 5

回答by JasonPlutext

It really depends on exactly what you are trying to do.

这真的取决于你想要做什么。

The simplest approach would be to save the document as Flat OPC XML (in Word, "Save as.." XML), and then apply an XSLT.

最简单的方法是将文档保存为 Flat OPC XML（在 Word 中，“另存为..”XML），然后应用 XSLT。

This approach is simplest, since it gives you the entire docx as a single XML file, so you don't have to unzip it etc.

这种方法最简单，因为它将整个 docx 作为单个 XML 文件提供给您，因此您不必解压缩它等。

If your requirements are more complex, for example, analyzing the formatting or styles, or doing something with hyperlinks, then an object model such as docx4j (Java) or Open XML SDK (C#) - and no doubt there are others - may help.

如果您的需求更复杂，例如，分析格式或样式，或使用超链接执行某些操作，那么对象模型（例如 docx4j (Java) 或 Open XML SDK (C#) - 毫无疑问还有其他）可能会有所帮助。

Python 处理 Word 文档的最佳方式

提问by Mikhail

采纳答案by Mikhail

回答by David Claridge

回答by n002213f

回答by Etienne

回答by JasonPlutext

相关推荐

最近更新

标签

Python 处理 Word 文档的最佳方式

提问by Mikhail

采纳答案by Mikhail

回答by David Claridge

回答by n002213f

回答by Etienne

回答by JasonPlutext

相关推荐

如何使用另一个python脚本文件中的参数执行python脚本文件

Python - NumPy - 元组作为数组的元素

如何使用 Python 将“NULL”值插入到 PostgreSQL 数据库中？

python3.x 中的 StringType 和 NoneType

相关推荐

最近更新

标签