Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/173246/

Time: 2020-11-03 19:36:24  Source: igfitidea

Parsing and generating Microsoft Office 2007 files (.docx, .xlsx, .pptx)

php python perl parsing office-2007

Asked by DV.

I have a web project where I must import text and images from a user-supplied document, and one of the possible formats is Microsoft Office 2007. There's also a need to generate documents in this format.

The server runs CentOS 5.2 and has PHP/Perl/Python installed. I can execute local binaries and shell scripts if I must. We use Apache 2.2 but will be switching over to Nginx once it goes live.

What are my options? Anyone had experience with this?

Answered by 1800 INFORMATION

The Office 2007 file formats are open and well documented. Roughly speaking, all of the new file formats ending in "x" are zip compressed XML documents. For example:

To open a Word 2007 XML file:

1. Create a temporary folder in which to store the file and its parts.
2. Save a Word 2007 document, containing text, pictures, and other elements, as a .docx file.
3. Add a .zip extension to the end of the file name.
4. Double-click the file. It will open in the ZIP application. You can see the parts that comprise the file.
5. Extract the parts to the folder that you created previously.
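The manual steps above can also be done from a script. The following is a minimal sketch in Python using only the standard library; it is not from the original answer. The function names are made up for illustration, and it assumes the main text lives in the word/document.xml part and that embedded pictures sit under word/media/, which is where Word conventionally puts them but is not guaranteed for every file.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_P = "{" + W_NS + "}p"  # paragraph element
W_T = "{" + W_NS + "}t"  # text element inside a run


def list_parts(path):
    """Return the names of all parts stored in the Office 2007 package."""
    with zipfile.ZipFile(path) as package:
        return package.namelist()


def image_parts(path):
    """Return part names under word/media/ (assumed location of pictures)."""
    return [name for name in list_parts(path) if name.startswith("word/media/")]


def extract_docx_text(path):
    """Return the text of each paragraph in a .docx file as a list of strings."""
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_P):
        # A paragraph's visible text may be split across several w:t elements
        pieces = [node.text or "" for node in para.iter(W_T)]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Calling extract_docx_text("report.docx") on a hypothetical upload would return one string per paragraph; the bytes of each picture can then be read with zipfile.ZipFile(path).read(name) for every name returned by image_parts. This ignores tables, headers, footnotes, and formatting, so treat it as a starting point only.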

The other file formats are roughly similar. I don't know of any open source libraries for interacting with them as yet - but depending on your exact requirements, it doesn't look too difficult to read and write simple documents. Certainly it should be a lot easier than with the older formats.
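To give a feel for how little is needed to write a simple document, here is a rough Python sketch that builds a plain-paragraph .docx by hand. It is not from the original answer. The function name write_simple_docx is hypothetical, and the sketch assumes that three parts (the content-types list, the root relationships file, and word/document.xml) are enough for a word processor to open the result; real documents usually carry styles, settings, and more.

```python
import zipfile
from xml.sax.saxutils import escape

# Declares the content type of each part in the package
CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

# Points the package at its main document part
ROOT_RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""

DOCUMENT_TEMPLATE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>%s</w:body>
</w:document>"""


def write_simple_docx(path, paragraphs):
    """Write each string in `paragraphs` as one unformatted paragraph."""
    body = "".join(
        # escape() protects &, < and > so user text cannot break the XML
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(text)
        for text in paragraphs
    )
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", DOCUMENT_TEMPLATE % body)
```

For example, write_simple_docx("out.docx", ["First paragraph", "Second paragraph"]) produces a ZIP package with those three parts. Anything beyond plain paragraphs (images, tables, styles) needs additional parts and relationships, which is where a library starts to pay off.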

If you need to read the older formats, OpenOffice has an API and can read and write Office 2003 and older documents with more or less success.

Answered by mikemaccana

The Python docx module can generate formatted Microsoft Office docx files from pure Python. Out of the box, it does headings, paragraphs, tables, and bullets, but the makeelement() function can be extended to do arbitrary elements like images.

from docx import *

# Create an empty document tree
document = newdocument()

# This location is where most document content lives
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

# Append two headings and a paragraph
docbody.append(heading('Heading', 1))
docbody.append(heading('Subheading', 2))
docbody.append(paragraph('Some text'))

Answered by Hafthor

I have successfully used the OpenXML Format SDK in a project to modify an Excel spreadsheet via code. This would require .NET, and I'm not sure how well it would work under Mono.

Answered by Darryl Hein

You can probably check the code for Sphider. It indexes docs and PDFs, so I'm sure it can read them. It might also lead you in the right direction for other Office formats.
