Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/685533/

Time: 2020-11-03 20:37:44  Source: igfitidea

python convert microsoft office docs to plain text on linux

Tags: python, linux, ms-office

Asked by Tim

Any recommendations on a method to convert .doc, .ppt, and .xls to plain text on Linux using Python? Really, any method of conversion would be useful. I have already looked at using Open Office, but I would like a solution that does not require installing Open Office.

Accepted answer by ChristopheD

I'd go for the command-line solution (and then use the Python subprocess module to run the tools from Python).

Converters for MS Word (catdoc), Excel (xls2csv) and PowerPoint (catppt) can be found (in source form) here: http://vitus.wagner.pp.ru/software/catdoc/.

I can't really comment on the usefulness of catppt, but catdoc and xls2csv work great!

But be sure to search your distribution's repositories first. On Ubuntu, for example, catdoc is just one quick apt-get away.
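
For what it's worth, here is a minimal sketch of what "run the tools from Python with subprocess" could look like. The extension-to-tool table and the function names `build_command` and `office_to_text` are my own illustration, not part of catdoc. It assumes the tools are on PATH and that their output can be decoded as UTF-8, which depends on your locale and the tools' charset options.

```python
import shutil
import subprocess
from pathlib import Path

# Extension -> converter from the catdoc package (tool names as given in the answer).
CONVERTERS = {".doc": "catdoc", ".xls": "xls2csv", ".ppt": "catppt"}


def build_command(path):
    """Return the argv list for converting `path`; raise ValueError if unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError("no converter known for %r" % suffix)
    return [CONVERTERS[suffix], str(path)]


def office_to_text(path):
    """Run the matching converter and return what it printed to standard output."""
    cmd = build_command(path)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("%s is not installed or not on PATH" % cmd[0])
    # A list argv avoids shell quoting problems with odd file names.
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, check=True)
    # Assumption: UTF-8 output; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `office_to_text("report.doc")` would then return the document text as a string, or raise if catdoc is missing or exits with an error.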

Answer by vartec

Answer by emk

The usual tool for converting Microsoft Office documents to HTML or other formats was mswordview, which has since been renamed to wvWare.

If you're looking for a command-line tool, they actually recommend using AbiWord to perform the conversion:

AbiWord --to=txt

If you're looking for a library, start on the wvWare overview page. They also maintain a list of libraries and tools which read MS Office documents.

Answer by Telemachus

At the command line, antiword or wv work very nicely for .doc files. (Not a Python solution, but they're easy to install and fast.)
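
If you still want to drive these from Python, one possible sketch is to use whichever tool happens to be installed. The helper names `pick_tool` and `doc_to_text` are hypothetical. I'm assuming each listed tool prints the text to standard output when given a .doc path (true of antiword and catdoc as far as I know), and that the output decodes as UTF-8.

```python
import shutil
import subprocess

# Tools assumed to print a .doc file's text to stdout when given its path.
DOC_TOOLS = ("antiword", "catdoc")


def pick_tool(candidates=DOC_TOOLS, which=shutil.which):
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


def doc_to_text(path):
    """Convert a .doc file to text with the first available tool."""
    tool = pick_tool()
    if tool is None:
        raise RuntimeError("none of %s is installed" % ", ".join(DOC_TOOLS))
    out = subprocess.run([tool, path], stdout=subprocess.PIPE, check=True).stdout
    # Assumption: UTF-8 output; adjust to your locale if needed.
    return out.decode("utf-8", errors="replace")
```

The `which` parameter is only there so the lookup can be swapped out, for example when checking the fallback order without having the tools installed.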

Answer by neves

Same problem here. Below is my simple script to convert all doc files in dir 'docs/' to dir 'txts/' using catdoc. Hope it will help someone:

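#!/usr/bin/env python
# -*- coding: utf-8 -*-

import glob, re, os

# Collect Word files with a lower- or upper-case extension.
f = glob.glob('docs/*.doc') + glob.glob('docs/*.DOC')

outDir = 'txts'
if not os.path.exists(outDir):
    os.makedirs(outDir)
for i in f:
    # catdoc -w disables line wrapping; \1 keeps the input's base name
    # so that docs/foo.doc becomes txts/foo.txt.
    os.system("catdoc -w '%s' > '%s'" %
              (i, outDir + '/' + re.sub(r'.*/([^.]+)\.doc', r'\1.txt', i,
                                   flags=re.IGNORECASE)))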

Answer by Dave Webb

For dealing with Excel spreadsheets, xlwt is good. But it won't help with .doc and .ppt files.

(You may have also heard of PyExcelerator. xlwt is a fork of it and is better maintained, so I think you'd be better off with xlwt.)

Answer by D.Shawley

I've had some success at using XSLT to process the XML-based office files into something usable in the past. It's not necessarily a python-based solution, but it does get the job done.
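
To give a rough idea of why the XML-based formats are easy to process, here is a sketch that skips XSLT altogether and walks the XML with Python's standard library. It assumes the file is a .docx (a zip archive whose main body lives in `word/document.xml`); the function name `docx_to_text` is made up for illustration, and it ignores headers, footnotes and other parts of the package.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a .docx (path or file-like), one paragraph per line."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        # A paragraph's text is split across runs; join every <w:t> inside it.
        paragraphs.append("".join(node.text or "" for node in para.iter(W + "t")))
    return "\n".join(paragraphs)
```

This does nothing for the old binary .doc, .ppt and .xls files from the question; it only applies once the documents are in an XML-based format.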