Original question: http://stackoverflow.com/questions/685533/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Python: convert Microsoft Office docs to plain text on Linux
Asked by Tim
Any recommendations on a method to convert .doc, .ppt, and .xls to plain text on Linux using Python? Really, any method of conversion would be useful. I have already looked at using Open Office, but I would like a solution that does not require installing Open Office.
Accepted answer by ChristopheD
I'd go for the command-line solution (and then use the Python subprocess module to run the tools from Python).
Converters for MS Word (catdoc), Excel (xls2csv), and PowerPoint (catppt) are available in source form here: http://vitus.wagner.pp.ru/software/catdoc/
I can't really comment on the usefulness of catppt, but catdoc and xls2csv work great!
But be sure to search your distribution's repositories first. On Ubuntu, for example, catdoc is just one quick apt-get away.
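A minimal sketch of the subprocess approach, assuming catdoc, xls2csv, and catppt are installed and on the PATH. The names CONVERTERS, build_command, and office_to_text are made up for illustration, and decoding the output as UTF-8 is an assumption (the tools may emit a different charset depending on locale and options):

```python
import os
import subprocess

# Extension -> converter from the catdoc package named in the answer.
CONVERTERS = {
    ".doc": "catdoc",
    ".xls": "xls2csv",
    ".ppt": "catppt",
}


def build_command(path):
    """Return the argument list for the converter matching the file extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        tool = CONVERTERS[ext]
    except KeyError:
        raise ValueError("no converter known for %r" % ext)
    return [tool, path]


def office_to_text(path):
    """Run the converter and return whatever it writes to stdout as text."""
    # Passing a list (no shell) avoids quoting problems with odd file names.
    output = subprocess.check_output(build_command(path))
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return output.decode("utf-8", errors="replace")
```

Calling office_to_text("report.doc") would run catdoc on that file and hand back its output as a string; a missing tool or a failed conversion surfaces as an exception from subprocess.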
Answer by vartec
You can access OpenOffice via its Python API.
Try using this as a base: http://wiki.services.openoffice.org/wiki/Odt2txt.py
Answer by emk
The usual tool for converting Microsoft Office documents to HTML or other formats was mswordview, which has since been renamed to wvWare.
If you're looking for a command-line tool, the wvWare developers actually recommend using AbiWord to perform the conversion:
AbiWord --to=txt
If you're looking for a library, start on the wvWare overview page. They also maintain a list of libraries and tools which read MS Office documents.
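If you want to drive the AbiWord conversion from Python, something like the sketch below might work. It only uses the --to=txt option quoted above. The lowercase binary name abiword, the helper names, and the idea that the result lands next to the input with a .txt extension are all assumptions you should check against your installed version:

```python
import os
import subprocess


def abiword_command(path, binary="abiword"):
    """Build the argument list; 'abiword' as the binary name is an assumption."""
    return [binary, "--to=txt", path]


def expected_output(path):
    """Assumed output location: same directory and base name, .txt extension."""
    return os.path.splitext(path)[0] + ".txt"


def abiword_to_text(path, binary="abiword"):
    """Run the conversion and read the file it is assumed to produce."""
    subprocess.check_call(abiword_command(path, binary))
    # Assumption: the converted file is UTF-8; undecodable bytes are replaced.
    with open(expected_output(path), "rb") as handle:
        return handle.read().decode("utf-8", errors="replace")
```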
Answer by Telemachus
Answer by neves
Same problem here. Below is my simple script to convert all doc files in dir 'docs/' to dir 'txts/' using catdoc. Hope it will help someone:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob, re, os

# Collect Word documents regardless of extension case
f = glob.glob('docs/*.doc') + glob.glob('docs/*.DOC')

outDir = 'txts'
if not os.path.exists(outDir):
    os.makedirs(outDir)

for i in f:
    # docs/name.doc -> txts/name.txt; catdoc -w disables word wrapping
    os.system("catdoc -w '%s' > '%s'" %
              (i, outDir + '/' + re.sub(r'.*/([^.]+)\.doc', r'\1.txt', i,
                                        flags=re.IGNORECASE)))
Answer by Dave Webb
For dealing with Excel spreadsheets, xlwt is good. But it won't help with .doc and .ppt files.
(You may have also heard of PyExcelerator. xlwt is a fork of it and is better maintained, so I think you'd be better off with xlwt.)
Answer by D.Shawley
In the past, I've had some success using XSLT to process the XML-based office files into something usable. It's not necessarily a Python-based solution, but it does get the job done.
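To show why the XML-based formats are easier to handle, here is a rough sketch that pulls text out of a .docx file using only the Python standard library. Note that this is not XSLT; it walks the XML directly with ElementTree instead. The function name docx_to_text is made up for illustration, and the sketch assumes the main body lives in word/document.xml inside the zip archive and ignores headers, footnotes, and similar parts:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    # A .docx file is a zip archive; the main body is word/document.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t elements.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

This does nothing for the older binary .doc, .xls, and .ppt files from the question; those still need one of the converters mentioned in the other answers.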