Linux: extracting text from MS Word files in Python
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/125222/
extracting text from MS word files in python
Asked by Badri
For working with MS Word files in Python, there are the Python win32 extensions, which can be used on Windows. How do I do the same on Linux? Is there any library?
Accepted answer by John Fouhy
You could make a subprocess call to antiword. Antiword is a linux commandline utility for dumping text out of a word doc. Works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as RPM, or you could compile it yourself.
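A minimal sketch of the subprocess call described above, assuming antiword is installed and on the PATH. The function name extract_doc_text and its command parameter are hypothetical, introduced here only for illustration; antiword is simply given the file name and its standard output is captured.

```python
import subprocess


def extract_doc_text(doc_path, command=("antiword",)):
    """Return the plain text that `command` prints for the file at doc_path.

    `command` defaults to antiword; it is a parameter (an assumption of this
    sketch) only so that another text-dumping tool can be substituted.
    """
    result = subprocess.run(
        [*command, doc_path],  # argument list, so no shell quoting issues
        capture_output=True,   # collect stdout and stderr
        text=True,             # decode the output to str
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling extract_doc_text('report.doc') would then return the dumped text as one string; as the answer notes, formatting is lost.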
Answer by Swati
Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:
However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.
Answer by William Keller
I'm not sure if you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!
@Swati, that's in HTML, which is fine and dandy, but most Word documents aren't so nice!
Answer by Dan Lenski
Answer by David
I know this is an old question, but I was recently trying to find a way to extract text from MS word files, and the best solution by far I found was with wvLib:
http://wvware.sourceforge.net/
After installing the library, using it in Python is pretty easy:
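import commands  # Python 2 only; this module was removed in Python 3

# word_file and output_txt_file are the input and output paths
exe = 'wvText ' + word_file + ' ' + output_txt_file
out = commands.getoutput(exe)
exe = 'cat ' + output_txt_file
out = commands.getoutput(exe)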
And that's it. Essentially, we use the commands.getoutput function to run a couple of shell commands: wvText, which extracts the text from a Word document into a file, and cat, which reads that output file. After that, the entire text of the Word document will be in the out variable, ready to use.
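On Python 3, where the commands module no longer exists, the same two steps could be sketched with subprocess. This is an illustration under assumptions, not the answer's own code: the function name wv_extract_text and the command parameter are hypothetical, and wvText is assumed to be installed and to take the input and output paths as its two arguments, as in the snippet above.

```python
import subprocess
from pathlib import Path


def wv_extract_text(word_file, output_txt_file, command=("wvText",)):
    """Run wvText on word_file and return the text written to output_txt_file."""
    # An argument list (rather than a concatenated string) avoids shell
    # quoting problems with file names that contain spaces.
    subprocess.run([*command, str(word_file), str(output_txt_file)], check=True)
    # Read the result directly instead of shelling out to cat.
    return Path(output_txt_file).read_text(errors="replace")
```

Reading the output file from Python replaces the second shell call, and check=True surfaces a failing wvText run as an exception instead of silently returning stale text.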
Hopefully this will help anyone having similar issues in the future.
Answer by Ben Williams
(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)
Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:
unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
So that's:
unzip -p file.docx: -p == "unzip to stdout"
grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)
sed 's/<[^<]*>//g': Remove everything inside tags
grep -v '^[[:space:]]*$': Remove blank lines
There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)
Answer by benjamin
If your intention is to use purely Python modules without calling a subprocess, you can use the zipfile Python module.
import zipfile

content = ""
# Load the DocX into a zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# List the members of the archive
unpacked = docx.infolist()
# Find the word/document.xml file in the package and assign it to a variable
for item in unpacked:
    if item.orig_filename == 'word/document.xml':
        # read() returns bytes, so decode to get a string
        content = docx.read(item.orig_filename).decode('utf-8')
Your content string, however, needs to be cleaned up. One way of doing this is:
# Clean the content string of XML tags for better searching
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
    if '>' in item:
        bad_good = item.split('>')
        # Keep only the text that follows the closing '>' of a tag
        if bad_good[-1] != '':
            fullyclean.append(bad_good[-1])
# Assemble a new string with all pure content
content = " ".join(fullyclean)
But there is surely a more elegant way to clean up the string, probably using the re module. Hope this helps.
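One possibly tidier alternative, swapping in the standard library's XML parser rather than the re module the answer suggests, is sketched below. The function name docx_paragraphs and the constant W_NS are hypothetical names for this illustration; the sketch assumes the body text lives in w:t elements inside w:p paragraphs of word/document.xml, and it makes no attempt to handle headers, footnotes or nested text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each non-empty paragraph (w:p) in a .docx file."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more w:t elements
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Something like "\n\n".join(docx_paragraphs('/home/whateverdocument.docx')) would then give the document text with paragraph breaks preserved, without any manual tag stripping.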
Answer by Chad
Answer by mikemaccana
Use the native Python docx module. Here's how to extract all the text from a doc:
import docx

# filename is the path to your .docx file
document = docx.Document(filename)
docText = '\n\n'.join(
    paragraph.text for paragraph in document.paragraphs
)
print(docText)
See Python DocX site
Also check out Textract, which pulls out tables etc.
Parsing XML with regexes invokes Cthulhu. Don't do it!
Answer by fccoelho
Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv