Extracting text from MS Word files in Python

Date: 2020-03-06 14:38:26  Source: igfitidea

For working with MS Word files in Python, there are the Python win32 extensions, which can be used on Windows. How do I do the same on Linux?
Is there a library for this?

Answers

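Take a look at how the doc format works and at creating Word documents with PHP on Linux. The former is especially useful. AbiWord is my recommended tool. There are limitations, though: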

However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.

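I'm not sure you'll have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!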

@Swati: that one is written in HTML, which is fine and dandy, but most Word documents aren't so nice!

OpenOffice.org can be scripted with Python: see here.

Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.
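As a rough illustration of letting the office suite do the loading, here is a minimal sketch that drives a headless conversion from Python. Note that this swaps in the command-line converter of LibreOffice (the successor to OOo) rather than the Python scripting interface the answer refers to. It assumes a `soffice` executable is on the PATH; the helper names `build_convert_command` and `convert_to_text` are made up for this example.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build a headless LibreOffice command that converts doc_path to text."""
    return [
        soffice,
        "--headless",                # run without a GUI
        "--convert-to", "txt:Text",  # plain-text export filter
        "--outdir", str(out_dir),    # directory that receives the .txt file
        str(doc_path),
    ]


def convert_to_text(doc_path, out_dir, soffice="soffice"):
    """Run the conversion and return the path of the expected .txt file."""
    subprocess.run(build_convert_command(doc_path, out_dir, soffice), check=True)
    # Assumption: the output is named after the input file, with a .txt suffix.
    return Path(out_dir) / (Path(doc_path).stem + ".txt")
```

Calling `convert_to_text("report.doc", "out")` should leave the text in `out/report.txt`, which can then be read like any other file.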

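You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping the text out of a Word doc. It works well for simple documents (obviously, it loses the formatting). It's available through apt or as an RPM, or you could compile it yourself.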
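The subprocess call described here might look like the following minimal sketch. It assumes antiword is installed and on the PATH; the helper name `doc_to_text` and its parameters are hypothetical and not part of antiword itself.

```python
import shutil
import subprocess


def doc_to_text(doc_path, antiword="antiword"):
    """Return the text antiword prints for a .doc file (hypothetical helper)."""
    # Fail early with a clear message if the executable cannot be found.
    if shutil.which(antiword) is None:
        raise FileNotFoundError(f"{antiword!r} was not found on the PATH")
    # Running antiword with just a file name prints the document text to stdout.
    result = subprocess.run(
        [antiword, doc_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    # Assumption: the output is UTF-8 (this depends on the locale);
    # undecodable bytes are replaced rather than raising an error.
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the command as a list avoids shell quoting problems when the file name contains spaces.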

I know this is an old question, but I was recently trying to find a way to extract text from MS Word files, and the best solution I have found so far is wvLib:

http://wvware.sourceforge.net/

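After installing the library, using it from Python is pretty easy:

import subprocess

word_file = 'document.doc'        # example path to the input Word file
output_txt_file = 'document.txt'  # example path for the extracted text

# wvText extracts the text of the Word document into output_txt_file
subprocess.run(['wvText', word_file, output_txt_file], check=True)

# Read the extracted text back into the out variable
with open(output_txt_file) as f:
    out = f.read()

And that's it. Pretty much, what we're doing is running the wvText command (which extracts the text from the Word document into a text file) and then reading that output file. After this, the entire text of the Word document is in the out variable, ready to use.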

Hopefully this will help anyone who runs into a similar problem in the future.

(Note: I posted this on another question as well, but it seems relevant here, so please excuse the repost.)

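Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously, to use it in a Qt program you would have to spawn a process for it and so on, but the command line I've hacked together is: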

unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

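So that's:

unzip -p file.docx: -p == "unzip to stdout"

grep '<w:t': grab just the lines containing '<w:t' (as far as I can tell, <w:t> is the Word 2007 XML element for "text")

sed 's/<[^<]*>//g': remove everything inside tags

grep -v '^[[:space:]]*$': remove blank lines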

There is probably a more efficient way to do this, but it seems to work on the few documents I've tested it with.

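As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)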

If you want to use pure Python modules without calling a subprocess, you can use the zipfile Python module.

import zipfile

content = ""
# Load the DocX file as a zip archive
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# List the members of the archive
unpacked = docx.infolist()
# Find word/document.xml in the package and read it into a string
for item in unpacked:
    if item.filename == 'word/document.xml':
        # read() returns bytes, so decode them before doing string work
        content = docx.read(item.filename).decode('utf-8')

The content string needs to be cleaned up, however. One way of doing this is:

# Strip the XML tags from the content string, keeping only the text
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
    if '>' in item:
        # Whatever follows the '>' is the text that comes after the tag
        text = item.split('>')[-1]
        if text != '':
            fullyclean.append(text)

# Assemble a new string from the pure text pieces
content = " ".join(fullyclean)

But there is surely a more elegant way to clean up the string, probably using the re module.
Hope this helps.

Benjamin's answer is a pretty good one. I have just consolidated it...

import zipfile, re

docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)

Use the native Python docx module. Here's how to extract all the text from a document:

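import docx

filename = 'mydocument.docx'  # example path to the .docx file
document = docx.Document(filename)
# paragraph.text is already a str, so the pieces can be joined directly
docText = '\n\n'.join(
    paragraph.text for paragraph in document.paragraphs
)
print(docText)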

See the Python DocX site.

Also check out Textract, which pulls out tables and so on.

Parsing XML with regular expressions invokes Cthulhu. Don't do it!
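For anyone who wants to follow that advice, here is a minimal sketch that reads word/document.xml with a real XML parser from the standard library instead of a regex. The function name `docx_to_text` is made up for this example, and the sketch makes a simplifying assumption: it walks every `<w:p>` element, so text inside nested constructs such as text boxes may appear more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the w: prefix in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the text of a .docx file, one line per paragraph."""
    # source may be a file path or a binary file-like object.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element holds one run of text within the paragraph.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Unlike the tag-stripping approaches, this keeps paragraph boundaries and lets the parser deal with entities such as `&amp;`.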
