How do I extract data from a doc/docx file using Python

Notice: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/22756344/

Time: 2020-08-19 01:36:27  Source: igfitidea

Tags: python, ms-word, docx, doc

Asked by Stefan Urziceanu

I know there are similar questions out there, but I couldn't find one that answers my prayers. What I need is a way to access certain data from MS Word files and save it in an XML file. Reading up on python-docx did not help, as it only seems to allow writing into Word documents rather than reading them. To present my task exactly (or how I chose to approach it): I would like to search for a keyword or phrase in the document (the document contains tables) and extract the text data from the table where the keyword or phrase is found. Does anybody have any ideas?

Accepted answer by Stefan Urziceanu

It seems that pywin32 does the trick. You can iterate through all the tables in a document and through all the cells inside a table. It's a bit tricky to get the data (the last 2 characters of every entry have to be omitted), but otherwise it's about ten minutes of coding. If anyone needs additional details, please say so in the comments.
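
A rough sketch of what that approach might look like (this is not the original poster's code). It assumes Windows with Microsoft Word installed and the pywin32 package available; the function names read_word_tables and clean_cell_text are made up for illustration. The two characters to drop are the end-of-cell markers ("\r" followed by "\x07") that Word appends to each cell's Range.Text.

```python
def clean_cell_text(raw):
    """Drop the trailing end-of-cell marker ('\r\x07') from a cell's text."""
    return raw[:-2] if raw.endswith("\r\x07") else raw


def read_word_tables(path):
    """Return the text of every table cell, one list of strings per table.

    Assumption: runs on Windows with Word installed; `path` should be absolute.
    """
    import win32com.client  # imported lazily so the helper above works anywhere

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    doc = word.Documents.Open(path)
    try:
        tables = []
        for table in doc.Tables:
            # Range.Cells walks every cell in the table, including merged ones
            cells = [clean_cell_text(cell.Range.Text) for cell in table.Range.Cells]
            tables.append(cells)
        return tables
    finally:
        doc.Close(False)  # close without saving changes
        word.Quit()
```

Because this drives Word itself, it should also cope with old .doc files, at the cost of only running where Word is installed.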

Answer by edi9999

To search in a document with python-docx:

# Import the module (legacy API from mikemaccana/python-docx)
from docx import *

# Open the .docx file
document = opendocx('A document.docx')

# search() returns True if the string is found
found = search(document, 'your search string')

You also have a function to get the text of a document:

https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')
fullText = getdocumenttext(document)

Using https://github.com/mikemaccana/python-docx
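
The functions above belong to the old module-level API. Later python-docx releases expose a Document class instead, and with it the table search from the question could be sketched roughly as follows. This is an illustration under assumptions, not part of the original answer: tables_containing and load_and_search are hypothetical helper names, and only Document, tables, rows, cells and text come from the library.

```python
def tables_containing(document, keyword):
    """Return every table that mentions keyword, as rows of cell strings.

    `document` is expected to be a python-docx Document (anything exposing
    .tables -> .rows -> .cells -> .text works).
    """
    matches = []
    for table in document.tables:
        rows = [[cell.text for cell in row.cells] for row in table.rows]
        if any(keyword in text for row in rows for text in row):
            matches.append(rows)
    return matches


def load_and_search(path, keyword):
    """Open a .docx file with python-docx and search its tables."""
    from docx import Document  # assumption: a recent python-docx is installed

    return tables_containing(Document(path), keyword)
```

Something like load_and_search('A document.docx', 'your search string') would then return the matching tables, ready to be written out in whatever format is needed.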

Answer by Mike Robins

A docx file is a zip archive that contains the document as XML (word/document.xml). You can open the archive, read that file, and parse the data using ElementTree.

docx 是一个包含文档 XML 的 zip 文件。您可以打开 zip,阅读文档并使用 ElementTree 解析数据。

The advantage of this technique is that you don't need any extra Python libraries installed.

import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            print(''.join(node.text or '' for node in cell.iter(TEXT)))

See my Stack Overflow answer to "How to read contents of an Table in MS-Word file Using Python?" for more details and references.
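
To get from there to the task in the question (find the table that contains a keyword and save its text as XML), one possible extension of the same standard-library approach is sketched below. The helper names and the output layout (tables/table/row/cell elements) are assumptions chosen for illustration, not anything prescribed by Word. Note that nested tables are not treated specially: iter() is recursive, so an inner table's cells also show up in the outer table.

```python
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def cell_text(cell):
    """Join the text runs of each paragraph, then the paragraphs with newlines."""
    return '\n'.join(
        ''.join(node.text or '' for node in para.iter(W + 't'))
        for para in cell.iter(W + 'p')
    )


def tables_matching(docx_file, keyword):
    """Return each table whose cells mention keyword, as rows of cell strings."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    matches = []
    for table in root.iter(W + 'tbl'):
        rows = [[cell_text(cell) for cell in row.iter(W + 'tc')]
                for row in table.iter(W + 'tr')]
        if any(keyword in text for row in rows for text in row):
            matches.append(rows)
    return matches


def tables_to_xml(tables):
    """Serialize the extracted tables into a small, made-up XML layout."""
    root = ET.Element('tables')
    for rows in tables:
        table_el = ET.SubElement(root, 'table')
        for row in rows:
            row_el = ET.SubElement(table_el, 'row')
            for text in row:
                ET.SubElement(row_el, 'cell').text = text
    return ET.tostring(root, encoding='unicode')
```

Joining the runs of a paragraph before searching matters because Word often splits a single phrase across several text runs, so a per-run search could miss the keyword.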

Answer by Krissh

A simpler library, with image extraction capability:

pip install docx2txt

Then use the code below to read the docx file.

import docx2txt
text = docx2txt.process("file.docx")