How to extract data from doc/docx files using Python

Question

Asked by Stefan Urziceanu

I know there are similar questions out there, but I couldn't find one that would answer my prayers. What I need is a way to access certain data in MS Word files and save it in an XML file. Reading up on python-docx did not help, as it only seems to allow one to write into Word documents rather than read them. To present my task exactly (or how I chose to approach it): I would like to search for a keyword or phrase in the document (the document contains tables) and extract the text data from the table where the keyword or phrase is found. Does anybody have any ideas?

Answer 1

Accepted answer by Stefan Urziceanu

It seems that pywin32 does the trick. You can iterate through all the tables in a document and through all the cells inside each table. Getting the data is a bit tricky (the last 2 characters of every entry have to be omitted), but otherwise it's about ten minutes of coding. If anyone needs additional details, please say so in the comments.
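
A minimal sketch of that approach, assuming Windows with Word and pywin32 installed. It assumes the two trailing characters are Word's end-of-cell marker ('\r\x07'); the function names `clean_cell_text` and `read_tables` are made up for illustration and are not part of pywin32.

```python
def clean_cell_text(raw):
    """Drop the two-character end-of-cell marker from a cell's text."""
    # Assumption: Word ends each cell's Range.Text with '\r\x07'.
    if raw.endswith('\r\x07'):
        return raw[:-2]
    return raw


def read_tables(path):
    """Return each table in the document as a flat list of cell strings.

    Needs Windows, an installed copy of Word, and pywin32.
    """
    import win32com.client  # imported lazily: only available on Windows

    word = win32com.client.Dispatch('Word.Application')
    word.Visible = False
    try:
        doc = word.Documents.Open(path)  # use an absolute path
        try:
            return [
                [clean_cell_text(cell.Range.Text) for cell in table.Range.Cells]
                for table in doc.Tables
            ]
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

Because this drives Word itself, it should also work for old binary .doc files, which the zip-based approaches cannot open.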

Answer 2

Answered by edi9999

To search in a document with python-docx

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')

# search() returns True if the string is found
search(document,'your search string')

You also have a function to get the text of a document:

https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')
# getdocumenttext() returns the text as a list of paragraphs
fullText = getdocumenttext(document)

Both snippets use https://github.com/mikemaccana/python-docx
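
The snippets above use the older module from that repository. As a hedged sketch for comparison, the currently published python-docx package (installed with pip install python-docx) exposes a `Document` class that can read tables directly. The helper names `read_tables` and `matching_tables` below are hypothetical, not part of the library.

```python
def matching_tables(tables, keyword):
    """Keep only the tables in which at least one cell contains keyword."""
    return [
        rows for rows in tables
        if any(keyword in cell for row in rows for cell in row)
    ]


def read_tables(path):
    """Read every table of a .docx file into nested lists of cell text."""
    # Assumption: the current python-docx package is installed.
    from docx import Document  # imported lazily so the helper above works without it

    document = Document(path)
    return [
        [[cell.text for cell in row.cells] for row in table.rows]
        for table in document.tables
    ]
```

With these, `matching_tables(read_tables('A document.docx'), 'your search string')` would return the tables that mention the search string.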

Answer 3

Answered by Mike Robins

A docx file is a zip archive that contains the document as XML. You can open the zip, read the document part and parse the data using ElementTree.

The advantage of this technique is that you don't need any extra Python libraries installed.

import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            # print() call works on Python 3; empty text nodes count as ''
            print(''.join(node.text or '' for node in cell.iter(TEXT)))
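
The same loop can be extended to the task in the question: keep only the tables in which the keyword appears and write them out as XML. This is a sketch using only the standard library. The function names and the output element names (`tables`, `table`, `row`, `cell`) are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def cell_text(cell):
    """Join all text runs inside one table cell."""
    return ''.join(node.text or '' for node in cell.iter(W + 't'))


def tables_matching(docx_file, keyword):
    """Return an XML element holding every table with a cell containing keyword.

    docx_file may be a path or a file-like object.
    """
    with zipfile.ZipFile(docx_file) as docx:
        tree = ET.fromstring(docx.read('word/document.xml'))

    root = ET.Element('tables')  # hypothetical output layout
    for table in tree.iter(W + 'tbl'):
        rows = [
            [cell_text(cell) for cell in row.iter(W + 'tc')]
            for row in table.iter(W + 'tr')
        ]
        if any(keyword in text for row in rows for text in row):
            table_el = ET.SubElement(root, 'table')
            for row in rows:
                row_el = ET.SubElement(table_el, 'row')
                for text in row:
                    ET.SubElement(row_el, 'cell').text = text
    return root
```

To save the result, something like `ET.ElementTree(tables_matching('<path to docx file>', 'your search string')).write('tables.xml', encoding='utf-8', xml_declaration=True)` should do. Note that `iter()` also descends into nested tables, so a table inside a cell is visited both as part of its parent and on its own.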

See my Stack Overflow answer to "How to read contents of an Table in MS-Word file Using Python?" for more details and references.

Answer 4

Answered by Krissh

A simpler library, which can also extract images.

pip install docx2txt

Then use the code below to read the docx file.

import docx2txt
text = docx2txt.process("file.docx")
