How to extract text from an existing docx file using python-docx

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/25228106/

Time: 2020-08-18 19:54:25  Source: igfitidea

Tags: python, python-2.7, python-3.x, python-docx

Asked by Nancy

I'm trying to use the python-docx module (pip install python-docx), but it is very confusing: the test sample in the GitHub repo uses the opendocx function, while the readthedocs documentation uses the Document class. Even then, the docs only show how to add text to a docx file, not how to read an existing one.

The first one (opendocx) is not working and may be deprecated. For the second case I was trying to use:

from docx import Document

document = Document('test_doc.docx')
print(document.paragraphs)

It returned a list of <docx.text.Paragraph object at 0x... > objects.

Then I did:

for p in document.paragraphs:
    print(p.text)

It returned all the text, but a few things were missing: none of the URLs (CTRL+CLICK to go to URL) appeared in the text on the console.

What is the issue? Why are the URLs missing?

How can I get the complete text without iterating in a loop (something like open().read())?

Answered by scanny

There are two "generations" of python-docx. The initial generation ended with the 0.2.x versions and the "new" generation started at v0.3.0. The new generation is a ground-up, object-oriented rewrite of the legacy version. It has a distinct repository located here.

The opendocx() function is part of the legacy API. The documentation is for the new version. The legacy version has no documentation to speak of.

Neither reading nor writing hyperlinks is supported in the current version. That capability is on the roadmap, and the project is under active development. It turns out to be quite a broad API because Word has so much functionality. So we'll get to it, but probably not in the next month unless someone decides to focus on that aspect and contribute it.

UPDATE: Hyperlink support was added subsequent to this answer.

Answered by user4264327

I had a similar issue, so I found a workaround: remove the hyperlink tags using regular expressions so that only a paragraph tag remains. I posted this solution on https://github.com/python-openxml/python-docx/issues/85 BP
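For readers who want to see the idea in miniature, here is a rough sketch of that kind of workaround. It is not the code posted in the linked issue; the helper name strip_hyperlink_tags and the sample XML are made up for illustration, and it assumes hyperlinks appear as w:hyperlink elements wrapping ordinary runs.

```python
import re

# Hypothetical helper (not from the linked issue): drop the opening and
# closing <w:hyperlink> tags so the runs they wrap become direct children
# of the paragraph again.
HYPERLINK_TAGS = re.compile(r'</?w:hyperlink\b[^>]*>')


def strip_hyperlink_tags(document_xml):
    """Return the XML string with every w:hyperlink wrapper removed."""
    return HYPERLINK_TAGS.sub('', document_xml)


# Made-up paragraph fragment: one plain run followed by one hyperlinked run.
sample = (
    '<w:p><w:r><w:t>See </w:t></w:r>'
    '<w:hyperlink r:id="rId4"><w:r><w:t>the docs</w:t></w:r></w:hyperlink>'
    '</w:p>'
)
cleaned = strip_hyperlink_tags(sample)
print(cleaned)
```

In a real file the XML would have to be read from, and written back to, word/document.xml inside the docx archive before python-docx opens it. Note that the link target itself is discarded; only the visible link text survives.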

Answered by Ankush Shah

You can use python-docx2txt, which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.

Answered by Chinmoy Panda

You can try this:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Answered by user3732708

You can also try this:

from docx import Document

document = Document('demo.docx')
for para in document.paragraphs:
    print(para.text)

Answered by imanzabet

Without installing python-docx

A docx file is basically a zip file with several folders and files inside it. At the link below you can find a simple function that extracts the text from a docx file without needing to install python-docx and lxml, which sometimes cause problems:

http://etienned.github.io/posts/extract-text-from-word-docx-simply/
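As a rough illustration of the zip-based approach (a sketch using only the standard library, not necessarily the function from the linked post), something like the following could work. The function name docx_to_text is hypothetical, and the sample archive built at the bottom is a stand-in that contains only word/document.xml, not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_to_text(path_or_file):
    """Return the paragraphs of word/document.xml joined by newlines."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        # Collect every w:t descendant, including text inside hyperlinks.
        pieces = [node.text for node in para.iter(W_NS + 't') if node.text]
        if pieces:
            paragraphs.append(''.join(pieces))
    return '\n'.join(paragraphs)


# Build a tiny stand-in archive in memory so the sketch runs as-is.
DOCUMENT_XML = (
    '<w:document xmlns:w='
    '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
    '<w:p><w:r><w:t xml:space="preserve">Visit </w:t></w:r>'
    '<w:hyperlink><w:r><w:t>example</w:t></w:r></w:hyperlink></w:p>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', DOCUMENT_XML)
buffer.seek(0)

text = docx_to_text(buffer)
print(text)
```

With a real file you would call docx_to_text('test_doc.docx') instead. This sketch only reads the main document part, so headers, footers and footnotes are not included, and empty paragraphs are skipped.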

Answered by Xing Shi

Using python-docx, as @Chinmoy Panda's answer shows:

for para in doc.paragraphs:
    fullText.append(para.text)

However, para.text will lose the text inside w:smartTag elements (the corresponding GitHub issue is here: https://github.com/python-openxml/python-docx/issues/328), so you should use the following function instead:

def para2text(p):
    rs = p._element.xpath('.//w:t')
    return u" ".join([r.text for r in rs])
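
To see why the './/w:t' lookup helps, here is a small standard-library sketch. It assumes, as the linked issue suggests, that the missing text sits in runs nested inside a w:smartTag element rather than directly under the paragraph. The sample XML and the urn:example URI are invented for illustration, and ElementTree stands in for the lxml element that python-docx exposes.

```python
import xml.etree.ElementTree as ET

NS = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

# Made-up paragraph: one ordinary run, then a run wrapped in w:smartTag.
PARAGRAPH_XML = (
    '<w:p xmlns:w='
    '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:t xml:space="preserve">Meet me in </w:t></w:r>'
    '<w:smartTag w:uri="urn:example" w:element="place">'
    '<w:r><w:t>Paris</w:t></w:r>'
    '</w:smartTag>'
    '</w:p>'
)
para = ET.fromstring(PARAGRAPH_XML)

# Only runs that are direct children of the paragraph: the smart-tag run
# is skipped, so its text is lost.
direct_text = ''.join(t.text for t in para.findall('./w:r/w:t', NS))

# Every w:t descendant at any depth, the same idea as the XPath above.
all_text = ''.join(t.text for t in para.findall('.//w:t', NS))

print(repr(direct_text))
print(repr(all_text))
```

One design point to be aware of: para2text joins the pieces with a space, which can insert spaces inside a word when Word has split that word across several runs. This sketch joins with an empty string instead; which is better depends on the documents you are processing.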