How to extract text from an existing docx file using python-docx
Original question: http://stackoverflow.com/questions/25228106/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow (not to this page).
Asked by Nancy
I'm trying to use the python-docx module (pip install python-docx), but it is confusing: the test sample in the GitHub repo uses the opendocx function, while the readthedocs documentation uses the Document class. Even then, they only show how to add text to a docx file, not how to read an existing one.
The first one (opendocx) is not working; it may be deprecated. For the second case I was trying to use:
from docx import Document
document = Document('test_doc.docx')
print(document.paragraphs)
It returned a list of <docx.text.Paragraph object at 0x... >
Then I did:
for p in document.paragraphs:
    print(p.text)
It returned all the text, but a few things were missing: none of the URLs (CTRL+CLICK to go to URL) were present in the text on the console.
What is the issue? Why are the URLs missing?
How could I get the complete text without iterating over a loop (something like open().read())?
Answered by scanny
There are two "generations" of python-docx. The initial generation ended with the 0.2.x versions and the "new" generation started at v0.3.0. The new generation is a ground-up, object-oriented rewrite of the legacy version. It has a distinct repository located here.
The opendocx() function is part of the legacy API. The documentation is for the new version. The legacy version has no documentation to speak of.
Neither reading nor writing hyperlinks is supported in the current version. That capability is on the roadmap, and the project is under active development. It turns out to be quite a broad API because Word has so much functionality. So we'll get to it, but probably not in the next month unless someone decides to focus on that aspect and contribute it.
UPDATE: Hyperlink support was added subsequent to this answer.
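To illustrate why the URLs are absent from the paragraph text: in the docx format, a w:hyperlink element holds only the visible link text plus an r:id, and the URL itself is stored in a separate relationships part. The following is a minimal standard-library sketch, not python-docx's API; the function name docx_hyperlinks is hypothetical, and it assumes the file contains both word/document.xml and word/_rels/document.xml.rels.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the document body and the relationships part.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def docx_hyperlinks(path):
    """Return (link text, target) pairs for hyperlinks in the document body.

    Hypothetical helper for illustration; `path` may be a filename or a
    file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))

    # Map relationship ids (for example "rId5") to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(REL + "Relationship")
    }

    links = []
    for link in body.iter(W + "hyperlink"):
        text = "".join(t.text or "" for t in link.iter(W + "t"))
        # Links to internal bookmarks have no r:id, so the target may be None.
        links.append((text, targets.get(link.get(R + "id"))))
    return links
```

Calling docx_hyperlinks('test_doc.docx') would return a list of tuples, one per hyperlink, in document order.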
Answered by user4264327
I had a similar issue, so I found a workaround (remove the hyperlink tags using regular expressions so that only a paragraph tag remains). I posted this solution on https://github.com/python-openxml/python-docx/issues/85
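A rough sketch of the idea behind this workaround, not necessarily the code posted in the linked issue: strip the w:hyperlink wrapper tags from the paragraph XML so that the runs inside them become ordinary children of the paragraph. The pattern and the function name strip_hyperlink_tags are assumptions for illustration, and the regular expression assumes no attribute value contains a literal ">".

```python
import re

# Matches opening tags such as <w:hyperlink r:id="rId5"> and the
# closing tag </w:hyperlink>.
HYPERLINK_TAG = re.compile(r"</?w:hyperlink\b[^>]*>")


def strip_hyperlink_tags(xml):
    """Remove w:hyperlink wrappers but keep the runs inside them."""
    return HYPERLINK_TAG.sub("", xml)
```

The function works on an XML string, for example the contents of word/document.xml; the result still has to be parsed again before the text can be read from it.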
Answered by Ankush Shah
You can use python-docx2txt, which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.
Answered by Chinmoy Panda
You can try this:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
Answered by user3732708
You can also try this:
from docx import Document

document = Document('demo.docx')
for para in document.paragraphs:
    print(para.text)
Answered by imanzabet
Without installing python-docx
A docx file is basically a zip file with several folders and files within it. At the link below you can find a simple function to extract the text from a docx file without needing to install python-docx and lxml, which sometimes cause problems:
http://etienned.github.io/posts/extract-text-from-word-docx-simply/
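As a rough illustration of the zip-based approach (a sketch using only the standard library, not the code from the linked post): the main body of a docx file is stored in word/document.xml, paragraphs are w:p elements, and text is held in w:t elements. The function name docx_to_text is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; `path` may be a filename or a
    file-like object.
    """
    # A .docx file is a zip archive; the body lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # Collect every w:t descendant, so text nested inside
        # w:hyperlink or w:smartTag elements is included as well.
        pieces = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(pieces))
    return "\n".join(lines)
```

This returns the whole text as a single string, for example print(docx_to_text('demo.docx')). Headers, footers and footnotes are stored in other parts of the archive and are not read by this sketch.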
Answered by Xing Shi
Using python-docx, as @Chinmoy Panda's answer shows:
for para in doc.paragraphs:
    fullText.append(para.text)
However, para.text will lose the text inside w:smartTag elements (the corresponding GitHub issue is here: https://github.com/python-openxml/python-docx/issues/328), so you should use the following function instead:
def para2text(p):
    rs = p._element.xpath('.//w:t')
    return u" ".join([r.text for r in rs])
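To see why searching all descendants helps, here is a small standard-library sketch on hand-written paragraph XML (the sample is made up for illustration, not taken from a real file). It assumes, as the linked issue describes, that the text is lost because only runs that are direct children of the paragraph are visited; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written paragraph XML: the middle run is wrapped in w:smartTag.
SAMPLE = (
    '<w:p xmlns:w="%s">'
    "<w:r><w:t>Visit </w:t></w:r>"
    "<w:smartTag><w:r><w:t>Paris</w:t></w:r></w:smartTag>"
    "<w:r><w:t> today</w:t></w:r>"
    "</w:p>" % W
)


def direct_run_text(p):
    """Text from runs that are direct children of the paragraph."""
    return "".join(t.text or "" for t in p.findall("w:r/w:t", NS))


def all_descendant_text(p):
    """Text from every w:t below the paragraph, however deeply nested."""
    return "".join(t.text or "" for t in p.findall(".//w:t", NS))


paragraph = ET.fromstring(SAMPLE)
print(repr(direct_run_text(paragraph)))      # the word inside w:smartTag is missing
print(repr(all_descendant_text(paragraph)))  # the nested word is included
```

The sketch joins the pieces with an empty string because the w:t elements here already carry their own spacing; joining with a space, as para2text does, can insert extra spaces when Word splits a sentence or a word across several runs.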

