How to use python-docx to replace text in a Word document and save
Note: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Original source: http://stackoverflow.com/questions/24805671/
Asked by user2738815
The oodocx module mentioned in the same page refers the user to an /examples folder that does not seem to be there.
I have read the documentation of python-docx 0.7.2, plus everything I could find in Stackoverflow on the subject, so please believe that I have done my “homework”.
Python is the only language I know (beginner+, maybe intermediate), so please do not assume any knowledge of C, Unix, xml, etc.
Task : Open a ms-word 2007+ document with a single line of text in it (to keep things simple) and replace any “key” word in Dictionary that occurs in that line of text with its dictionary value. Then close the document keeping everything else the same.
Line of text (for example) “We shall linger in the chambers of the sea.”
from docx import Document
document = Document('/Users/umityalcin/Desktop/Test.docx')
Dictionary = {'sea': 'ocean'}
sections = document.sections
for section in sections:
    print(section.start_type)
# Now, I would like to navigate, focus on, get to, whatever to the section that has my
# single line of text and execute a find/replace using the dictionary above.
# then save the document in the usual way.
document.save('/Users/umityalcin/Desktop/Test.docx')
I am not seeing anything in the documentation that allows me to do this—maybe it is there but I don't get it because everything is not spelled-out at my level.
I have followed other suggestions on this site and have tried to use earlier versions of the module (https://github.com/mikemaccana/python-docx) that is supposed to have "methods like replace, advReplace" as follows: I open the source-code in the python interpreter, and add the following at the end (this is to avoid clashes with the already installed version 0.7.2):
document = opendocx('/Users/umityalcin/Desktop/Test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    if word in Dictionary.keys():
        print "found it", Dictionary[word]
        document = replace(document, word, Dictionary[word])
savedocx(document, coreprops, appprops, contenttypes, websettings,
         wordrelationships, output, imagefiledict=None)
Running this produces the following error message:
NameError: name 'coreprops' is not defined
Maybe I am trying to do something that cannot be done—but I would appreciate your help if I am missing something simple.
If this matters, I am using the 64 bit version of Enthought's Canopy on OSX 10.9.3
Accepted answer by scanny
The current version of python-docx does not have a search() function or a replace() function. These are requested fairly frequently, but an implementation for the general case is quite tricky and it hasn't risen to the top of the backlog yet.
Several folks have had success though, getting done what they need, using the facilities already present. Here's an example. It has nothing to do with sections by the way :)
for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        paragraph.text = 'new text containing ocean'
To search in Tables as well, you would need to use something like:
for table in document.tables:
    for cell in table.cells:
        for paragraph in cell.paragraphs:
            if 'sea' in paragraph.text:
                ...
If you pursue this path, you'll probably discover pretty quickly what the complexities are. If you replace the entire text of a paragraph, that will remove any character-level formatting, like a word or phrase in bold or italic.
By the way, the code from @wnnmaw's answer is for the legacy version of python-docx and won't work at all with versions after 0.3.0.
Answer by wnnmaw
The problem with your second attempt is that you haven't defined the parameters that savedocx needs. You need to do something like this before you save:
relationships = docx.relationshiplist()
title = "Document Title"
subject = "Document Subject"
creator = "Document Creator"
keywords = []
coreprops = docx.coreproperties(title=title, subject=subject, creator=creator,
                                keywords=keywords)
app = docx.appproperties()
content = docx.contenttypes()
web = docx.websettings()
word = docx.wordrelationships(relationships)
output = r"path\to\where\you\want\to\save"
Answer by szum
I needed something to replace regular expressions in docx. I took scanny's answer. To handle style I used the answer from "Python docx Replace string in paragraph while keeping style", added a recursive call to handle nested tables, and came up with something like this:
import re
from docx import Document

def docx_replace_regex(doc_obj, regex , replace):
    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if regex.search(inline[i].text):
                    text = regex.sub(replace, inline[i].text)
                    inline[i].text = text
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex , replace)

regex1 = re.compile(r"your regex")
replace1 = r"your replace string"
filename = "test.docx"
doc = Document(filename)
docx_replace_regex(doc, regex1 , replace1)
doc.save('result1.docx')
To iterate over dictionary:
for word, replacement in dictionary.items():
    word_re=re.compile(word)
    docx_replace_regex(doc, word_re , replacement)
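A possible refinement of the dictionary loop, offered as a sketch rather than as part of the original answer: compiling each key on its own treats the key as a regular expression, and a later replacement can rewrite the output of an earlier one. The hypothetical helper `build_dictionary_regex` below (the name is my own) escapes the keys and joins them into one pattern, so every key is replaced in a single pass. It assumes a non-empty dictionary whose keys are non-empty plain strings.

```python
import re

def build_dictionary_regex(mapping):
    """Return (pattern, repl) that replace every key of `mapping` in one pass.

    Hypothetical helper for illustration; it is not part of python-docx.
    """
    if not mapping:
        raise ValueError("mapping must not be empty")
    # Try the longest keys first, so "seaside" wins over "sea".
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape makes each key match literally, not as a regex.
    pattern = re.compile("|".join(re.escape(key) for key in keys))

    def repl(match):
        # Look up the replacement for whichever key matched.
        return mapping[match.group(0)]

    return pattern, repl

pattern, repl = build_dictionary_regex({"sea": "ocean", "chambers": "halls"})
line = "We shall linger in the chambers of the sea."
print(pattern.sub(repl, line))
```

Since `regex.sub` accepts a function as the replacement, the pair should also work as `docx_replace_regex(doc, pattern, repl)` with the function defined earlier in this answer.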
Note that this solution will replace a regex match only if the whole match has the same style in the document.
Also, if text is edited after saving, text with the same style might end up in separate runs. For example, if you open a document that has the string "testabcd", change it to "test1abcd" and save, then even though it is all the same style there are 3 separate runs, "test", "1" and "abcd"; in this case replacement of test1 won't work.
Word splits the text this way to track changes in the document. To merge it back into one run, in Word go to "Options", "Trust Center" and, in "Privacy Options", untick "Store random numbers to improve combine accuracy", then save the document.
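If changing the Word setting is not an option, the split-run case can also be handled in code. The following is only a sketch under stated assumptions: `replace_across_runs` is a hypothetical helper (not a python-docx function) that works on any sequence of objects with a writable `text` attribute, which is what `paragraph.runs` provides. The replacement text is written into the first run that the match touches, so it takes that run's formatting, and the matched characters are cut out of the following runs.

```python
from types import SimpleNamespace

def replace_across_runs(runs, old, new):
    """Replace each occurrence of `old` with `new`, even when a match
    is split over several runs. Returns the number of replacements."""
    if not old:
        return 0
    count = 0
    search_from = 0
    while True:
        full_text = "".join(run.text for run in runs)
        pos = full_text.find(old, search_from)
        if pos < 0:
            return count
        end = pos + len(old)
        offset = 0
        first = True
        for run in runs:
            text = run.text
            run_start, run_end = offset, offset + len(text)
            offset = run_end
            if run_end <= pos or run_start >= end:
                continue  # this run does not overlap the match
            lo = max(pos, run_start) - run_start
            hi = min(end, run_end) - run_start
            # Only the first overlapping run receives the new text.
            run.text = text[:lo] + (new if first else "") + text[hi:]
            first = False
        count += 1
        # Continue after the inserted text so `new` is never re-scanned.
        search_from = pos + len(new)

# Stand-in objects that mimic the three runs from the example above.
runs = [SimpleNamespace(text=t) for t in ("test", "1", "abcd")]
replace_across_runs(runs, "test1", "done")
print([run.text for run in runs])
```

With a real document this would be called as `replace_across_runs(p.runs, old, new)` for each paragraph `p`. Runs that become empty are left in place rather than deleted.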
Answer by Soferio
The Office Dev Centre has an entry in which a developer has published (MIT licensed at this time) a description of a couple of algorithms that appear to suggest a solution for this (albeit in C#, so they would require porting): MS Dev Centre posting
Answer by Toskan
he changed the API in docx py again...
for the sanity of everyone coming here:
import datetime
import os
from decimal import Decimal
from typing import NamedTuple
from docx import Document
from docx.document import Document as nDocument

class DocxInvoiceArg(NamedTuple):
    invoice_to: str
    date_from: str
    date_to: str
    project_name: str
    quantity: float
    hourly: int
    currency: str
    bank_details: str

class DocxService():
    tokens = [
        '@INVOICE_TO@',
        '@IDATE_FROM@',
        '@IDATE_TO@',
        '@INVOICE_NR@',
        '@PROJECTNAME@',
        '@QUANTITY@',
        '@HOURLY@',
        '@CURRENCY@',
        '@TOTAL@',
        '@BANK_DETAILS@',
    ]

    def __init__(self, replace_vals: DocxInvoiceArg):
        total = replace_vals.quantity * replace_vals.hourly
        invoice_nr = replace_vals.project_name + datetime.datetime.strptime(replace_vals.date_to, '%Y-%m-%d').strftime('%Y%m%d')
        self.replace_vals = [
            {'search': self.tokens[0], 'replace': replace_vals.invoice_to },
            {'search': self.tokens[1], 'replace': replace_vals.date_from },
            {'search': self.tokens[2], 'replace': replace_vals.date_to },
            {'search': self.tokens[3], 'replace': invoice_nr },
            {'search': self.tokens[4], 'replace': replace_vals.project_name },
            {'search': self.tokens[5], 'replace': replace_vals.quantity },
            {'search': self.tokens[6], 'replace': replace_vals.hourly },
            {'search': self.tokens[7], 'replace': replace_vals.currency },
            {'search': self.tokens[8], 'replace': total },
            {'search': self.tokens[9], 'replace': 'asdfasdfasdfdasf'},
        ]
        self.doc_path_template = os.path.dirname(os.path.realpath(__file__))+'/docs/'
        self.doc_path_output = self.doc_path_template + 'output/'
        self.document: nDocument = Document(self.doc_path_template + 'invoice_placeholder.docx')

    def save(self):
        for p in self.document.paragraphs:
            self._docx_replace_text(p)
        tables = self.document.tables
        self._loop_tables(tables)
        self.document.save(self.doc_path_output + 'testiboi3.docx')

    def _loop_tables(self, tables):
        for table in tables:
            for index, row in enumerate(table.rows):
                for cell in table.row_cells(index):
                    if cell.tables:
                        self._loop_tables(cell.tables)
                    for p in cell.paragraphs:
                        self._docx_replace_text(p)
            # for cells in column.
            # for cell in table.columns:

    def _docx_replace_text(self, p):
        print(p.text)
        for el in self.replace_vals:
            if (el['search'] in p.text):
                inline = p.runs
                # Loop added to work with runs (strings with same style)
                for i in range(len(inline)):
                    print(inline[i].text)
                    if el['search'] in inline[i].text:
                        text = inline[i].text.replace(el['search'], str(el['replace']))
                        inline[i].text = text
                print(p.text)
Test case:
from django.test import SimpleTestCase
from docx.table import Table, _Rows
from toggleapi.services.DocxService import DocxService, DocxInvoiceArg

class TestDocxService(SimpleTestCase):
    def test_document_read(self):
        ds = DocxService(DocxInvoiceArg(invoice_to="""
WAW test1
Multi myfriend
""",date_from="2019-08-01", date_to="2019-08-30", project_name='WAW', quantity=10.5, hourly=40, currency='USD',bank_details="""
Paypal to:
[email protected]"""))
        ds.save()
Have the folders docs and docs/output/ in the same folder where you have DocxService.py.
be sure to parameterize and replace stuff
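One way to read "parameterize" here, shown as a rough sketch rather than as the answer author's own code: derive every placeholder value from the input tuple, so that nothing (such as the bank details) stays hard-coded. The names `InvoiceArgs` and `build_token_map` are hypothetical; the tokens and the invoice-number format are the ones used in the class above.

```python
from datetime import datetime
from typing import NamedTuple

class InvoiceArgs(NamedTuple):
    invoice_to: str
    date_from: str
    date_to: str
    project_name: str
    quantity: float
    hourly: int
    currency: str
    bank_details: str

def build_token_map(args):
    """Map each placeholder token to the string that should replace it."""
    total = args.quantity * args.hourly
    # Invoice number = project name + end date as YYYYMMDD.
    end_date = datetime.strptime(args.date_to, "%Y-%m-%d")
    invoice_nr = args.project_name + end_date.strftime("%Y%m%d")
    values = {
        "@INVOICE_TO@": args.invoice_to,
        "@IDATE_FROM@": args.date_from,
        "@IDATE_TO@": args.date_to,
        "@INVOICE_NR@": invoice_nr,
        "@PROJECTNAME@": args.project_name,
        "@QUANTITY@": args.quantity,
        "@HOURLY@": args.hourly,
        "@CURRENCY@": args.currency,
        "@TOTAL@": total,
        "@BANK_DETAILS@": args.bank_details,
    }
    # Run text must be a string, so convert the numbers up front.
    return {token: str(value) for token, value in values.items()}

tokens = build_token_map(InvoiceArgs(
    invoice_to="WAW test1", date_from="2019-08-01", date_to="2019-08-30",
    project_name="WAW", quantity=10.5, hourly=40, currency="USD",
    bank_details="Paypal to: (address here)"))
print(tokens["@INVOICE_NR@"], tokens["@TOTAL@"])
```

A dictionary like this could then replace the list of search/replace pairs built in `__init__`.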
Answer by Basj
For the table case, I had to modify @scanny's answer to:
for table in doc.tables:
    for col in table.columns:
        for cell in col.cells:
            for p in cell.paragraphs:
to make it work. Indeed, this does not seem to work with the current state of the API:
for table in document.tables:
    for cell in table.cells:
Same problem with the code from here: https://github.com/python-openxml/python-docx/issues/30#issuecomment-38658149
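To avoid repeating the table loops in every script, the traversal can be wrapped in a generator. This is a minimal sketch, assuming only that documents and cells both expose `paragraphs` and `tables`, that tables expose `rows`, and that rows expose `cells`, as the other answers on this page rely on. `iter_paragraphs` is a hypothetical name, and the demo uses stand-in objects instead of a real .docx file.

```python
from types import SimpleNamespace

def iter_paragraphs(container):
    """Yield the paragraphs of a document or cell, then those of every
    table inside it, descending into nested tables."""
    for paragraph in container.paragraphs:
        yield paragraph
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # A cell can hold paragraphs and further tables.
                yield from iter_paragraphs(cell)

# Stand-in objects shaped like a document with one nested table.
inner_cell = SimpleNamespace(paragraphs=["nested"], tables=[])
inner_table = SimpleNamespace(rows=[SimpleNamespace(cells=[inner_cell])])
outer_cell = SimpleNamespace(paragraphs=["cell"], tables=[inner_table])
outer_table = SimpleNamespace(rows=[SimpleNamespace(cells=[outer_cell])])
fake_doc = SimpleNamespace(paragraphs=["body"], tables=[outer_table])
print(list(iter_paragraphs(fake_doc)))
```

With a real document the loop becomes `for p in iter_paragraphs(doc):`. Merged cells may be visited more than once, and headers, footers and text boxes are not covered by this sketch.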
Answer by poin
I got a lot of help from the earlier answers, but for me the code below works the way the simple find-and-replace function in Word does. Hope this helps.
#!pip install python-docx
#start from here if python-docx is installed
from docx import Document
#open the document
doc=Document('./test.docx')
Dictionary = {"sea": "ocean", "find_this_text":"new_text"}
for i in Dictionary:
    for p in doc.paragraphs:
        if p.text.find(i)>=0:
            p.text=p.text.replace(i,Dictionary[i])
#save changed document
doc.save('./test.docx')
The above solution has limitations: 1) the paragraph containing "find_this_text" will become plain text without any formatting, 2) content controls that are in the same paragraph as "find_this_text" will be deleted, and 3) "find_this_text" in either content controls or tables will not be changed.