How to use python-docx to replace text in a Word document and save

Question

Asked by user2738815

The oodocx module mentioned in the same page refers the user to an /examples folder that does not seem to be there.
I have read the documentation of python-docx 0.7.2, plus everything I could find in Stackoverflow on the subject, so please believe that I have done my “homework”.

Python is the only language I know (beginner+, maybe intermediate), so please do not assume any knowledge of C, Unix, xml, etc.

Task : Open a ms-word 2007+ document with a single line of text in it (to keep things simple) and replace any “key” word in Dictionary that occurs in that line of text with its dictionary value. Then close the document keeping everything else the same.

Line of text (for example) “We shall linger in the chambers of the sea.”

from docx import Document

document = Document('/Users/umityalcin/Desktop/Test.docx')

Dictionary = {'sea': 'ocean'}

sections = document.sections
for section in sections:
    print(section.start_type)

#Now, I would like to navigate, focus on, get to, whatever to the section that has my
#single line of text and execute a find/replace using the dictionary above.
#then save the document in the usual way.

document.save('/Users/umityalcin/Desktop/Test.docx')

I am not seeing anything in the documentation that allows me to do this—maybe it is there but I don't get it because everything is not spelled-out at my level.

I have followed other suggestions on this site and have tried to use earlier versions of the module (https://github.com/mikemaccana/python-docx) that is supposed to have "methods like replace, advReplace" as follows: I open the source-code in the python interpreter, and add the following at the end (this is to avoid clashes with the already installed version 0.7.2):

document = opendocx('/Users/umityalcin/Desktop/Test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    if word in Dictionary.keys():
        print "found it", Dictionary[word]
        document = replace(document, word, Dictionary[word])
savedocx(document, coreprops, appprops, contenttypes, websettings,
    wordrelationships, output, imagefiledict=None)

Running this produces the following error message:

NameError: name 'coreprops' is not defined

Maybe I am trying to do something that cannot be done—but I would appreciate your help if I am missing something simple.

If this matters, I am using the 64 bit version of Enthought's Canopy on OSX 10.9.3

Answer 1

Accepted answer by scanny

The current version of python-docx does not have a search() function or a replace() function. These are requested fairly frequently, but an implementation for the general case is quite tricky and it hasn't risen to the top of the backlog yet.

Several folks have had success though, getting done what they need, using the facilities already present. Here's an example. It has nothing to do with sections by the way :)

for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        paragraph.text = 'new text containing ocean'

To search in Tables as well, you would need to use something like:

for table in document.tables:
    for cell in table.cells:
        for paragraph in cell.paragraphs:
            if 'sea' in paragraph.text:
               ...

If you pursue this path, you'll probably discover pretty quickly what the complexities are. If you replace the entire text of a paragraph, that will remove any character-level formatting, like a word or phrase in bold or italic.
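
The formatting loss described above comes from assigning to `paragraph.text`, which replaces the paragraph content with a single run. A minimal sketch of one way around it, assuming each key sits entirely inside one run, is to edit `run.text` instead. The helper name `replace_in_runs` is made up for illustration; it relies only on the `runs` and `text` attributes that python-docx paragraphs and runs expose.

```python
def replace_in_runs(paragraph, mapping):
    """Replace each key of `mapping` with its value inside individual runs.

    Returns the number of runs that were changed.
    """
    changed = 0
    for run in paragraph.runs:
        new_text = run.text
        for old, new in mapping.items():
            new_text = new_text.replace(old, new)
        if new_text != run.text:
            run.text = new_text  # only the text changes; the run object is reused
            changed += 1
    return changed


# Hypothetical usage with python-docx (the file name is an assumption):
#   from docx import Document
#   document = Document("Test.docx")
#   for paragraph in document.paragraphs:
#       replace_in_runs(paragraph, {"sea": "ocean"})
#   document.save("Test.docx")
```

Because only the text of existing runs is touched, bold or italic settings on those runs should be left alone. A key that Word has split across two runs will not be found by this sketch.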

By the way, the code from @wnnmaw's answer is for the legacy version of python-docx and won't work at all with versions after 0.3.0.

Answer 2

Answer by wnnmaw

The problem with your second attempt is that you haven't defined the parameters that savedocx needs. You need to do something like this before you save:

relationships = docx.relationshiplist()
title = "Document Title"
subject = "Document Subject"
creator = "Document Creator"
keywords = []

coreprops = docx.coreproperties(title=title, subject=subject, creator=creator,
                       keywords=keywords)
app = docx.appproperties()
content = docx.contenttypes()
web = docx.websettings()
word = docx.wordrelationships(relationships)
output = r"path\to\where\you\want\to\save"

Answer 3

Answer by szum

I needed something to replace regular expressions in docx. I took scanny's answer as a starting point. To preserve style, I used the answer from "Python docx Replace string in paragraph while keeping style", added a recursive call to handle nested tables, and came up with something like this:

import re
from docx import Document

def docx_replace_regex(doc_obj, regex , replace):

    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if regex.search(inline[i].text):
                    text = regex.sub(replace, inline[i].text)
                    inline[i].text = text

    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex , replace)



regex1 = re.compile(r"your regex")
replace1 = r"your replace string"
filename = "test.docx"
doc = Document(filename)
docx_replace_regex(doc, regex1 , replace1)
doc.save('result1.docx')

To iterate over dictionary:

for word, replacement in dictionary.items():
    word_re=re.compile(word)
    docx_replace_regex(doc, word_re , replacement)

Note that this solution replaces a match only if the entire matched text has the same style in the document, that is, it sits within a single run.

Also, if text is edited after saving, text with the same style might end up in separate runs. For example, if you open a document that contains the string "testabcd", change it to "test1abcd", and save, then even though it is all the same style there are 3 separate runs: "test", "1", and "abcd". In this case, replacing test1 won't work.

This is for tracking changes in the document. To merge it into one run, in Word you need to go to "Options", "Trust Center", and in "Privacy Options" untick "Store random numbers to improve combine accuracy", then save the document.
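
When changing the Word setting is not an option, the split-run case ("test", "1", "abcd") can also be handled in code. What follows is only a sketch under stated assumptions, not a python-docx feature: the hypothetical helper `replace_across_runs` joins the run texts, locates the match, writes the replacement into the first run the match touches, and trims the matched characters from the runs after it. The replacement therefore takes on the first run's style.

```python
def replace_across_runs(paragraph, search, replacement):
    """Replace `search` even when it is split over several runs.

    Returns the number of replacements made.
    """
    if not search:
        return 0
    count = 0
    search_from = 0
    while True:
        runs = paragraph.runs
        texts = [run.text for run in runs]
        start = "".join(texts).find(search, search_from)
        if start == -1:
            return count
        end = start + len(search)
        pos = 0
        first_hit = True
        for run, text in zip(runs, texts):
            run_start, run_end = pos, pos + len(text)
            pos = run_end
            if run_end <= start or run_start >= end:
                continue  # this run does not overlap the match
            head = text[:max(start - run_start, 0)]  # text before the match
            tail = text[max(end - run_start, 0):]    # text after the match
            if first_hit:
                run.text = head + replacement + tail
                first_hit = False
            else:
                run.text = head + tail
        count += 1
        # Continue after the inserted text so a replacement is never re-matched.
        search_from = start + len(replacement)
```

Runs that lose all their text are left in place as empty runs rather than removed, which keeps the sketch independent of the underlying XML.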

Answer 4

Answer by Soferio

The Office Dev Centre has an entry in which a developer has published (MIT licensed at this time) a description of a couple of algorithms that appear to suggest a solution for this (albeit in C#, so they would require porting): MS Dev Centre posting

Answer 5

Answer by Toskan

The API in docx py has been changed again...

For the sanity of everyone coming here:

import datetime
import os
from decimal import Decimal
from typing import NamedTuple

from docx import Document
from docx.document import Document as nDocument


class DocxInvoiceArg(NamedTuple):
  invoice_to: str
  date_from: str
  date_to: str
  project_name: str
  quantity: float
  hourly: int
  currency: str
  bank_details: str


class DocxService():
  tokens = [
    '@INVOICE_TO@',
    '@IDATE_FROM@',
    '@IDATE_TO@',
    '@INVOICE_NR@',
    '@PROJECTNAME@',
    '@QUANTITY@',
    '@HOURLY@',
    '@CURRENCY@',
    '@TOTAL@',
    '@BANK_DETAILS@',
  ]

  def __init__(self, replace_vals: DocxInvoiceArg):
    total = replace_vals.quantity * replace_vals.hourly
    invoice_nr = replace_vals.project_name + datetime.datetime.strptime(replace_vals.date_to, '%Y-%m-%d').strftime('%Y%m%d')
    self.replace_vals = [
      {'search': self.tokens[0], 'replace': replace_vals.invoice_to },
      {'search': self.tokens[1], 'replace': replace_vals.date_from },
      {'search': self.tokens[2], 'replace': replace_vals.date_to },
      {'search': self.tokens[3], 'replace': invoice_nr },
      {'search': self.tokens[4], 'replace': replace_vals.project_name },
      {'search': self.tokens[5], 'replace': replace_vals.quantity },
      {'search': self.tokens[6], 'replace': replace_vals.hourly },
      {'search': self.tokens[7], 'replace': replace_vals.currency },
      {'search': self.tokens[8], 'replace': total },
      {'search': self.tokens[9], 'replace': 'asdfasdfasdfdasf'},
    ]
    self.doc_path_template = os.path.dirname(os.path.realpath(__file__))+'/docs/'
    self.doc_path_output = self.doc_path_template + 'output/'
    self.document: nDocument = Document(self.doc_path_template + 'invoice_placeholder.docx')


  def save(self):
    for p in self.document.paragraphs:
      self._docx_replace_text(p)
    tables = self.document.tables
    self._loop_tables(tables)
    self.document.save(self.doc_path_output + 'testiboi3.docx')

  def _loop_tables(self, tables):
    for table in tables:
      for index, row in enumerate(table.rows):
        for cell in table.row_cells(index):
          if cell.tables:
            self._loop_tables(cell.tables)
          for p in cell.paragraphs:
            self._docx_replace_text(p)

        # for cells in column.
        # for cell in table.columns:

  def _docx_replace_text(self, p):
    print(p.text)
    for el in self.replace_vals:
      if (el['search'] in p.text):
        inline = p.runs
        # Loop added to work with runs (strings with same style)
        for i in range(len(inline)):
          print(inline[i].text)
          if el['search'] in inline[i].text:
            text = inline[i].text.replace(el['search'], str(el['replace']))
            inline[i].text = text
        print(p.text)

Test case:

from django.test import SimpleTestCase
from docx.table import Table, _Rows

from toggleapi.services.DocxService import DocxService, DocxInvoiceArg


class TestDocxService(SimpleTestCase):

  def test_document_read(self):
    ds = DocxService(DocxInvoiceArg(invoice_to="""
    WAW test1
    Multi myfriend
    """,date_from="2019-08-01", date_to="2019-08-30", project_name='WAW', quantity=10.5, hourly=40, currency='USD',bank_details="""
    Paypal to:
    [email protected]"""))

    ds.save()

Have folders docs and docs/output/ in the same folder where you have DocxService.py

Be sure to parameterize and replace things as needed.

Answer 6

Answer by Basj

For the table case, I had to modify @scanny's answer to:

for table in doc.tables:
    for col in table.columns:
        for cell in col.cells:
            for p in cell.paragraphs:

to make it work. Indeed, this does not seem to work with the current state of the API:

for table in document.tables:
    for cell in table.cells:

Same problem with the code from here: https://github.com/python-openxml/python-docx/issues/30#issuecomment-38658149
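
To avoid writing the table loops out each time, the traversal can be wrapped in a small generator. This is a sketch, not part of python-docx: the name `iter_paragraphs` is hypothetical, and it assumes only that the container has `paragraphs` and `tables` attributes, which holds for both a document and a table cell. It walks rows rather than columns; either should reach every cell.

```python
def iter_paragraphs(container):
    """Yield every paragraph in `container`, descending into tables.

    `container` is anything with `paragraphs` and `tables` attributes,
    such as a python-docx document or a table cell.
    """
    yield from container.paragraphs
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # A cell can hold paragraphs and further tables, so recurse.
                yield from iter_paragraphs(cell)


# Hypothetical usage:
#   for p in iter_paragraphs(doc):
#       if 'sea' in p.text:
#           ...
```

One thing to watch for: merged cells may be reported more than once by `row.cells`, in which case their paragraphs would be yielded repeatedly. For plain find-and-replace this is usually harmless, since the second pass finds nothing left to replace.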

Answer 7

Answer by poin

I got a lot of help from the earlier answers, but for me the code below works the way the simple find-and-replace function in Word would. Hope this helps.

#!pip install python-docx
#start from here if python-docx is installed
from docx import Document
#open the document
doc=Document('./test.docx')
Dictionary = {"sea": "ocean", "find_this_text":"new_text"}
for i in Dictionary:
    for p in doc.paragraphs:
        if p.text.find(i)>=0:
            p.text=p.text.replace(i,Dictionary[i])
#save changed document
doc.save('./test.docx')

The above solution has limitations: 1) the paragraph containing "find_this_text" will become plain text without any formatting, 2) context controls that are in the same paragraph as "find_this_text" will be deleted, and 3) "find_this_text" inside either context controls or tables will not be changed.
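
A separate caveat applies to looping over the dictionary key by key: the output of one replacement can be picked up by a later key, and keys containing characters such as "." misbehave if they are compiled as regular expressions unescaped. A possible sketch using only the standard `re` module builds one escaped pattern from all keys and replaces in a single pass. The helper name `build_replacer` is an assumption made for illustration.

```python
import re


def build_replacer(mapping):
    """Return a function that applies every replacement in one pass."""
    if not mapping:
        return lambda text: text
    # Longest keys first so that a longer key wins over a shorter prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return lambda text: pattern.sub(lambda match: mapping[match.group(0)], text)


replace_all = build_replacer({"sea": "ocean", "ocean": "sea"})
print(replace_all("The sea meets the ocean."))  # The ocean meets the sea.
```

The returned function works on plain strings, so it can be applied to `p.text` or to the text of individual runs, depending on how much formatting needs to survive.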
