VBA: Extracting data from MS Word

Notice: the content of this page comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/505925/

Time: 2020-09-11 10:12:56  Source: igfitidea

Extracting data from MS Word

python vba ms-word word-vba pywin32

Asked by Technical Bard

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.

I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.

Which is the best way to do this:

  1. VBA macro from inside Word to create CSV and then upload to the DB?
  2. VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
  3. Python script via win32com then upload to DB?

The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from python.

EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though: all the text is in tables, and when I pull the strings out of the cells I want, I get a strange little box character at the end of each string. My code looks like:

sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum

num_rows = Application.ActiveDocument.Tables(2).Rows.Count

For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n

Close #fnum

What's up with the little control character box? Is some kind of character code coming across from Word?

Accepted answer by Joel Spolsky

Word has a little marker thingy that it puts at the end of every cell of text in a table.

It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.

Just use the Left() function to strip it out, i.e.

 Left(Target, Len(Target) - 1)
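
For anyone going the Python route instead, the same clean-up might look like the sketch below. It assumes the cell text has already been read into a Python string and that the end-of-cell marker shows up as Chr(13) followed by Chr(7), which is how it is commonly reported for Word; `strip_cell_marker` is a hypothetical helper name, not part of any library.

```python
# Assumed end-of-cell marker characters: carriage return (Chr(13)) and BEL (Chr(7)).
CELL_END_CHARS = "\r\x07"


def strip_cell_marker(text):
    """Return the cell text with any trailing CR / BEL characters removed.

    Note that this also drops trailing empty paragraphs inside the cell,
    which is usually what you want for a CSV export.
    """
    return text.rstrip(CELL_END_CHARS)
```

Because `rstrip` removes every trailing character from the given set, it does not matter whether the marker turns out to be one character or two.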

By the way, instead of

 num_rows = Application.ActiveDocument.Tables(2).Rows.Count
 For n = 1 To num_rows
      Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text

Try this:

 For Each row in Application.ActiveDocument.Tables(2).Rows
      Descr = row.Cells(2).Range.Text
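
The same row-by-row loop carries over to Python if the table is reached through win32com. This is only a sketch: it assumes `table` behaves like a Word Table COM object (an iterable `Rows` collection whose rows expose `Cells.Item(n).Range.Text`, 1-based), that the trailing marker is made of CR and BEL characters, and that columns 2 to 4 hold description, assignee and target as in the question. `clean` and `action_items_to_csv` are hypothetical names.

```python
import csv
import io


def clean(text):
    # Drop Word's trailing end-of-cell marker (assumed to be CR and/or BEL).
    return text.rstrip("\r\x07")


def action_items_to_csv(table, out):
    """Write description, assignee and target of each row to `out` as CSV.

    `table` is expected to look like a Word Table COM object.
    Rows whose target cell is empty are skipped, as in the VBA version.
    """
    writer = csv.writer(out)
    for row in table.Rows:
        descr, assign, target = (
            clean(row.Cells.Item(col).Range.Text) for col in (2, 3, 4)
        )
        if target:
            writer.writerow([descr, assign, target])
```

Using the `csv` module rather than joining with commas by hand means that descriptions containing commas or quotes are escaped properly. With pywin32 the table would presumably come from something like `doc.Tables.Item(2)`, and `out` could be a file opened with `newline=""`.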

Answer by nosklo

You could use OpenOffice. It can open Word files, and it can also run Python macros.

Answer by John Fouhy

Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:

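from win32com.client import Dispatch
word = Dispatch('Word.Application')
# Documents are opened through the Documents collection; raw strings keep
# backslashes such as \t from being read as escape sequences.
doc = word.Documents.Open(r'd:\stuff\myfile.doc')
doc.SaveAs(FileName=r'd:\stuff\text\myfile.txt', FileFormat=2)  # 2 = wdFormatText (plain text)
doc.Close()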

This is untested, but I think something like that will just open the file and save it as plain text (provided you can find the right fileformat) – you could then read the text into python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it off hand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
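
One possible way to grab the contents directly, without the intermediate text file, is to read the document's `Content` range over COM. The sketch below is untested against a real Word installation and assumes pywin32 is available; `read_document_text` is a hypothetical helper name. The Word application object is passed in rather than created inside the function, so nothing here depends on Windows until it is actually called.

```python
def read_document_text(word, path):
    """Open `path` read-only in the given Word application and return its text.

    `word` is expected to be a Word.Application COM object, for example the
    result of win32com.client.Dispatch('Word.Application').
    """
    doc = word.Documents.Open(path, ReadOnly=True)
    try:
        # Content is the Range covering the main body of the document.
        return doc.Content.Text
    finally:
        # Close without saving, even if reading the text failed.
        doc.Close(SaveChanges=False)
```

A call would look something like `read_document_text(Dispatch('Word.Application'), r'd:\stuff\myfile.doc')`. Text that comes from table cells will still carry the end-of-cell markers described in the accepted answer.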

Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there are some good examples there.

You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.

Answer by cbrulak

How about saving the file as XML, then using Python or something else to pull the data out of Word and into the database?
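
If the minutes are saved as .docx (an assumption; the answer does not say which XML flavour it has in mind), the file is a zip archive whose `word/document.xml` holds the tables as `w:tbl`, `w:tr` and `w:tc` elements. A rough sketch using only the standard library follows; `tables_from_document_xml` and `tables_from_docx` are hypothetical helper names, and nested tables or cells wrapped in content controls are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def tables_from_document_xml(xml_bytes):
    """Return every top-level table as a list of rows of cell strings."""
    body = ET.fromstring(xml_bytes).find(W + "body")
    tables = []
    for tbl in body.findall(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            cells = []
            for tc in tr.findall(W + "tc"):
                # One string per paragraph; paragraphs are joined by newlines.
                paras = ["".join(t.text or "" for t in p.iter(W + "t"))
                         for p in tc.findall(W + "p")]
                cells.append("\n".join(paras))
            rows.append(cells)
        tables.append(rows)
    return tables


def tables_from_docx(path):
    """Read word/document.xml out of a .docx archive and parse its tables."""
    with zipfile.ZipFile(path) as archive:
        return tables_from_document_xml(archive.read("word/document.xml"))
```

Going through the XML has the side benefit that no copy of Word is needed on the machine doing the extraction, and the cell text arrives without the end-of-cell marker.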

Answer by Dave Neeley

I'd say look at the related questions on the right --> The top one seems to have some good ideas for going the Python route.

Answer by Fionnuala

It is possible to programmatically save a Word document as HTML and to import the table(s) contained into Access. This requires very little effort.
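
For readers who are not using Access, the tables in the saved HTML can also be read with Python's standard library. This is a sketch of a substitute for the Access import, not the approach described above; `TableCollector` and `tables_from_html` are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the text of every table cell, grouped by table and row."""

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables, each a list of rows of strings
        self._cell = None  # text chunks while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._cell).split())
            self.tables[-1][-1].append(text)
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def tables_from_html(html):
    """Return all tables found in an HTML string."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables
```

The resulting rows are plain lists of strings, so they can be handed to whatever database layer is in use.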