Extracting data from MS Word (VBA)
Note: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/505925/
Asked by Technical Bard
I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
- VBA macro from inside Word to create CSV and then upload to the DB?
- VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
- Python script via win32com then upload to DB?
The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from Python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though: all the text is in tables, and when I pull the strings out of the cells I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?
Accepted answer by Joel Spolsky
Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target)-1)
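For anyone reading the cells from Python instead of VBA, the same trimming can be sketched as follows. This is a minimal sketch and not part of the original answer: the helper name clean_cell_text is hypothetical, and it assumes the end-of-cell marker shows up in the string as a carriage return followed by Chr(7), which is how Word commonly reports cell text. It strips either or both from the end of the string.

```python
# Hypothetical helper (not from the original answer).
# Assumption: a table cell's text ends with "\r\x07"
# (carriage return + the end-of-cell mark), so both are trimmed.
CELL_END_CHARS = "\r\x07"


def clean_cell_text(raw):
    """Return the cell text without Word's trailing end-of-cell marker."""
    # rstrip removes any run of these characters at the end only,
    # so line breaks inside the cell text are left alone.
    return raw.rstrip(CELL_END_CHARS)


if __name__ == "__main__":
    print(repr(clean_cell_text("Send revised quote\r\x07")))
```

Note that rstrip also drops any empty trailing paragraphs in the cell, which is probably what you want for a CSV export.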
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row In Application.ActiveDocument.Tables(2).Rows
    Descr = row.Cells(2).Range.Text
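The same row-by-row walk could also be driven from Python, which was one of the options in the question. The sketch below is untested against a real Word instance. It assumes only that the table object exposes Rows.Count and Cell(row, col).Range.Text, as the Tables(2) object in the question does; the function names and the default column numbers are illustrative. It uses the csv module rather than joining with Chr(44), so a comma inside a description does not break the file.

```python
import csv
import io


def table_to_rows(table, columns=(2, 3, 4)):
    """Collect the text of the given columns for every row of a Word table.

    Assumption: `table` exposes Rows.Count and Cell(row, col).Range.Text,
    like the Tables(2) object used in the question.
    """
    rows = []
    for n in range(1, table.Rows.Count + 1):  # Word collections are 1-based
        # Trim the end-of-cell marker (assumed to be "\r\x07") from each value.
        values = [table.Cell(n, c).Range.Text.rstrip("\r\x07") for c in columns]
        if values[-1]:  # skip rows whose last column (Target) is empty
            rows.append(values)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text; fields containing commas get quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

With win32com, the table would presumably come from something like doc.Tables(2) on an opened document; the functions themselves work with any object that has the same shape.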
Answer by nosklo
You could use OpenOffice. It can open Word files, and it can also run Python macros.
Answer by John Fouhy
Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
from win32com.client import Dispatch
word = Dispatch('Word.Application')
doc = word.Documents.Open(r'd:\stuff\myfile.doc')  # raw strings keep the backslashes literal
doc.SaveAs(FileName=r'd:\stuff\text\myfile.txt', FileFormat=2)  # 2 is wdFormatText (plain text)
doc.Close()
word.Quit()
This is untested, but I think something like that will just open the file and save it as plain text (provided you can find the right file format). You could then read the text into Python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it offhand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there are some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate Python "signatures" for the COM functions available, and then look through them as a kind of documentation.
Answer by cbrulak
How about saving the file as XML, then using Python or something else to pull the data out of Word and into the database?
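A rough idea of what the Python side of that might look like, as a sketch only: it assumes the saved XML marks tables, rows, cells and text runs with elements whose local names are tbl, tr, tc and t, as WordprocessingML does, and it ignores namespaces so the exact schema version does not matter. The function names are made up for illustration, and nested tables are not handled.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def tables_from_word_xml(xml_text):
    """Return every table as a list of rows, each row a list of cell strings.

    Assumption: tables use tbl/tr/tc elements and text lives in t elements.
    """
    root = ET.fromstring(xml_text)
    tables = []
    for tbl in (e for e in root.iter() if local_name(e.tag) == "tbl"):
        rows = []
        for tr in (e for e in tbl.iter() if local_name(e.tag) == "tr"):
            cells = []
            for tc in (e for e in tr.iter() if local_name(e.tag) == "tc"):
                # A cell's text may be split across several runs; join them.
                texts = [t.text or "" for t in tc.iter() if local_name(t.tag) == "t"]
                cells.append("".join(texts))
            rows.append(cells)
        tables.append(rows)
    return tables
```

One nice side effect of going through XML is that the end-of-cell marker never appears in the text, so there is nothing to strip.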
Answer by Dave Neeley
Answer by Fionnuala
It is possible to programmatically save a Word document as HTML and to import the table(s) it contains into Access. This requires very little effort.
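If the target is not Access, the HTML route can also be read from Python. The following is a minimal sketch using only the standard library's html.parser, swapped in here in place of the Access import; the class and function names are illustrative, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <table> as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # list of text pieces while inside a <td>/<th>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> and <table> both match.
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            self.tables[-1][-1].append(text)
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def tables_from_html(html_text):
    """Parse HTML text and return all tables found in it."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.tables
```

Each table comes back as a list of rows, so the action-item rows can be filtered and inserted into whatever database the web interface uses.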