How to search and highlight text in PDF using Excel VBA
Original source: http://stackoverflow.com/questions/22779394/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by rex
Right, so after hours of searching, I've come up with nothing for Excel VBA, which I find surprising. I found some VBScript that I tried to port over, but had no luck. I have managed to import the PDF text into sheets and search it, which is good, but this obviously won't let me highlight anything in the PDF itself.
What I'm trying to do is open PDF documents, search them for keywords, highlight those words, and save. I've got Adobe Acrobat X, so surely there is some sort of API that will let me do this from Excel VBA? Or am I going to have to use an open-source library like iText? I would prefer not to.
Some of the VBScript I saw involved finding text letter by letter, then drawing rectangles around it and colouring them with JavaScript, which seemed unnecessarily complicated (I couldn't get the port to work anyway...).
CLARIFICATION: I don't want to highlight the text in Excel; I want to highlight it in the PDF. I am only reading it into Excel to search for the text and see if it's in the PDF, since I don't know how else to do this.
PS: It would also be nice to be able to use OCR on image PDFs.
Answered by ReFran
OK, I played around a bit with the code I already had and with JS annotations. Below is a VBScript that can permanently mark/highlight a word. It can easily be changed to mark more than one word. The AcroJS help file lists some options for the marker's appearance.
I wrote the VBS code in a VBA-like style, so you can copy it directly into your IDE.
Enjoy, Reinhard
'// Save this as xxx.vbs and start it with a double click.
'// Acrobat must already be open with an active document, otherwise this raises an error.

wordTF = "Reinhard"   '// word to find
pdfText = ""

Set WshShell = CreateObject("Wscript.Shell")
WshShell.AppActivate("Adobe Acrobat")
WScript.Sleep 500

'// get the active document
Set AcroApp = CreateObject("AcroExch.App")
Set AVDoc = AcroApp.GetActiveDoc
Set PDDoc = AVDoc.GetPDDoc
Set AForm = CreateObject("AFormAut.App")   '// connect to the Forms API for later use

maxPages = PDDoc.GetNumPages
For p = 0 To maxPages - 1   '// start the page loop
    Set PdfPage = PDDoc.AcquirePage(p)   '// p = page number (zero based)
    Set PageHL = CreateObject("AcroExch.HiliteList")   '// created to get the page text
    PageHLRes = PageHL.Add(0, 9000)   '// <<-- SET in FILE! (start, end [9000 = all])
    Set PageSel = PdfPage.CreatePageHilite(PageHL)

    For i = 0 To PageSel.GetNumText - 1   '// start the word loop on the current page
        word = PageSel.GetText(i)   '// get one word
        pdfText = pdfText & word    '// gather the words on the page
        If InStr(word, wordTF) Then   '// InStr, because a "word" may come back as "word "
            MsgBox("add:""" & word & """")
            Set wordToHl = CreateObject("AcroExch.HiliteList")   '// list holding just this word
            wordToHl.Add i, 1   '// hilite the word Reinhard
            Set wordHl = PdfPage.CreateWordHilite(wordToHl)
            Set rect = wordHl.GetBoundingRect
            MsgBox("left:" & rect.Left & " bot:" & rect.bottom & " right:" & rect.Right & " top:" & rect.Top)
            AVDoc.SetTextSelection(wordHl)   '// highlight the word (not really needed)
            AVDoc.ShowTextSelect()           '// show highlighted text (not really needed)

            '// write and execute JS to mark the word permanently (too lazy to translate to JSO)
            '// the annotation goes on the current page p (the original hard-coded page 1)
            ex = " // set annot for text selection " & vbLf _
               & "var sqannot = this.addAnnot({type: ""Square"", page: " & p & ", " & vbLf _
               & "rect: [" & rect.left & ", " & rect.top & ", " & rect.right & ", " & rect.bottom & "], " & vbLf _
               & "name: ""p" & p & "i" & i & """});"
            MsgBox(ex)
            AForm.Fields.ExecuteThisJavaScript ex
        End If   '// word found
    Next   '// get next word

    MsgBox(pdfText)
    pdfText = ""
Next   '// get next page
MsgBox("Done!")
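The answer mentions that the AcroJS help file lists options for the marker's appearance. As a rough sketch of what that could look like on the JavaScript side, the helper below builds the object passed to addAnnot for a Square annotation with a semi-transparent yellow fill. The function name squareAnnotSpec and its parameters are hypothetical; the property names (strokeColor, fillColor, opacity, width) are taken from the Acrobat JavaScript Annotation object as I understand it, so check them against the SDK documentation for your Acrobat version.

```javascript
// Sketch only: squareAnnotSpec is a hypothetical helper, not part of Acrobat.
// page: zero-based page number
// rect: array of four numbers, in the same order the script above passes them
// name: unique annotation name, e.g. "p0i3"
function squareAnnotSpec(page, rect, name) {
  var yellow = ["RGB", 1, 1, 0]; // Acrobat color array form: [colorspace, r, g, b]
  return {
    type: "Square",
    page: page,
    rect: rect,
    name: name,
    strokeColor: yellow, // border colour
    fillColor: yellow,   // interior colour
    opacity: 0.4,        // assumed value; lets the text stay readable under the fill
    width: 0             // border width; assumed 0 so the mark reads like a highlight
  };
}
```

Inside Acrobat (for example in the JavaScript console) this would be used as `this.addAnnot(squareAnnotSpec(0, [100, 700, 180, 715], "p0i3"));`, where the coordinates are made-up values for illustration.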
Answered by Max Wyss
There are some ways to control Acrobat remotely. On Mac it is via AppleScript, and on Windows via VB/VBS (if I remember correctly). In either case, you can then run Acrobat JavaScript.
You can download the Acrobat SDK from the Adobe website and look through the Documentation folder.
Despite the not-so-good experiences, this is more or less the way to go: loop through all pages of the document, loop through all the "words" on the current page, and read the coordinates of the bounding box of each found word (also known as its "quads"). You may need to compare with neighbouring "words" to figure out whether they belong together. Finally, create a Highlight annotation using the quads you read as its coordinates.
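The loop described above can be sketched in Acrobat JavaScript roughly as follows. This is a minimal sketch, not a tested production script: highlightWord and its parameters are hypothetical names, and it assumes the Doc methods getPageNumWords, getPageNthWord, getPageNthWordQuads and addAnnot behave as described in the Acrobat JavaScript API reference. It matches single words only and does not attempt the "do these words belong together" comparison.

```javascript
// Sketch only: highlightWord is a hypothetical helper, not part of Acrobat.
// doc:    an Acrobat Doc object (pass `this` from the JavaScript console)
// target: the word to highlight, compared exactly (case-sensitive)
// Returns the number of Highlight annotations added.
function highlightWord(doc, target) {
  var count = 0;
  for (var p = 0; p < doc.numPages; p++) {           // loop through all pages
    var numWords = doc.getPageNumWords(p);
    for (var i = 0; i < numWords; i++) {             // loop through the words on page p
      // third argument true: strip punctuation and whitespace from the word
      var word = doc.getPageNthWord(p, i, true);
      if (word !== target) continue;
      doc.addAnnot({
        type: "Highlight",
        page: p,
        quads: doc.getPageNthWordQuads(p, i),        // coordinates of the found word
        strokeColor: ["RGB", 1, 1, 0],               // yellow
        contents: "Found: " + target
      });
      count++;
    }
  }
  return count;
}
```

From the Acrobat JavaScript console, `highlightWord(this, "Reinhard");` would run it against the active document. The same source could also be passed as a string to ExecuteThisJavaScript from VB/VBS.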
Another way to find words in a PDF document is to use the markup part of the Redaction tool (stop the redaction process before the redacted content is removed and the document is written back). Then run an Acrobat JavaScript that enumerates all the Redaction-type annotations and replaces them with equivalent Highlight annotations.
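A possible shape for that replacement script is sketched below. It assumes that redaction marks show up in getAnnots() with type "Redact" and carry a quads property, which is my reading of the Acrobat JavaScript Annotation documentation; verify this on your Acrobat version before relying on it. The function name redactMarksToHighlights is hypothetical.

```javascript
// Sketch only: redactMarksToHighlights is a hypothetical helper, not part of Acrobat.
// doc: an Acrobat Doc object (pass `this` from the JavaScript console)
// Returns the number of redaction marks that were replaced.
function redactMarksToHighlights(doc) {
  var annots = doc.getAnnots();   // null when the document has no annotations
  if (!annots) return 0;
  var converted = 0;
  for (var i = 0; i < annots.length; i++) {
    var a = annots[i];
    if (a.type !== "Redact") continue;   // assumed type name for redaction marks
    doc.addAnnot({
      type: "Highlight",
      page: a.page,
      quads: a.quads,                    // reuse the area marked for redaction
      strokeColor: ["RGB", 1, 1, 0]      // yellow
    });
    a.destroy();                         // remove the redaction mark so nothing gets redacted
    converted++;
  }
  return converted;
}
```

Run it as `redactMarksToHighlights(this);` after marking the search results for redaction but before applying the redactions.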
Answered by DanK
Excel cannot open the PDF file format legibly.
To do what you are trying to do, you will need some sort of PDF converter to translate the document into a format that Excel can read (such as xls or txt). Then you can use the normal .Find and .Format methods to complete your task.
Here are some free converters I found from a quick Google search (note that I have not used any of these, so you'll probably want to do additional research):
http://www.freepdfconvert.com/?
http://www.pdf995.com/download.html
http://www.cutepdf.com/
http://www.primopdf.com/
http://sourceforge.net/projects/pdfcreat...
Please note, however, that saving these back in the format you opened them in is most likely going to be impossible. It really all depends on how good the converter is. Ultimately, I don't think Excel is the tool you want for this task.