VBA: Strip HTML from a string

Question

Asked by Ann Sanderson

I've tried a number of things, but nothing seems to work properly. I have an Access database and am writing code in VBA. I have a string of HTML source and want to strip all of the HTML code and tags out of it, so that I'm left with a plain-text string. What is the best way to do this?

Thanks

Answer 1

Answered by Alex K.

One way that's as resilient as possible to bad markup:

With CreateObject("htmlfile")
    .Open
    .write "<p>foo <i>bar</i> <u class='farp'>argle </zzzz> hello </p>"
    .Close
    MsgBox "text=" & .body.outerText
End With

Answer 2

Answered by Lior

' Note: Range is an Excel type, so this version is meant to be called from a worksheet.
Function StripHTML(cell As Range) As String
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")

    Dim sInput As String
    Dim sOut As String
    sInput = cell.Text

    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = "<[^>]+>" ' Matches HTML tags.
    End With

    ' Replace every tag with an empty string
    sOut = RegEx.Replace(sInput, "")
    StripHTML = sOut
    Set RegEx = Nothing
End Function

This might help you. Good luck.

Answer 3

Answered by SWa

It depends on how complex the HTML structure is and how much data you want out of it.

Depending on the complexity, you might get away with regular expressions, but for complex markup, trying to parse data out of HTML with regex is like trying to eat soup with a fork.
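
To make the trade-off concrete, here is a small sketch that runs the same input through both approaches from the earlier answers. The names CompareStrippers and sampleHtml are made up for illustration. The point is that the tag regex only deletes things that look like tags, so a character entity such as &amp; is left as-is, while the htmlfile parser decodes entities when it builds the document.

```vba
' Sketch only: CompareStrippers and sampleHtml are hypothetical names,
' not part of any answer in this thread.
Sub CompareStrippers()
    Dim sampleHtml As String
    sampleHtml = "<p>Tom &amp; Jerry</p>"

    ' Regex approach: deletes anything that looks like a tag, nothing more.
    ' The &amp; entity is not a tag, so it stays in the result untouched.
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "<[^>]+>"
    Debug.Print "regex:    " & re.Replace(sampleHtml, "")

    ' Parser approach: MSHTML builds a document, decoding entities as it goes,
    ' and innerText returns the text content of the body.
    Dim doc As Object
    Set doc = CreateObject("htmlfile")
    doc.Open
    doc.write sampleHtml
    doc.Close
    Debug.Print "htmlfile: " & doc.body.innerText
End Sub
```

Run it from the Immediate window and compare the two lines. If the regex output is good enough for your data, the simpler function is probably fine; otherwise the parser is the safer choice.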

You can use the htmlfile object to turn the flat HTML text into objects that you can interact with, for example:

Function ParseATable(url As String) As Variant

    Dim htm As Object, table As Object
    Dim data() As String, x As Long, y As Long
    Set htm = CreateObject("htmlfile")

    ' Download the page and load its markup into the document
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .send
        htm.body.innerHTML = .responseText
    End With

    With htm
        ' Take the first table on the page
        Set table = .getElementsByTagName("table")(0)
        ' Note: the array is sized for at most 10 cells per row
        ReDim data(1 To table.Rows.Length, 1 To 10)
        For x = 0 To table.Rows.Length - 1
            For y = 0 To table.Rows(x).Cells.Length - 1
                data(x + 1, y + 1) = table.Rows(x).Cells(y).innerText
            Next y
        Next x

        ParseATable = data

    End With
End Function

Answer 4

Answered by Zev Spitz

Using early binding (this needs a reference to the Microsoft HTML Object Library):

Public Function GetText(inputHtml As String) As String
    With New HTMLDocument
        .Open
        ' Parse the caller's HTML
        .write inputHtml
        .Close
        ' Return the text content of the parsed document
        GetText = .body.outerText
    End With
End Function

Answer 5

Answered by Bob Kennedy

An improvement over the regex function above. It finds &quot; and &nbsp; entities and replaces them with non-HTML equivalents (a quote character and a line break). Also, the original function had a problem with embedded UNC references (e.g. <\\server\share\folder\file.ext>): it would remove the entire UNC string because of the < at the beginning and the > at the end. This function fixes that, so the UNC path is kept in the string:

Function StripHTML(strString As String) As String
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")

    Dim sInput As String
    Dim sOut As String
    ' Drop the "<" in front of a UNC path so the tag pattern does not swallow the path
    sInput = Replace(strString, "<\\", "\\")

    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = "<[^>]+>" ' Matches HTML tags.
    End With

    sOut = RegEx.Replace(sInput, "")
    ' Map &nbsp; to a line break and &quot; to a quote, then restore the "<" before UNC paths
    sOut = Replace(sOut, "&nbsp;", vbCrLf)
    sOut = Replace(sOut, "&quot;", "'")
    StripHTML = Replace(sOut, "\\", "<\\")
    Set RegEx = Nothing
End Function

Answer 6

Answered by Francois Muller

I found a really simple solution to this. I currently run an Access database and use Excel forms to update it because of system restrictions and shared-drive privileges. When I call the data from Access I use PlainText(YourStringHere), which removes all the HTML parts and leaves only the text.
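
As a rough illustration, the call might look like this when run inside Access VBA. It assumes Access 2007 or later, where PlainText is available as a method of the Application object. PlainTextDemo and richText are placeholder names. Keep in mind that PlainText is designed for the markup Access stores in Rich Text fields, so results on arbitrary web-page HTML may differ.

```vba
' Sketch only: assumes Access 2007 or later and that the code runs inside Access.
' PlainTextDemo and richText are hypothetical names used for illustration.
Sub PlainTextDemo()
    Dim richText As String
    richText = "<div>Hello <b>world</b></div>"

    ' Application.PlainText returns the text with the rich-text markup removed
    Debug.Print Application.PlainText(richText)
End Sub
```

The same function can also be used in an Access query expression, for example PlainText([YourFieldName]), where YourFieldName stands for a Rich Text field in your table.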

Hope this works.
