Original URL: http://stackoverflow.com/questions/1673025/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Rich text format (with formatting tags) in Excel to unformatted text
Asked by imagodei
I have approx. 12,000 cells in Excel containing RTF (including formatting tags). I need to parse them to get the unformatted text.
This is an example of the text in one of the cells:
{\rtf1\ansi\deflang1060\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset238
Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\fs24\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}}
\paperw11908\paperh16833\margl1800\margr1800\margt1440\margb1440\headery720\footery720
\deftab720\formshade\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn11908\pghsxn16833\marglsxn1800\margrsxn1800\margtsxn1440\margbsxn1440
\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs24\pard TPR 0160 000\par IPR 0160 000\par OB-R-02-28\par}
And all I really need is this:
TPR 0160 000
IPR 0160 000
OB-R-02-28
The problem with simply looping over the cells and removing the unnecessary formatting is that not everything in those 12,000 cells is as straightforward as this. I would need to manually inspect many different versions and write several variations, and even then there would still be a lot of manual work to do.
But if I copy the contents of one cell to an empty text document, save it as RTF, and then open it with MS Word, Word instantly parses the text and I get exactly what I want. Unfortunately, it's extremely inconvenient to do so for 12,000 cells.
So I was thinking about a VBA macro that moves the cell contents to Word, forces parsing, and then copies the result back to the originating cell. Unfortunately, I'm not really sure how to do it.
Does anybody have any idea? Or a different approach? I would be really grateful for a solution or a push in the right direction.
TNX!
Answered by Nossidge
If you did want to go down the route of using Word to parse the text, this function should help you out. As the comments suggest, you'll need a reference to the MS Word Object Library.
    Function ParseRTF(strRTF As String) As String
        Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library
        Dim f As Integer           'Variable to store the file I/O number

        'File path for a temporary .rtf file
        Const strFileTemp = "C:\TempFile_ParseRTF.rtf"

        'Obtain the next valid file I/O number
        f = FreeFile

        'Open the temp file and save the RTF string in it
        Open strFileTemp For Output As #f
        Print #f, strRTF
        Close #f

        'Open the .rtf file as a Word.Document
        Set wdDoc = GetObject(strFileTemp)

        'Read the now parsed text from the Word.Document
        ParseRTF = wdDoc.Range.Text

        'Close the Word connection first, so the file is no longer in use
        wdDoc.Close False
        Set wdDoc = Nothing

        'Delete the temporary .rtf file
        Kill strFileTemp
    End Function
You could call it for each of your 12,000 cells using something similar to this:
    Sub ParseAllRange()
        Dim rngCell As Range
        Dim strRTF As String

        For Each rngCell In Range("A1:A12000")
            'Parse the cell contents
            strRTF = ParseRTF(CStr(rngCell))

            'Output to the cell one column over
            rngCell.Offset(0, 1) = strRTF
        Next
    End Sub
The ParseRTF function takes about a second to run (on my machine at least), so for 12,000 cells this will work out at about 12,000 seconds, or roughly three hours and twenty minutes.
Having thought about this problem over the weekend, I was sure there was a better (quicker) solution for this.
I remembered the RTF capabilities of the clipboard, and realised that a class could be created that would copy RTF data to the clipboard, paste it into a Word document, and output the resulting plain text. The benefit of this solution is that the Word document object does not have to be opened and closed for each RTF string; it can be opened before the loop and closed after it.
Below is the code to achieve this. It is a Class module named clsRTFParser.
    'Note: these declarations are written for 32-bit VBA. 64-bit Office (VBA7)
    'requires the PtrSafe keyword and LongPtr for handles and pointers.
    Private Declare Function GlobalAlloc Lib "kernel32" _
        (ByVal wFlags&, ByVal dwBytes As Long) As Long
    Private Declare Function GlobalLock Lib "kernel32" _
        (ByVal hMem As Long) As Long
    Private Declare Function GlobalUnlock Lib "kernel32" _
        (ByVal hMem As Long) As Long
    Private Declare Function lstrcpy Lib "kernel32" _
        (ByVal lpString1 As Any, ByVal lpString2 As Any) As Long
    Private Declare Function OpenClipboard Lib "user32" _
        (ByVal Hwnd As Long) As Long
    Private Declare Function EmptyClipboard Lib "user32" () As Long
    Private Declare Function RegisterClipboardFormat Lib "user32" Alias _
        "RegisterClipboardFormatA" (ByVal lpString As String) As Long
    Private Declare Function SetClipboardData Lib "user32" _
        (ByVal wFormat As Long, ByVal hMem As Long) As Long
    Private Declare Function CloseClipboard Lib "user32" () As Long
    '---

    Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library

    Private Sub Class_Initialize()
        Set wdDoc = New Word.Document
    End Sub

    Private Sub Class_Terminate()
        wdDoc.Close False
        Set wdDoc = Nothing
    End Sub

    '---

    Private Function CopyRTF(strCopyString As String) As Boolean
        Dim hGlobalMemory As Long
        Dim lpGlobalMemory As Long
        Dim hClipMemory As Long
        Dim lngFormatRTF As Long

        'Allocate and copy string to memory
        hGlobalMemory = GlobalAlloc(&H42, Len(strCopyString) + 1)
        lpGlobalMemory = GlobalLock(hGlobalMemory)
        lpGlobalMemory = lstrcpy(lpGlobalMemory, strCopyString)

        'Unlock the memory and then copy to the clipboard
        If GlobalUnlock(hGlobalMemory) = 0 Then
            If OpenClipboard(0&) <> 0 Then
                Call EmptyClipboard

                'Save the data as Rich Text Format
                lngFormatRTF = RegisterClipboardFormat("Rich Text Format")
                hClipMemory = SetClipboardData(lngFormatRTF, hGlobalMemory)

                CopyRTF = CBool(CloseClipboard)
            End If
        End If
    End Function

    '---

    Private Function PasteRTF() As String
        Dim strOutput As String

        'Paste the clipboard data to the wdDoc and read the plain text result
        wdDoc.Range.Paste
        strOutput = wdDoc.Range.Text

        'Get rid of the new lines at the beginning and end of the document
        strOutput = Left(strOutput, Len(strOutput) - 2)
        strOutput = Right(strOutput, Len(strOutput) - 2)

        PasteRTF = strOutput
    End Function

    '---

    Public Function ParseRTF(strRTF As String) As String
        If CopyRTF(strRTF) Then
            ParseRTF = PasteRTF
        Else
            ParseRTF = "Error in copying to clipboard"
        End If
    End Function
You could call it for each of your 12,000 cells using something similar to this:
    Sub CopyParseAllRange()
        Dim rngCell As Range
        Dim strRTF As String

        'Create new instance of clsRTFParser
        Dim RTFParser As clsRTFParser
        Set RTFParser = New clsRTFParser

        For Each rngCell In Range("A1:A12000")
            'Parse the cell contents
            strRTF = RTFParser.ParseRTF(CStr(rngCell))

            'Output to the cell one column over
            rngCell.Offset(0, 1) = strRTF
        Next
    End Sub
I have simulated this using example RTF strings on my machine. For 12,000 cells it took two and a half minutes, a much more reasonable time frame!
Answered by juckobee
You can try to parse every cell with a regular expression and keep only the content you need.
Every RTF control code starts with "\" and ends at a space (or at the next non-alphanumeric character, such as another "\"), with no additional space in between. "{}" are used for grouping. If your text doesn't contain any braces, you can just remove them (the same goes for ";"). You are then left with your text and some unnecessary words such as "Arial", "Normal", etc. You can build a dictionary to remove those as well. After some tweaking, you will be left with only the text you need.
Look at http://www.regular-expressions.info/ for more information and a great tool for writing regular expressions (RegexBuddy; unfortunately it isn't free, but it's worth the money. AFAIR there is also a trial version).
UPDATE: Of course, I don't encourage you to do it manually for every cell. Just iterate through the active range. See this thread: SO: About iterating through cells in VBA
Personally, I'd give this idea a try:
    Sub Iterate()
        For Each Cell In ActiveSheet.UsedRange.Cells
            'Do something
        Next
    End Sub
And how to use RegExp's in VBA (Excel)?
See: Regex functions in Excel and Regex in VBA
Basically, you have to use the VBScript.RegExp object through COM.
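As a rough illustration of the regular-expression idea described in this answer, here is a minimal sketch that creates VBScript.RegExp by late binding (so no extra reference is needed). The function name StripControlWords and the three patterns are my own assumptions, not code from the answer. It only strips control words and braces; as the answer notes, the contents of font tables and stylesheets ("Arial", "Normal", and so on) are left behind and would still need to be removed separately.

```vba
Option Explicit

' Hypothetical helper (not from the original answer): strips RTF control
' words and group braces with VBScript.RegExp. Font-table and stylesheet
' text such as "Arial;" is NOT removed by this sketch.
Public Function StripControlWords(ByVal strRTF As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")   ' late binding through COM
    re.Global = True

    ' Turn paragraph marks into line feeds before the generic pass below.
    ' \b keeps this from matching longer words such as \pard or \paperw.
    re.Pattern = "\\par\b ?"
    strRTF = re.Replace(strRTF, vbLf)

    ' Remove control words: backslash, letters, optional signed number,
    ' and the single optional delimiter space.
    re.Pattern = "\\[a-zA-Z]+-?\d* ?"
    strRTF = re.Replace(strRTF, "")

    ' Remove the group braces.
    re.Pattern = "[{}]"
    StripControlWords = re.Replace(strRTF, "")
End Function
```

For example, StripControlWords("{\b Hello}\par World") gives "Hello", a line feed, then "World". Control symbols such as \'e9 are not handled here and would need an additional pattern.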
Answered by Wilson
Some of the solutions here require a reference to the MS Word Object Library. Playing with the cards I am dealt, I found a solution that does not rely on it. It strips RTF tags, and other fluff like font tables and stylesheets, all in VBA. It might be helpful to you. I ran it across your data, and other than the whitespace, I get the same output as what you expected.
Here is the code.
First, something to check if a string is alphanumeric or not. Give it a string that's one character long. This function is used to work out delimitation here and there.
    Public Function Alphanumeric(Character As String) As Boolean
        ' The hyphen is included so that negative numeric parameters count as part of a tag.
        If InStr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-", Character) Then
            Alphanumeric = True
        Else
            Alphanumeric = False
        End If
    End Function
Next up is a function to remove an entire group. I use this to remove font tables and other rubbish.
    Public Function RemoveGroup(RTFString As String, GroupName As String) As String
        Dim I As Long
        Dim J As Long
        Dim Count As Long

        I = InStr(RTFString, "{\" & GroupName)

        ' If the group was not found in the RTF string, then just return that string unchanged.
        If I = 0 Then
            RemoveGroup = RTFString
            Exit Function
        End If

        ' Otherwise, we will need to scan along, from the start of the group, until we find the end of the group.
        ' The group is delimited by { and }. Groups may be nested, so we need to count up if we encounter { and
        ' down if we encounter }. When that count reaches zero, then the end of the group has been found.
        J = I
        Do
            If Mid(RTFString, J, 1) = "{" Then Count = Count + 1
            If Mid(RTFString, J, 1) = "}" Then Count = Count - 1
            J = J + 1
        ' Also stop at the end of the string, in case the braces are unbalanced.
        Loop While Count > 0 And J <= Len(RTFString)

        RemoveGroup = Replace(RTFString, Mid(RTFString, I, J - I), "")
    End Function
Okay, and this function removes any tags.
    Public Function RemoveTags(RTFString As String) As String
        Dim L As Long
        Dim R As Long

        L = 1
        ' Search to the end of the string (<= so that the last character is not dropped).
        While L <= Len(RTFString)
            ' Append anything that's not a tag to the return value.
            While Mid(RTFString, L, 1) <> "\" And L <= Len(RTFString)
                RemoveTags = RemoveTags & Mid(RTFString, L, 1)
                L = L + 1
            Wend

            ' Search to the end of the tag.
            R = L + 1
            While Alphanumeric(Mid(RTFString, R, 1)) And R <= Len(RTFString)
                R = R + 1
            Wend
            L = R
        Wend
    End Function
We can remove curly braces in the obvious way:
    Public Function RemoveBraces(RTFString As String) As String
        RemoveBraces = Replace(RTFString, "{", "")
        RemoveBraces = Replace(RemoveBraces, "}", "")
    End Function
Once you have the functions above copy-pasted into your module, you can create a function that uses them to strip away any stuff you don't need or want. The following works perfectly in my case.
    Public Function RemoveTheFluff(RTFString As String) As String
        RemoveTheFluff = Replace(RTFString, vbCrLf, "")
        RemoveTheFluff = RemoveGroup(RemoveTheFluff, "fonttbl")
        RemoveTheFluff = RemoveGroup(RemoveTheFluff, "colortbl")
        RemoveTheFluff = RemoveGroup(RemoveTheFluff, "stylesheet")
        RemoveTheFluff = RemoveTags(RemoveBraces(RemoveTheFluff))
    End Function
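To run RemoveTheFluff over the whole column from the question, something along these lines might work. This is only a sketch: the Sub name StripAllRange is hypothetical, it assumes RemoveTheFluff and its helper functions above are in the same module, and it assumes the RTF strings sit in A1:A12000 of the active sheet with column B free for the output. Reading and writing the range as one Variant array is a design choice of mine to avoid 12,000 separate cell writes; it is not something the answer prescribes.

```vba
' Sketch only: relies on RemoveTheFluff (defined above) being in the same
' module. The range addresses are assumptions taken from the question.
Sub StripAllRange()
    Dim vData As Variant
    Dim i As Long

    ' Read the whole column into a 2-D Variant array in one go.
    vData = Range("A1:A12000").Value

    ' Strip the RTF markup from each entry in memory.
    For i = LBound(vData, 1) To UBound(vData, 1)
        vData(i, 1) = RemoveTheFluff(CStr(vData(i, 1)))
    Next i

    ' Write the results one column over, again in a single assignment.
    Range("B1:B12000").Value = vData
End Sub
```

Cells containing error values would make CStr fail, so those would need to be filtered out first.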
I hope this helps. I wouldn't use it in a word processor or anything, but it might do for scraping data if that's what you're doing.
Answered by Robert Thompson
Your post made it sound as if each RTF document was stored in a single Excel cell. If so, then a solution using the .NET Framework RichTextBox control will convert the RTF in each cell to plain text in two lines of code (after a little system configuration to get the right .tlb file that allows a reference to the .NET Framework). Put the cell value in rtfText and run:
    Set miracle = New System_Windows_Forms.RichTextBox
    With miracle
        .RTF = rtfText
        PlainText = .Text
    End With

