VBA: How to extract text within a string of text
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/7086270/
How to extract text within a string of text
Asked by Brandon
I have a simple problem that I'm hoping to resolve without using VBA but if that's the only way it can be solved, so be it.
I have a file with multiple rows (all one column). Each row has data that looks something like this:
1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom
2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate
etc...
What I want is some way to extract the strings of numbers that begin with "gi|" and end with "|". For some rows this might mean as many as 5 gi numbers; for others it'll be just one.
I'd hope the output would look something like:
297848936,297338191
168010496
etc...
Answered by aevanko
Here is a very flexible VBA answer using the regex object. The function extracts every sub-group match it finds (the parts of the pattern inside parentheses) and joins them with whatever separator string you want (the default is ", "). You can find info on regular expressions here: http://www.regular-expressions.info/
You would call it like this, assuming that the first string is in A1:
=RegexExtract(A1,"gi[|](\d+)[|]")
Since this looks for all occurrences of "gi|" followed by a series of digits and then another "|", for the first line in your question this would give you this result:
297848936, 297338191
Just run this down the column and you're all done!
Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String, _
                      Optional separator As String = ", ") As String

    Dim allMatches As Object
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
    Dim i As Long, j As Long
    Dim result As String

    RE.pattern = extract_what
    RE.Global = True
    Set allMatches = RE.Execute(text)

    ' Collect every sub-group of every match, each prefixed by the separator
    For i = 0 To allMatches.count - 1
        For j = 0 To allMatches.Item(i).submatches.count - 1
            result = result & (separator & allMatches.Item(i).submatches.Item(j))
        Next
    Next

    ' Strip the leading separator
    If Len(result) <> 0 Then
        result = Right$(result, Len(result) - Len(separator))
    End If

    RegexExtract = result

End Function
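The output requested in the question has no space after the comma, and the function takes the separator as an optional third argument. As a small sketch (the Sub name DemoRegexExtract is made up for illustration, and it assumes the RegexExtract function above is already pasted into a standard module), the two variants can be compared in the Immediate window:

```vba
' Hypothetical demo Sub; relies on the RegexExtract function shown above.
Sub DemoRegexExtract()
    Dim sample As String

    ' First sample row from the question
    sample = "1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom"

    ' Default separator ", "
    Debug.Print RegexExtract(sample, "gi[|](\d+)[|]")        ' prints 297848936, 297338191

    ' Explicit "," separator, matching the format asked for in the question
    Debug.Print RegexExtract(sample, "gi[|](\d+)[|]", ",")   ' prints 297848936,297338191
End Sub
```

On a worksheet, the equivalent call would be =RegexExtract(A1,"gi[|](\d+)[|]",",").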
Answered by Zelgada
Here it is (assuming data is in column A)
=VALUE(LEFT(RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2),
FIND("|",RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2)) -1 ))
Not the nicest formula, but it will work to extract the number.
I just noticed that you have two values per row, with the output separated by commas. You will need to check whether there is a second match, a third match, and so on, to make it work for multiple numbers per cell.
For your exact sample (assuming a maximum of 2 values per cell), the following formula will work:
=IF(ISNUMBER(FIND("gi|",$A1,FIND("gi|", $A1)+1)),CONCATENATE(LEFT(RIGHT($A1,LEN($A1)
- FIND("gi|",$A1) - 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ),
", ",LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1)
- 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1) - 2))
-1 )),LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2),
FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ))
How's that for ugly? A VBA solution may be better for you, but I'll leave this here for you.
To go up to 5 numbers, well, study the pattern and nest it manually in the formula. It will get long!
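If the nested formula becomes unmanageable, the same "find the next gi|, then the next |" idea can be looped in a small user-defined function. The following is only a sketch under stated assumptions: the name ExtractGiNumbers and its separator parameter are made up here and appear in none of the answers, and it does not check that the extracted text is actually numeric.

```vba
' Hypothetical helper (not from the original answers): collects every piece
' of text that sits between "gi|" and the next "|" in a string.
Function ExtractGiNumbers(ByVal text As String, _
                          Optional ByVal separator As String = ",") As String
    Dim startPos As Long, endPos As Long
    Dim result As String

    startPos = InStr(1, text, "gi|")
    Do While startPos > 0
        startPos = startPos + 3                  ' skip past "gi|"
        endPos = InStr(startPos, text, "|")      ' position of the closing pipe
        If endPos = 0 Then Exit Do               ' no closing pipe: stop

        If Len(result) > 0 Then result = result & separator
        result = result & Mid$(text, startPos, endPos - startPos)

        ' Look for the next "gi|" after the closing pipe
        startPos = InStr(endPos, text, "gi|")
    Loop

    ExtractGiNumbers = result
End Function
```

Pasted into a standard module, it could be called from a cell as =ExtractGiNumbers(A1) and filled down the column, however many gi numbers a row contains.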
Answered by Jason S
I'd probably split the data first on the | delimiter using the Convert Text to Columns wizard.
In Excel 2007 that is on the Data tab, in the Data Tools group; choose Text to Columns, then specify Other: and | as the delimiter.
From the sample data you posted, it looks like the numbers will all end up in the same columns after you do this, so you could then just delete the columns you don't want.
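For anyone who prefers to script that wizard step, here is a minimal sketch using the Range.TextToColumns method. The macro name SplitOnPipe is made up, and the code assumes the data sits in column A of the active sheet starting at A1, with the columns to the right empty.

```vba
' Hypothetical macro name; assumes data in column A of the active sheet,
' starting at A1, and empty columns to the right of it.
Sub SplitOnPipe()
    Dim lastRow As Long

    With ActiveSheet
        ' Last used row in column A
        lastRow = .Cells(.Rows.Count, "A").End(xlUp).Row

        ' Split on "|" only, writing the pieces from column B onward so the
        ' original text in column A is left untouched
        .Range("A1:A" & lastRow).TextToColumns _
            Destination:=.Range("B1"), _
            DataType:=xlDelimited, _
            TextQualifier:=xlTextQualifierNone, _
            ConsecutiveDelimiter:=False, _
            Tab:=False, Semicolon:=False, Comma:=False, Space:=False, _
            Other:=True, OtherChar:="|"
    End With
End Sub
```

With the two sample rows from the question, the first gi number of each row should land in column C, and the second gi number of the first row in column G.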
Answered by Tiago Cardoso
Since the others presented solutions without VBA, I'll present one that does use it. It's your call whether to use it or not.
I just saw that @Issun presented a solution with regex, very nice! Either way, I'll present a 'modest' solution to the question, using only 'plain' VBA.
Option Explicit
Option Base 0

Sub findGi()

    Dim oCell As Excel.Range
    Set oCell = Sheets(1).Range("A1")

    'Loops through every row until empty cell
    While Not oCell.Value = ""
        oCell.Offset(0, 1).Value2 = GetGi(oCell.Value)
        Set oCell = oCell.Offset(1, 0)
    Wend

End Sub

Private Function GetGi(ByVal sValue As String) As String

    Dim sResult As String
    Dim vArray As Variant
    Dim vItem As Variant
    Dim iCount As Integer

    vArray = Split(sValue, "|")
    iCount = 0

    'Loops through the array...
    For Each vItem In vArray
        'Searches for the 'Gi' factor...
        If vItem Like "*gi" And UBound(vArray) > iCount + 1 Then
            'Concatenates the results...
            sResult = sResult & vArray(iCount + 1) & ","
        End If
        iCount = iCount + 1
    Next vItem

    'And removes the trailing comma
    If Len(sResult) > 0 Then
        sResult = Left(sResult, Len(sResult) - 1)
    End If

    GetGi = sResult

End Function