HTML: Regex to extract the contents of a <div> tag

Question

Asked by Mrk Fldig

I'm having a bit of a brain freeze here, so I was hoping for some pointers. Essentially, I need to extract the contents of a specific div tag. Yes, I know that regex usually isn't recommended for this, but it's a simple web-scraping application where there are no nested divs.

I'm trying to match this:

    <div class="entry">
      <span class="title">Some company</span>
      <span class="description">
        <strong>Address: </strong>Some address
        <br /><strong>Telephone: </strong> 01908 12345
      </span>
    </div>

The simple VB code is as follows:

    Dim myMatches As MatchCollection
    Dim myRegex As New Regex("<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)
    Dim wc As New WebClient
    Dim html As String = wc.DownloadString("http://somewebaddress.com")
    RichTextBox1.Text = html
    myMatches = myRegex.Matches(html)
    MsgBox(html)
    'Search for all the words in a string
    Dim successfulMatch As Match
    For Each successfulMatch In myMatches
        MsgBox(successfulMatch.Groups(1).ToString)
    Next

Any help would be greatly appreciated.

Answer 1

Answered by Tim Pietzcker

Your regex works for your example. There are some improvements that should be made, though:

    <div[^<>]*class="entry"[^<>]*>(?<content>.*?)</div>

[^<>]* means "match any number of characters except angle brackets", ensuring that we don't accidentally break out of the tag we're in.

.*? (note the ?) means "match any number of characters, but as few as possible". This avoids matching from the first <div class="entry"> tag all the way to the last </div> in your page.
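
To see the difference between greedy and lazy matching concretely, here is a minimal VB.NET sketch. The two-entry sample string and the names LazyVersusGreedy, SampleHtml and CountMatches are made up for illustration; only Regex.Matches and RegexOptions.Singleline come from .NET.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module LazyVersusGreedy
    ' Hypothetical sample page with two entry divs (not from the original site).
    Public ReadOnly SampleHtml As String =
        "<div class=""entry""><span>First</span></div>" & vbLf &
        "<div class=""entry""><span>Second</span></div>"

    ' Count how many matches a pattern produces against the sample.
    Function CountMatches(pattern As String) As Integer
        Return Regex.Matches(SampleHtml, pattern, RegexOptions.Singleline).Count
    End Function

    Sub Main()
        ' Greedy .* runs on to the last </div>, so both entries collapse into one match.
        Console.WriteLine(CountMatches("<div[^<>]*class=""entry""[^<>]*>.*</div>"))   ' prints 1
        ' Lazy .*? stops at the first </div>, giving one match per entry.
        Console.WriteLine(CountMatches("<div[^<>]*class=""entry""[^<>]*>.*?</div>"))  ' prints 2
    End Sub
End Module
```

With only one entry on the page both patterns return a single match, which is why the greedy version can look correct during a quick test.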

But your regex itself should still have matched something. Perhaps you're not using it correctly?

I don't know Visual Basic, so this is just a shot in the dark, but RegexBuddy suggests the following approach:

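    ' Assumed here: SubjectString holds the downloaded HTML and ResultList is a List(Of String).
    ' Singleline lets "." match newlines, which is needed because the div's content spans several lines.
    Dim RegexObj As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>", RegexOptions.Singleline)
    Dim MatchResult As Match = RegexObj.Match(SubjectString)
    While MatchResult.Success
        ' Keep only the text captured by the named group "content".
        ResultList.Add(MatchResult.Groups("content").Value)
        MatchResult = MatchResult.NextMatch()
    End While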

I would recommend against taking the regex approach any further than this. If you insist, you'll end up with a monster regex like the following, which will only work if the form of the div's contents never varies:

    <div[^<>]*class="entry"[^<>]*>\s*
    <span[^<>]*class="title"[^<>]*>\s*
    (?<title>.*?)
    \s*</span>\s*
    <span[^<>]*class="description"[^<>]*>\s*
    <strong>\s*Address:\s*</strong>\s*
    (?<address>.*?)
    \s*<strong>\s*Telephone:\s*</strong>\s*
    (?<phone>.*?)
    \s*</span>\s*</div>

or (behold the joy of multiline strings in VB.NET):

    Dim RegexObj As New Regex(
        "<div[^<>]*class=""entry""[^<>]*>\s*" & Chr(10) & _
        "<span[^<>]*class=""title""[^<>]*>\s*" & Chr(10) & _
        "(?<title>.*?)" & Chr(10) & _
        "\s*</span>\s*" & Chr(10) & _
        "<span[^<>]*class=""description""[^<>]*>\s*" & Chr(10) & _
        "<strong>\s*Address:\s*</strong>\s*" & Chr(10) & _
        "(?<address>.*?)" & Chr(10) & _
        "\s*<strong>\s*Telephone:\s*</strong>\s*" & Chr(10) & _
        "(?<phone>.*?)" & Chr(10) & _
        "\s*</span>\s*</div>",
        RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

(Of course, now you need to store the results of MatchResult.Groups("title") etc.)
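
As a rough illustration of reading the named groups, the sketch below runs a pattern of this shape against the sample markup from the question. The module and member names (NamedGroupsSketch, SampleHtml, EntryRegex, ParseEntry) are hypothetical. The optional <br /> line in the pattern is an addition, not part of the pattern above: without it, the lazy address group would also swallow the <br /> that sits before the Telephone label in the sample.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module NamedGroupsSketch
    ' The sample markup from the question, joined with line feeds.
    Public ReadOnly SampleHtml As String = String.Join(vbLf, {
        "<div class=""entry"">",
        "  <span class=""title"">Some company</span>",
        "  <span class=""description"">",
        "  <strong>Address: </strong>Some address",
        "    <br /><strong>Telephone: </strong> 01908 12345",
        "  </span>",
        "</div>"})

    ' Same shape as the pattern above, plus one optional <br /> line (an addition).
    Public ReadOnly EntryRegex As New Regex(String.Join(vbLf, {
        "<div[^<>]*class=""entry""[^<>]*>\s*",
        "<span[^<>]*class=""title""[^<>]*>\s*",
        "(?<title>.*?)",
        "\s*</span>\s*",
        "<span[^<>]*class=""description""[^<>]*>\s*",
        "<strong>\s*Address:\s*</strong>\s*",
        "(?<address>.*?)",
        "\s*(?:<br\s*/?>)?",
        "\s*<strong>\s*Telephone:\s*</strong>\s*",
        "(?<phone>.*?)",
        "\s*</span>\s*</div>"}),
        RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

    ' Return title, address and phone of the first entry, or Nothing if nothing matches.
    Function ParseEntry(html As String) As String()
        Dim m As Match = EntryRegex.Match(html)
        If Not m.Success Then Return Nothing
        Return {m.Groups("title").Value, m.Groups("address").Value, m.Groups("phone").Value}
    End Function

    Sub Main()
        ' Prints: Some company | Some address | 01908 12345
        Console.WriteLine(String.Join(" | ", ParseEntry(SampleHtml)))
    End Sub
End Module
```

Any change to the markup (a reordered span, a missing label) makes ParseEntry return Nothing, which is the fragility the answer warns about.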

Answer 2

Answered by freefaller

~~Try using RegexOptions.Multiline instead of RegexOptions.Singleline~~

Thanks to @Tim for pointing out that the above doesn't work... my bad.

@Tim's answer is a good one and should be the accepted answer, but an extra thing that is stopping your code from working is that your pattern contains no capturing group, so there is nothing for Groups(1) to return.

Change...

    MsgBox(successfulMatch.Groups(1).ToString)

To...

    MsgBox(successfulMatch.Groups(0).ToString)
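
A small sketch of what each group index returns, assuming a made-up one-line sample; GroupIndexSketch, SampleHtml and GroupValue are illustrative names. Group 0 is always the whole match (tags included), and asking for a group number the pattern does not define yields an empty value rather than an exception.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupIndexSketch
    ' Hypothetical one-line sample, not taken from the original page.
    Public ReadOnly SampleHtml As String = "<div class=""entry"">hello</div>"

    ' Value of the numbered group for the first match of the pattern in SampleHtml.
    Function GroupValue(pattern As String, index As Integer) As String
        Return Regex.Match(SampleHtml, pattern, RegexOptions.Singleline).Groups(index).Value
    End Function

    Sub Main()
        ' The question's pattern: no parentheses, so only group 0 exists.
        Dim original As String = "<div.*?class=""entry"".*?>.*</div>"
        ' Same pattern with parentheses around the content, so group 1 exists.
        Dim captured As String = "<div.*?class=""entry"".*?>(.*?)</div>"

        Console.WriteLine(GroupValue(original, 0))         ' the whole div, tags included
        Console.WriteLine(GroupValue(original, 1).Length)  ' prints 0: group 1 is empty
        Console.WriteLine(GroupValue(captured, 1))         ' prints hello
    End Sub
End Module
```

So Groups(0) makes the message box show something, but it shows the whole tag; to get only the inner content, a capturing or named group as in the other answers is still needed.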

Answer 3

Answered by Ria

Use this one (the quotes are doubled as they would be inside a VB string literal):

    <div.*?class=""entry"".*?>(?<divBody>.*)</div>

and get the group named divBody.

But be careful: this does not work if the string contains another div node (and there seems to be no way to resolve this with a regex). For your purposes, XSLT may be useful.
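
The nested-div problem can be reproduced with a short sketch. It uses the lazy form of the pattern from the first answer with the divBody group name; the sample string and the names NestedDivSketch, NestedHtml and DivBody are invented for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NestedDivSketch
    ' Hypothetical entry that contains a nested div.
    Public ReadOnly NestedHtml As String =
        "<div class=""entry"">outer <div>inner</div> tail</div>"

    ' Return the divBody capture of the first entry div, or Nothing if there is none.
    Function DivBody(html As String) As String
        Dim m As Match = Regex.Match(
            html,
            "<div[^<>]*class=""entry""[^<>]*>(?<divBody>.*?)</div>",
            RegexOptions.Singleline)
        Return If(m.Success, m.Groups("divBody").Value, Nothing)
    End Function

    Sub Main()
        ' The lazy match ends at the inner </div>, so the capture is cut short.
        Console.WriteLine(DivBody(NestedHtml))  ' prints: outer <div>inner
    End Sub
End Module
```

The capture stops at the first closing tag it meets, so " tail" is lost; the greedy .* in the pattern above has the opposite problem and overshoots when several entries follow each other.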

Answer 4

Answered by user3227043

Really good article. Please see the below attached results from eclipse
