C#: Get the text between multiple HTML tags using regex

Question

Asked by Ben

Using regex, I want to get the text between multiple DIV tags. For instance, the following:

<div>first html tag</div>
<div>another tag</div>

Would output:

first html tag
another tag

The regex pattern I am using only matches my last div tag and misses the first one. Code:

    static void Main(string[] args)
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
        string pattern = @"(<div.*>)(.*)(<\/div>)"; // verbatim string, so \/ is not treated as a C# escape

        MatchCollection matches = Regex.Matches(input, pattern);
        Console.WriteLine("Matches found: {0}", matches.Count);

        if (matches.Count > 0)
            foreach (Match m in matches)
                Console.WriteLine("Inner DIV: {0}", m.Groups[2]);

        Console.ReadLine();
    }

Output:

Matches found: 1
Inner DIV: This is ANOTHER test

Answer 1

Answered by coolmine

Replace your pattern with a non-greedy (lazy) match:

static void Main(string[] args)
{
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
    string pattern = @"<div.*?>(.*?)<\/div>"; // verbatim string; .*? stops at the first possible match

    MatchCollection matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: {0}", matches.Count);

    if (matches.Count > 0)
        foreach (Match m in matches)
            Console.WriteLine("Inner DIV: {0}", m.Groups[1]);

    Console.ReadLine();
}
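
To make the difference concrete, here is a minimal sketch that runs the greedy and the lazy pattern against the input from the question. The class name GreedyVsLazyDemo and the helper InnerTexts are hypothetical names chosen for this illustration; only Regex.Matches and Match.Groups come from the .NET library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GreedyVsLazyDemo
{
    // Hypothetical helper: returns the text captured by group 1 for every match.
    public static List<string> InnerTexts(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            results.Add(m.Groups[1].Value);
        return results;
    }

    static void Main()
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";

        // Greedy: the first .* swallows as much as it can, so the whole input
        // becomes a single match and only the last div's text is captured.
        List<string> greedy = InnerTexts(input, @"<div.*>(.*)</div>");

        // Lazy: .*? stops at the first '>' and the first closing tag,
        // which yields one match per div.
        List<string> lazy = InnerTexts(input, @"<div.*?>(.*?)</div>");

        Console.WriteLine("Greedy: {0} match(es)", greedy.Count); // Greedy: 1 match(es)
        Console.WriteLine("Lazy:   {0} match(es)", lazy.Count);   // Lazy:   2 match(es)
        foreach (string text in lazy)
            Console.WriteLine("Inner DIV: {0}", text);
    }
}
```

The lazy version prints both "This is a test" and "This is ANOTHER test", while the greedy version only ever sees the second one.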

Answer 2

Answered by Mayman

First of all, remember that a real HTML file will contain newline characters ("\n"), which you have not included in the string you are using to test your regex. By default, "." does not match "\n".

Second, taking your regex and wrapping it in a quantified group:

((<div.*>)(.*)(<\/div>))+ //This Regex will look for any amount of div tags, but it must see at least one div tag.

((<div.*>)(.*)(<\/div>))* //This regex will look for any amount of div tags, and it will not complain if there are no results at all.
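
The newline point above can be checked with a short sketch. It assumes a lazy pattern rather than the greedy one from the question, and the names NewlineDemo and CountDivs are hypothetical. RegexOptions.Singleline is the standard .NET option that makes "." match "\n" as well.

```csharp
using System;
using System.Text.RegularExpressions;

static class NewlineDemo
{
    // Hypothetical helper: counts the divs a lazy pattern finds under the given options.
    public static int CountDivs(string html, RegexOptions options)
    {
        return Regex.Matches(html, @"<div.*?>(.*?)</div>", options).Count;
    }

    static void Main()
    {
        // The second div's text spans two lines, as it might in a real HTML file.
        string html = "<div>first html tag</div>\n<div>another\ntag</div>";

        // Without Singleline, '.' does not match '\n', so the second div is missed.
        Console.WriteLine(CountDivs(html, RegexOptions.None));       // 1

        // With Singleline, '.' matches any character, including '\n'.
        Console.WriteLine(CountDivs(html, RegexOptions.Singleline)); // 2
    }
}
```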

Also a good place to look for this sort of information:

http://www.regular-expressions.info/reference.html

http://www.regular-expressions.info/refadv.html

Mayman

Answer 3

Answered by Tom Jacques

The short version is that you cannot do this correctly in all situations. There will always be cases of valid HTML for which a regular expression will fail to extract the information you want.

The reason is that HTML is described by a context-free grammar, a class of language strictly more expressive than what regular expressions can recognize.

Here's an example -- what if you have multiple stacked divs?

<div><div>stuff</div><div>stuff2</div></div>

Depending on greediness and where the match starts, the regexes listed in the other answers can end up grabbing any of these substrings:

<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>

because that's what regular expressions do when they try to parse HTML.

You can't write a regular expression that understands how to interpret all of the cases, because regular expressions are incapable of doing so. If you are dealing with a very specific constrained set of HTML, it may be possible, but you should keep this fact in mind.
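
As a rough illustration of the nested-div problem, the sketch below runs the lazy pattern from the first answer against the stacked example above. NestedDivDemo and InnerTexts are hypothetical names used only here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NestedDivDemo
{
    // Hypothetical helper: applies the lazy pattern and collects capture group 1.
    public static List<string> InnerTexts(string html)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<div.*?>(.*?)</div>", RegexOptions.Singleline))
            results.Add(m.Groups[1].Value);
        return results;
    }

    static void Main()
    {
        string html = "<div><div>stuff</div><div>stuff2</div></div>";

        // The outer <div> gets paired with the first </div> it meets, which
        // belongs to the inner div, so the first capture contains a stray tag.
        foreach (string text in InnerTexts(html))
            Console.WriteLine("Captured: {0}", text);
        // Captured: <div>stuff
        // Captured: stuff2
    }
}
```

Neither capture is the content of the outer div, and the first one is not the content of any element at all, because the pattern has no way to count how deeply the tags are nested.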

More information: https://stackoverflow.com/a/1732454/2022565

Answer 4

Answered by Craig

Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?

CsQuery also looks pretty useful (basically, you use CSS selector-style syntax to get the elements). See https://stackoverflow.com/a/11090816/618649.

CsQuery is basically meant to be "jQuery for C#," which is pretty much the exact search criteria I used to find it.

If you could do this in a web browser, you could easily use jQuery, with syntax similar to $("div").each(function(idx){ alert(idx + ": " + $(this).text()); }); (only you would output the result to the log or the screen, make a web service call with it, or do whatever else you need with it).
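
For comparison with the regex answers, here is a minimal sketch of the parser-based approach using the Html Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced by the project; the class name AgilityPackDemo and the helper DivTexts are hypothetical, and the XPath "//div" is just one way to select the elements.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class AgilityPackDemo
{
    // Hypothetical helper: returns the inner text of every div in the document.
    public static List<string> DivTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");
        if (nodes == null)
            return results;

        foreach (HtmlNode node in nodes)
            results.Add(node.InnerText);
        return results;
    }

    static void Main()
    {
        string html = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
        foreach (string text in DivTexts(html))
            Console.WriteLine("Inner DIV: {0}", text);
    }
}
```

Because the library builds a document tree first, attributes and line breaks inside the tags do not need any special handling in the query.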

Answer 5

Answered by Tri Nguyen Dung

I think this code should work:

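// Requires: using System.Collections.Generic; and using System.Text.RegularExpressions;
string htmlSource = "<div>first html tag</div><div>another tag</div>";
// [^>]*? skips any attributes on the opening tag; (.*?) lazily captures the inner text.
string pattern = @"<div[^>]*?>(.*?)</div>";
// Singleline lets '.' match newlines, so divs that span several lines are still found.
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
List<string> l = new List<string>();
foreach (Match match in matches)
{
    l.Add(match.Groups[1].Value);
}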

Answer 6

Answered by Mehdi Dehghani

Since the other answers don't mention HTML tags with attributes, here is my solution for that case:

// <TAG(.*?)>(.*?)</TAG>
// Example
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World

Answer 7

Answered by Partha Mondal

I hope the regex below will work:

<div.*?>(.*?)</div>

You will get your desired output:

This is a test
This is ANOTHER test
