Java: regular expression to get an attribute from an HTML tag

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/1079423/

Date: 2020-08-11 23:16:46  Source: igfitidea

Regular expression to get an attribute from HTML tag

Tags: java, regex

Asked by Krishna Kumar

I am looking for a regular expression that can get me the src attribute (matched case-insensitively) from the following HTML snippets in Java.

<html><img src="kk.gif" alt="text"/></html>
<html><img src='kk.gif' alt="text"/></html>
<html><img src = "kk.gif" alt="text"/></html>

Accepted answer by DMI

One possibility:

String imgRegex = "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>";

This works if matched case-insensitively. It's a bit of a mess, and it deliberately ignores the case where quotes aren't used. Written without Java string escapes, the pattern is:

<img[^>]+src\s*=\s*['"]([^'"]+)['"][^>]*>

This matches:

  • <img
  • one or more characters that aren't > (i.e. possible other attributes)
  • src
  • optional whitespace
  • =
  • optional whitespace
  • starting delimiter of ' or "
  • image source (which may not include a single or double quote)
  • ending delimiter
  • although the expression can stop here, I then added:
    • zero or more characters that are not > (more possible attributes)
    • > to close the tag
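
The breakdown above can be tried out with a small sketch. The class name ImgSrcRegexDemo and the helper extractSrc are hypothetical names introduced here for illustration, and Pattern.CASE_INSENSITIVE is assumed to be what "matched case-insensitively" means in practice. The last sample, with an uppercase tag, is also an addition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
final class ImgSrcRegexDemo {
    // The answer's pattern, compiled with the case-insensitive flag (an assumption).
    static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Collects group 1 (the value between the quotes) for every img tag that matches.
    static List<String> extractSrc(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<html><img src=\"kk.gif\" alt=\"text\"/></html>",
            "<html><img src='kk.gif' alt=\"text\"/></html>",
            "<html><img src = \"kk.gif\" alt=\"text\"/></html>",
            "<html><IMG SRC=\"kk.gif\"></html>"
        };
        for (String sample : samples) {
            System.out.println(extractSrc(sample)); // each sample yields [kk.gif]
        }
    }
}
```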

Things to note:

  • If you want to include the src= as well, move the opening parenthesis further left :-)
  • This does not care about delimiter balancing or attribute values without delimiters, and it can also choke on badly-formed attributes (such as attributes that include > or image sources that include ' or ").
  • Parsing HTML with regular expressions like this is non-trivial, and at best a quick hack that works in the majority of cases.
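
To make these caveats concrete, here is a minimal sketch that runs the same pattern against three awkward inputs. The class name ImgSrcRegexLimits, the helper firstSrc, and the sample strings are all made up for illustration. The case-insensitive flag is again an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing where the accepted answer's pattern falls short.
final class ImgSrcRegexLimits {
    static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the first captured src value, or null when the pattern does not match.
    static String firstSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Attribute value without delimiters: no opening quote, so no match (null).
        System.out.println(firstSrc("<img src=kk.gif>"));
        // A '>' inside an earlier attribute stops [^>]+ before it reaches src (null).
        System.out.println(firstSrc("<img alt=\"a>b\" src=\"kk.gif\">"));
        // Unbalanced delimiters are accepted: opens with ' and closes with " (kk.gif).
        System.out.println(firstSrc("<img src='kk.gif\">"));
    }
}
```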

Answer by cletus

This question comes up a lot here.

Regular expressions are a bad way of handling this problem. Do yourself a favour and use an HTML parser of some kind.

Regexes are flaky for parsing HTML. You'll end up with a complicated expression that'll behave unexpectedly in corner cases that will happen sooner or later.

Edit: If your HTML is that simple, then:

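import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Backslashes must be doubled inside a Java string literal.
Pattern p = Pattern.compile("src\\s*=\\s*([\"'])?([^ \"']*)");
Matcher m = p.matcher(str);
if (m.find()) {
  String src = m.group(2); // group 2 is the attribute value without the quotes
}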

And there are any number of Java HTML parsers out there.
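
As one illustration of the parser route, the sketch below uses the HTML parser that ships with the JDK in javax.swing.text.html. This choice is an assumption made only so the example needs no extra dependency; the answer does not name a specific parser, and a dedicated third-party library may well be a better fit. The class name ImgSrcParserDemo and the helper extractSrc are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class: collects img src values with the JDK's built-in HTML parser.
final class ImgSrcParserDemo {
    static List<String> extractSrc(String html) {
        List<String> result = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // img has no closing tag, so it is reported as a simple tag.
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        result.add(src.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extractSrc("<html><img src='kk.gif' alt=\"text\"/></html>"));
    }
}
```

The parser takes care of quoting, whitespace, and attribute-name case, so none of that has to be encoded in a pattern.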

Answer by Mnementh

You mean the src attribute of the img tag? In that case you can go with the following:

<[Ii][Mm][Gg]\s*([Ss][Rr][Cc]\s*=\s*[\"'].*?[\"'])

That should work when src is the first attribute after img. The expression src='...' is in parentheses, so it is a capturing group and can be processed separately.
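
A short sketch of how that group could be read in Java follows. The class name FirstAttributeSrcDemo and the helper srcAttribute are hypothetical, and the second sample, with alt placed before src, is an addition that shows the pattern only matching when src directly follows the tag name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the character-class based pattern above.
final class FirstAttributeSrcDemo {
    // Case is handled by the character classes, so no flag is needed.
    static final Pattern IMG_SRC_FIRST = Pattern.compile(
            "<[Ii][Mm][Gg]\\s*([Ss][Rr][Cc]\\s*=\\s*[\"'].*?[\"'])");

    // Returns group 1, i.e. the whole src=... expression including quotes, or null.
    static String srcAttribute(String html) {
        Matcher m = IMG_SRC_FIRST.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints src = "kk.gif" (the name, the equals sign and the quoted value).
        System.out.println(srcAttribute("<html><IMG src = \"kk.gif\" alt=\"text\"/></html>"));
        // Prints null: alt comes first, so the pattern cannot reach src.
        System.out.println(srcAttribute("<img alt=\"text\" src=\"kk.gif\">"));
    }
}
```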

Answer by Shree Krishna

This answer is for Google searchers, since it comes too late for the original asker.

Copying cletus's code produced an error. Modifying his answer and passing the string src\\s*=\\s*([\"'])?([^\"']*) to Pattern.compile worked for me.

Here is the full example:

    // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
    String htmlString = "<div class=\"current\"><img src=\"img/HomePageImages/Paris.jpg\"></div>"; //Sample HTML

    String ptr = "src\\s*=\\s*([\"'])?([^\"']*)"; // backslashes doubled for the Java string literal
    Pattern p = Pattern.compile(ptr);
    Matcher m = p.matcher(htmlString);
    if (m.find()) {
        String src = m.group(2); //Result
    }
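
The example above stops at the first match. If a page contains several images, a loop over find() collects them all. The sketch below is one way to do that. The class name AllSrcValuesDemo, the helper allSrc, and the sample input are hypothetical, and Pattern.CASE_INSENSITIVE is added on the assumption that SRC in any case should match, as the question asks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: collects every src value instead of only the first.
final class AllSrcValuesDemo {
    // Same pattern as above; the flag is an addition. Note that it matches src
    // on any tag (script, iframe, ...), not only on img.
    static final Pattern SRC = Pattern.compile(
            "src\\s*=\\s*([\"'])?([^\"']*)", Pattern.CASE_INSENSITIVE);

    static List<String> allSrc(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = SRC.matcher(html);
        while (m.find()) {
            result.add(m.group(2)); // group 2 is the value without quotes
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [a.png, b.png]
        System.out.println(allSrc("<div><img SRC=\"a.png\"><img src='b.png'></div>"));
    }
}
```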