Regular expression to get an attribute from an HTML tag (Java)
Source: http://stackoverflow.com/questions/1079423/
License: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors and link to the original question.
Asked by Krishna Kumar
I am looking for a regular expression that can extract the src attribute (matched case-insensitively) from the following HTML snippets in Java.
<html><img src="kk.gif" alt="text"/></html>
<html><img src='kk.gif' alt="text"/></html>
<html><img src = "kk.gif" alt="text"/></html>
Accepted answer by DMI
One possibility (if matched case-insensitively):
String imgRegex = "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>";
It's a bit of a mess, and it deliberately ignores the case where quotes aren't used. To represent it without worrying about string escapes:
<img[^>]+src\s*=\s*['"]([^'"]+)['"][^>]*>
This matches:
- <img
- one or more characters that aren't > (i.e. possible other attributes)
- src
- optional whitespace
- =
- optional whitespace
- starting delimiter of ' or "
- image source (which may not include a single or double quote)
- ending delimiter
- although the expression can stop here, I then added:
  - zero or more characters that are not > (more possible attributes)
  - > to close the tag
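The breakdown above can be tried out with a small sketch like the following. The class name ImgSrcRegexSketch and the helper extractSrc are made up for illustration and are not part of the original answer; the use of Pattern.CASE_INSENSITIVE is an assumption about how the "matched case-insensitively" requirement would be met.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ImgSrcRegexSketch {
    // The regex from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the value of the first quoted src attribute, or null if nothing matches.
    static String extractSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<html><img src=\"kk.gif\" alt=\"text\"/></html>",
            "<html><img src='kk.gif' alt=\"text\"/></html>",
            "<html><img src = \"kk.gif\" alt=\"text\"/></html>"
        };
        for (String sample : samples) {
            System.out.println(extractSrc(sample));
        }
    }
}
```

Group 1 holds only the attribute value because the parentheses sit inside the quote delimiters.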
Things to note:
- If you want to include the src= as well, move the opening parenthesis further left :-)
- This does not care about delimiter balancing or attribute values without delimiters, and it can also choke on badly-formed attributes (such as attributes that include > or image sources that include ' or ").
- Parsing HTML with regular expressions like this is non-trivial, and at best a quick hack that works in the majority of cases.
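To make these caveats concrete, here is a rough sketch that runs the same regex against a few awkward inputs. The class ImgSrcLimitsSketch, its two helper methods, and the sample inputs are all invented for illustration; the second pattern is one guess at what "moving the opening parenthesis further left" could look like.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class ImgSrcLimitsSketch {
    // Captures only the attribute value.
    private static final Pattern VALUE_ONLY = Pattern.compile(
            "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Same expression with the opening parenthesis moved left, so the group includes src= and the quotes.
    private static final Pattern WITH_NAME = Pattern.compile(
            "<img[^>]+(src\\s*=\\s*['\"][^'\"]+['\"])[^>]*>",
            Pattern.CASE_INSENSITIVE);

    static String extractSrc(String html) {
        Matcher m = VALUE_ONLY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    static String extractSrcWithName(String html) {
        Matcher m = WITH_NAME.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Group now includes the attribute name and the delimiters.
        System.out.println(extractSrcWithName("<img src=\"kk.gif\">"));
        // Unquoted value: the regex requires a quote, so nothing matches.
        System.out.println(extractSrc("<img src=kk.gif>"));
        // A '>' inside an earlier attribute stops [^>]+ before it reaches src.
        System.out.println(extractSrc("<img alt=\"a > b\" src=\"kk.gif\">"));
        // Mismatched delimiters are accepted, since balancing is not checked.
        System.out.println(extractSrc("<img src=\"kk.gif'>"));
    }
}
```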
Answer by cletus
This question comes up a lot here.
Regular expressions are a bad way of handling this problem. Do yourself a favour and use an HTML parser of some kind.
Regexes are flaky for parsing HTML. You'll end up with a complicated expression that behaves unexpectedly in corner cases, and those corner cases will happen.
Edit: If your HTML is that simple then:
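import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 is the optional opening quote; group 2 is the attribute value.
Pattern p = Pattern.compile("src\\s*=\\s*([\"'])?([^ \"']*)");
Matcher m = p.matcher(str);
if (m.find()) {
    String src = m.group(2);
}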
And there are any number of Java HTML parsers out there.
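The answer does not name a specific parser. As one possible illustration, the sketch below uses the HTML parser that ships with the JDK in javax.swing.text.html, which is an assumption on my part rather than the answerer's recommendation; third-party parsers are a common choice as well. The class ImgSrcParserSketch and the method imgSources are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, for illustration only.
class ImgSrcParserSketch {
    // Collects the src attribute of every img tag the parser reports.
    static List<String> imgSources(String html) {
        List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // img has no closing tag, so it is reported as a "simple" tag.
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sources;
    }

    public static void main(String[] args) {
        System.out.println(imgSources("<html><body><img src='kk.gif' alt='text'></body></html>"));
    }
}
```

The parser takes care of quoting and whitespace itself, so no delimiter handling is needed in the calling code.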
Answer by Mnementh
You mean the src attribute of the img tag? In that case you can go with the following:
<[Ii][Mm][Gg]\s*([Ss][Rr][Cc]\s*=\s*[\"'].*?[\"'])
That should work. The expression src='...' is in parentheses, so it is a capturing group and can be processed separately.
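A quick sketch of how that capturing group might be read in Java follows. The class MnementhRegexSketch and the method firstSrcAttribute are hypothetical names. Note that, as written, the expression only allows whitespace between <img and src, so it appears to match only when src is the first attribute.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class MnementhRegexSketch {
    // The expression from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<[Ii][Mm][Gg]\\s*([Ss][Rr][Cc]\\s*=\\s*[\"'].*?[\"'])");

    // Returns the whole src=... text captured by group 1, or null if nothing matches.
    static String firstSrcAttribute(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // src is the first attribute, so the group captures it together with its quotes.
        System.out.println(firstSrcAttribute("<html><img src = \"kk.gif\" alt=\"text\"/></html>"));
        // src is not the first attribute here, so there is no match.
        System.out.println(firstSrcAttribute("<img alt=\"text\" src=\"kk.gif\">"));
    }
}
```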
Answer by Shree Krishna
This answer is for people arriving from Google, since the question is an old one.
Copying cletus's code gave me an error. Modifying his answer and passing the string src\\s*=\\s*([\"'])?([^\"']*) as the parameter to Pattern.compile worked for me.
Here is the full example:
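import java.util.regex.Matcher;
import java.util.regex.Pattern;

String htmlString = "<div class=\"current\"><img src=\"img/HomePageImages/Paris.jpg\"></div>"; // Sample HTML
// Backslashes are doubled so that the regex engine sees \s rather than a Java string escape.
String ptr = "src\\s*=\\s*([\"'])?([^\"']*)";
Pattern p = Pattern.compile(ptr);
Matcher m = p.matcher(htmlString);
if (m.find()) {
    String src = m.group(2); // Result: the attribute value without the quotes
}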