RegEx to match <a> html tags without specific attribute
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/17200485/
Asked by user2287359
In Java I need to match <a> tags in a string that do not have an href attribute. For example, in the following string:
text <a class="aClass" href="#">link1</a> text <a class="aClass" target="_blank">link2</a> text
it should not match <a class="aClass" href="#">link1</a> (because it contains href), but it should match <a class="aClass" target="_blank">link2</a> (because it does not contain href).
I managed to build the RegEx to match my tags:
<a[^>]*>(.*?)</a>
but I cannot figure out how to exclude tags that have an href.
(I know I can use HTML parsers etc., but I need to do this with RegEx.)
Answered by Ro Yo Mi
Description
Be careful with regexes like <a[^>]*, as these will also match other valid HTML tags which start with an a, such as <abbr> or <address>. Also, simply looking for the existence of the string href isn't good enough, as that string could be inside the value of another attribute, such as <a class="thishrefstuff"..., or part of another attribute name, like <a hreflang="en"...
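Both pitfalls are easy to reproduce. The following is a minimal sketch, not part of the original answer; the class name NaivePatternPitfalls and its two helper methods are made up for illustration.

```java
import java.util.regex.Pattern;

public class NaivePatternPitfalls {
    // The loose opening-tag pattern from the question.
    static final Pattern LOOSE_OPEN_TAG = Pattern.compile("<a[^>]*>");

    // Hypothetical helper: true if the loose pattern accepts the whole tag.
    static boolean looseMatches(String tag) {
        return LOOSE_OPEN_TAG.matcher(tag).matches();
    }

    // Hypothetical helper: naive substring test for "href".
    static boolean naiveHasHref(String tag) {
        return tag.contains("href");
    }

    public static void main(String[] args) {
        // Neither of these is an anchor tag, yet the loose pattern accepts both.
        System.out.println(looseMatches("<abbr>"));
        System.out.println(looseMatches("<address>"));
        // Neither tag has an href attribute, yet the substring test reports one.
        System.out.println(naiveHasHref("<a class=\"thishrefstuff\">"));
        System.out.println(naiveHasHref("<a hreflang=\"en\">"));
    }
}
```

All four calls print true, which is why the expression in this answer pins down both the tag name and the attribute name.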
This expression will:
- match all anchor tags <a ...</a> which don't contain an href attribute
- enforce that the tag name is a, and not a tag which simply starts with the letter a, like <address>
- ignore attributes which merely have the substring href embedded in the attribute name, like the valid hreflang='en' or the made-up Attributehref="some value"
- ignore all characters inside the value portion of all properly formatted attributes, like bogus='href=""'
<a(?=\s|>)(?!(?:[^>=]|=(['"])(?:(?!\1).)*\1)*?\shref=['"])[^>]*>.*?<\/a>
Expanded
- <a(?=\s|>) matches the open tag and ensures the next character after the tag name is either whitespace or the close bracket; this forces the name to be a and not something else
- (?! starts the negative lookahead; if we find an href in this tag, then this tag isn't the one we're looking for
- (?: starts a non-capture group to move through all characters inside the tag
- [^>=] matches all non-tag-closing characters, which prevents the regex engine from leaving the tag, and all non-equal signs, which prevents the engine from blindly matching into attribute values
- | or
- =(['"]) matches an equal sign followed by an opening double or single quote; the quote is captured into group 1 so it can be correctly paired later
- (?:(?!\1).)* matches all characters which are not the close quote that pairs with the open quote
- \1 matches the correct close quote
- )*? closes the non-capture group and repeats it as often as necessary until
- \shref=['"] matches the href attribute; the \s and =["'] ensure the attribute name is simply href
- ) closes the negative lookahead
- [^>]*>.*?<\/a> matches the entire string from open tag to close tag
Java Code Example:
Input text
<abbr>RADIO</abbr> text <a class="aClass" href="#">link1</a> text <a bogus='href=""' class="aClass" target="_blank">link2</a> text
Code
If you're looking to use this in a replace function to remove non-href-anchor tags then just replace all matches with nothing.
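import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
    public static void main(String[] args) {
        // The input text shown above
        String sourcestring = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
        // Inside a Java string literal every regex backslash must be doubled
        Pattern re = Pattern.compile(
            "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?</a>",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()) {
            // Group 0 is the whole match; group 1 is the quote group inside the lookahead
            for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
                System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
            }
            mIdx++;
        }
    }
}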
Matches
$matches Array:
(
[0] => Array
(
[0] => <a bogus='href=""' class="aClass" target="_blank">link2</a>
)
[1] => Array
(
[0] =>
)
)
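The replace use mentioned above (removing anchors that have no href) could look roughly like this. It is a sketch under assumptions: the class name RemoveAnchorsWithoutHref and the helper strip are made up for illustration, and the expression is the one from this answer with its backslashes doubled for a Java string literal.

```java
import java.util.regex.Pattern;

public class RemoveAnchorsWithoutHref {
    // Same expression as above, escaped for a Java string literal.
    static final Pattern NO_HREF_ANCHOR = Pattern.compile(
        "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: delete every anchor element that has no href attribute.
    static String strip(String html) {
        return NO_HREF_ANCHOR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
        // The link2 anchor is removed; the abbr element and the link1 anchor stay.
        System.out.println(strip(input));
    }
}
```

Note that this removes the whole element, link text included, because the expression matches from the open tag through the close tag.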
Answered by Explosion Pills
I find it odd that you would need to do it with regex, but you can use a negative lookahead.
<a(?![^>]+href).*?>(.*?)</a>
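A short sketch of how this pattern might be used from Java. The class name SimpleLookahead and the helper linkTexts are made up for illustration. The second and third calls show a limitation that follows from the pattern itself: it rejects any tag whose text contains the substring href anywhere, including in hreflang or inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleLookahead {
    static final Pattern NO_HREF = Pattern.compile("<a(?![^>]+href).*?>(.*?)</a>");

    // Hypothetical helper: collect the link text (group 1) of every match.
    static List<String> linkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = NO_HREF.matcher(html);
        while (m.find()) {
            texts.add(m.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        // The example string from the question: only link2 is reported.
        System.out.println(linkTexts("text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a class=\"aClass\" target=\"_blank\">link2</a> text"));
        // These anchors have no href attribute, but the substring check skips them anyway.
        System.out.println(linkTexts("<a hreflang=\"en\">link3</a>"));
        System.out.println(linkTexts("<a class=\"thishrefstuff\">link4</a>"));
    }
}
```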
Answered by Casimir et Hippolyte
I am not a java expert, but you can try something like this:
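// Regex backslashes are doubled inside a Java string literal
String regex = "(?i)<a(?>[^h>]++|(?<! )h++|h++(?!ref\\s*+=))*>((?>[^<]++|<(?!/a>))*)</a>";
String replacement = "";
// Strings are immutable, so keep the value returned by replaceAll
str = str.replaceAll(regex, replacement);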
Answered by David says Reinstate Monica
One option you have is to first match all <a> tags and then use a regex to match the ones that have href so that you can ignore them. So your pseudocode would look like:
aTags = html.findAll(<a> tags);
for (String aTag : aTags) {
    if (aTag.hasHref()) continue;
    // do processing
}
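One way the pseudocode might be turned into runnable Java is sketched below. The class name TwoStepFilter, the helper anchorsWithoutHref, and both patterns are assumptions, not part of the original answer. The href check is deliberately simple and assumes attribute values contain neither a > character nor whitespace followed by href=.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoStepFilter {
    // Step 1: any <a ...>...</a> element; group 1 is the opening tag.
    static final Pattern ANCHOR = Pattern.compile(
        "(<a(?=\\s|>)[^>]*>).*?</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Step 2: simplified test for an href attribute inside the opening tag.
    static final Pattern HREF = Pattern.compile("\\shref\\s*=", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: return every anchor element without an href attribute.
    static List<String> anchorsWithoutHref(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            if (HREF.matcher(m.group(1)).find()) {
                continue; // this anchor has an href, so ignore it
            }
            result.add(m.group()); // "do processing" goes here
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(anchorsWithoutHref("text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a class=\"aClass\" target=\"_blank\">link2</a> text"));
    }
}
```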