Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/17200485/

Time: 2020-11-01 01:17:41  Source: igfitidea

RegEx to match <a> html tags without specific attribute

java regex

Asked by user2287359

In Java I need to match <a> tags in a string that do not have an href attribute. For example, in the following string:

text <a class="aClass" href="#">link1</a> text <a class="aClass" target="_blank">link2</a> text

it should not match <a class="aClass" href="#">link1</a> (because it contains href), but it should match <a class="aClass" target="_blank">link2</a> (because it does not contain href).

I managed to build the RegEx to match my tags:

<a[^>]*>(.*?)</a>

but I cannot figure out how to exclude the tags that have an href.

(I know I can use HTML parsers etc., but I need to do this with RegEx.)

Answered by Ro Yo Mi

Description

Be careful with regexes like <a[^>]* as these will also match other valid html tags which start with an a, such as <abbr> or <address>. Also, simply looking for the existence of the string href isn't good enough, as that string could be inside the value of another attribute, such as <a class="thishrefstuff"..., or part of another attribute name, like <a hreflang="en"...
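
To make the two pitfalls above concrete, here is a minimal sketch. The class and method names are made up for illustration and are not part of the original answer; it only compares the loose opening-tag pattern with a stricter one and shows why a plain substring test for href is unreliable.

```java
import java.util.regex.Pattern;

class NaiveAnchorPitfalls {
    // The loose opening-tag pattern the text warns about.
    static final Pattern LOOSE_OPEN = Pattern.compile("<a[^>]*>");
    // Stricter variant: the tag name must be followed by whitespace or '>'.
    static final Pattern STRICT_OPEN = Pattern.compile("<a(?=\\s|>)[^>]*>");

    static boolean looseMatches(String tag) {
        return LOOSE_OPEN.matcher(tag).matches();
    }

    static boolean strictMatches(String tag) {
        return STRICT_OPEN.matcher(tag).matches();
    }

    // A plain substring test also fires on attribute values and on other attribute names.
    static boolean containsHrefText(String tag) {
        return tag.contains("href");
    }

    public static void main(String[] args) {
        System.out.println(looseMatches("<abbr>"));            // true: not an anchor, but matched
        System.out.println(strictMatches("<abbr>"));           // false
        System.out.println(strictMatches("<a class=\"x\">"));  // true
        System.out.println(containsHrefText("<a class=\"thishrefstuff\">")); // true, yet no href attribute
        System.out.println(containsHrefText("<a hreflang=\"en\">"));         // true, yet no href attribute
    }
}
```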

This expression will:

  • match all anchor tags <a...</a> which don't contain an href attribute.
  • enforce that the tag name is a, and not a tag which simply starts with the letter a, like <address>
  • ignore attributes which merely have the substring href embedded in the attribute name, like the valid hreflang='en' or the made-up Attributehref="some value".
  • ignore all characters inside the value portion of all properly formatted attributes, like bogus='href=""'

<a(?=\s|>)(?!(?:[^>=]|=(['"])(?:(?!\1).)*\1)*?\shref=['"])[^>]*>.*?<\/a>

Expanded

  • <a(?=\s|>) matches the open tag and ensures the next character after the tag name is either whitespace or the close bracket; this forces the name to be a and not something else
  • (?! starts the negative lookahead; if we find an href in this tag, then this tag isn't the one we're looking for
    • (?: starts a non-capturing group to move through all characters inside the tag
    • [^>=] matches any character that is neither a tag-closing bracket (which prevents the regex engine from leaving the tag) nor an equal sign (which prevents the engine from blindly matching through attribute values)
    • | or
    • =(['"]) matches an equal sign followed by an opening double or single quote. The quote is captured into group 1 so it can be correctly paired later
    • (?:(?!\1).)* matches all characters which are not the close quote that pairs with the open quote
    • \1 matches the correct close quote
    • )*? closes the non-capturing group and repeats it as often as necessary until
    • \shref=['"] matches the href attribute. The \s and =['"] ensure the attribute name is simply href
    • ) closes the negative lookahead
  • [^>]*>.*?<\/a> matches the entire string from open tag to close tag
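
The quote-pairing piece of the expression can be tried in isolation. The sketch below is only an illustration (the class and method names are hypothetical); it applies =(['"])(?:(?!\1).)*\1 on its own to show that a single-quoted value may contain double quotes, and vice versa, without ending early.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedValueDemo {
    // =(['"]) captures the opening quote as group 1; (?:(?!\1).)* consumes
    // characters until that same quote appears; \1 matches the closing quote.
    static final Pattern QUOTED_VALUE = Pattern.compile("=(['\"])(?:(?!\\1).)*\\1");

    // Returns the first quoted attribute value (including '=' and the quotes), or null.
    static String firstQuotedValue(String tag) {
        Matcher m = QUOTED_VALUE.matcher(tag);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The single-quoted value contains double quotes; they do not end the value.
        System.out.println(firstQuotedValue("<a bogus='href=\"\"' class=\"aClass\">"));
        // The double-quoted value contains a single quote; it does not end the value.
        System.out.println(firstQuotedValue("<a title=\"it's\">"));
    }
}
```

Because the whole quoted value is consumed as one unit, any href= text inside it is never seen by the \shref= check that follows.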

Java Code Example:

Input text

<abbr>RADIO</abbr> text <a class="aClass" href="#">link1</a> text <a bogus='href=""' class="aClass" target="_blank">link2</a> text

Code

If you're looking to use this in a replace function to remove non-href-anchor tags then just replace all matches with nothing.

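import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] args) {
    // Input text from above; inner double quotes are escaped for the Java literal.
    String sourcestring = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
        + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
    // Backslashes in the regex must be doubled inside a Java string literal.
    Pattern re = Pattern.compile(
        "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?</a>",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      // Group 0 is the whole match; group 1 is the quote capture used inside the lookahead.
      for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}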
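
The removal mentioned above (replacing every match with nothing) could look roughly like the following. This is a sketch rather than part of the original answer: the class name and the strip helper are made up for illustration, and the input is the sample text from this answer.

```java
import java.util.regex.Pattern;

class RemoveAnchorsWithoutHref {
    // Same expression as in the answer, with backslashes doubled for the Java literal.
    static final Pattern NO_HREF_ANCHOR = Pattern.compile(
        "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Deletes every anchor element that has no href attribute.
    static String strip(String html) {
        return NO_HREF_ANCHOR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
        // Only the link2 anchor is removed; <abbr> and the href anchor stay.
        System.out.println(strip(input));
    }
}
```

Note that this removes the link text together with the tag, because the expression matches from the open tag through </a>.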

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => <a bogus='href=""' class="aClass" target="_blank">link2</a>
        )

    [1] => Array
        (
            [0] => 
        )

)

Answered by Explosion Pills

I find it odd that you would need to do it with regex, but you can use a negative lookahead.

<a(?![^>]+href).*?>(.*?)</a>
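
A quick, hedged check of how this shorter pattern behaves on single tags (the class and method names below are made up for illustration). Its lookahead rejects any tag whose text contains href anywhere before the closing bracket, so tags with hreflang or with href inside an attribute value are skipped too.

```java
import java.util.regex.Pattern;

class ShortLookaheadDemo {
    static final Pattern SHORT = Pattern.compile("<a(?![^>]+href).*?>(.*?)</a>");

    static boolean found(String html) {
        return SHORT.matcher(html).find();
    }

    public static void main(String[] args) {
        System.out.println(found("<a class=\"x\" target=\"_blank\">link2</a>")); // true
        System.out.println(found("<a class=\"x\" href=\"#\">link1</a>"));        // false
        // No href attribute here, but the substring "href" still triggers the lookahead.
        System.out.println(found("<a hreflang=\"en\">link3</a>"));               // false
        System.out.println(found("<a class=\"thishrefstuff\">link4</a>"));       // false
    }
}
```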

Answered by Casimir et Hippolyte

I am not a java expert, but you can try something like this:

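// Backslashes must be doubled inside a Java string literal.
String regex = "(?i)<a(?>[^h>]++|(?<! )h++|h++(?!ref\\s*+=))*>((?>[^<]++|<(?!/a>))*)</a>";
String replacement = "";
// Strings are immutable, so the result of replaceAll has to be assigned.
str = str.replaceAll(regex, replacement);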

Answered by David says Reinstate Monica

One option you have is to first match all <a> tags and then use a second regex to match the ones that have an href, so that you can ignore them. So your pseudocode would look like:

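// Pseudocode: findAllATags and hasHref are placeholders, not real library methods.
List<String> aTags = findAllATags(html);   // first pass: collect every <a> tag
for (String aTag : aTags) {
    if (hasHref(aTag)) continue;           // second pass: skip tags that have an href
    // do processing
}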
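
One way the pseudocode above might be turned into runnable Java is sketched below. The class name, the helper method, and both patterns are assumptions made for illustration; in particular, the href check is deliberately simple and can be fooled by the text href= appearing inside another attribute's quoted value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassAnchorFilter {
    // Pass 1: every <a ...>...</a> element (the tag name must be exactly "a").
    static final Pattern ANCHOR = Pattern.compile(
        "<a(?=\\s|>)[^>]*>.*?</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Pass 2: simplified test for an href attribute inside the opening tag.
    static final Pattern HAS_HREF = Pattern.compile(
        "^<a\\b[^>]*\\shref\\s*=", Pattern.CASE_INSENSITIVE);

    static List<String> anchorsWithoutHref(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String tag = m.group();
            if (HAS_HREF.matcher(tag).find()) {
                continue; // has an href, so ignore it
            }
            result.add(tag); // "do processing" would go here
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a class=\"aClass\" target=\"_blank\">link2</a> text";
        for (String tag : anchorsWithoutHref(html)) {
            System.out.println(tag);
        }
    }
}
```

Splitting the work into two passes keeps each expression short, at the cost of scanning every anchor twice.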