Java regex to match <a> HTML tags that do not have a specific attribute

Question

Asked by user2287359

In Java I need to match <a> tags in a string that do not have an href attribute. For example, in the following string:

text <a class="aClass" href="#">link1</a> text <a class="aClass" target="_blank">link2</a> text

it should not match <a class="aClass" href="#">link1</a> (because it contains href), but it should match <a class="aClass" target="_blank">link2</a> (because it does not contain href).

I managed to build the RegEx to match my tags:

<a[^>]*>(.*?)</a>

but I cannot figure out how to exclude the tags that have href.

(I know I can use HTML parsers etc., but I need to do this with RegEx.)

Answer 1

Answered by Ro Yo Mi

Description

Be careful with regexes like <a[^>]* as these will also match other valid HTML tags which start with an a, such as <abbr> or <address>. Also, simply looking for the existence of the string href isn't good enough, as that string could be inside the value of another attribute, such as <a class="thishrefstuff"..., or part of another attribute name, like <a hreflang="en"...
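To make these two pitfalls concrete, here is a minimal sketch. The class and method names (NaiveAnchorChecks, naiveIsAnchor, strictIsAnchor, naiveHasHref) are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

public class NaiveAnchorChecks {
    // Naive opening-tag pattern: "<a" followed by anything up to ">".
    static final Pattern NAIVE_TAG = Pattern.compile("<a[^>]*>");
    // Stricter variant: the character right after "<a" must be whitespace or ">".
    static final Pattern STRICT_TAG = Pattern.compile("<a(?=\\s|>)[^>]*>");

    static boolean naiveIsAnchor(String tag) {
        return NAIVE_TAG.matcher(tag).matches();
    }

    static boolean strictIsAnchor(String tag) {
        return STRICT_TAG.matcher(tag).matches();
    }

    // Naive href test: a plain substring search over the whole tag.
    static boolean naiveHasHref(String tag) {
        return tag.contains("href");
    }

    public static void main(String[] args) {
        System.out.println(naiveIsAnchor("<abbr>"));                         // true  (wrong tag accepted)
        System.out.println(strictIsAnchor("<abbr>"));                        // false
        System.out.println(naiveHasHref("<a hreflang=\"en\">"));             // true  (no real href)
        System.out.println(naiveHasHref("<a class=\"thishrefstuff\">"));     // true  (no real href)
    }
}
```

The lookahead (?=\s|>) only fixes the tag-name problem; the substring problem needs the attribute-aware handling described next.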

The expression below will:

match all anchor tags <a...</a> which don't contain an href attribute.
enforce that the tag name is a and not a tag which simply starts with the letter a, like <address>.
ignore attributes which merely have the substring href embedded in their name, like the valid hreflang='en' or the made-up Attributehref="some value".
ignore all characters inside the value portion of all properly formatted attributes, like bogus='href=""'.

<a(?=\s|>)(?!(?:[^>=]|=(['"])(?:(?!\1).)*\1)*?\shref=['"])[^>]*>.*?<\/a>

Expanded

<a(?=\s|>) matches the open tag and ensures the next character after the tag name is either whitespace or the close bracket; this forces the name to be a and not something else
(?! starts the negative lookahead: if we find an href in this tag, then this tag isn't the one we're looking for
- (?: starts a non-capture group to move through all characters inside the tag
- [^>=] matches all non-tag-closing characters, which prevents the regex engine from leaving the tag, and non-equal signs, which prevents the engine from blindly continuing into attribute values
- | or
- =(['"]) matches an equal sign followed by an open double or single quote; the quote is captured into group 1 so it can be correctly paired later
- (?:(?!\1).)* matches all characters which are not the close quote that matches the open quote
- \1 matches the correct close quote
- )*? closes the non-capture group and repeats it as often as necessary until
- \shref=['"] matches the desired href attribute; the \s and =['"] ensure the attribute name is simply href
- ) closes the negative lookahead
[^>]*>.*?<\/a> matches the entire string from open to close
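The breakdown above can be mirrored in Java by assembling the same expression from commented string pieces. This is only a sketch: the class name AnchorWithoutHref and the helper firstMatch are hypothetical, and the pattern itself is unchanged from the answer (with backslashes and double quotes escaped for a Java string literal).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorWithoutHref {
    static final String REGEX =
          "<a(?=\\s|>)"                               // tag name must be exactly "a"
        + "(?!"                                       // open the negative lookahead
        +   "(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?"    // step over plain characters and whole quoted values
        +   "\\shref=['\"]"                           // a real href attribute
        + ")"                                         // close the negative lookahead
        + "[^>]*>.*?<\\/a>";                          // rest of the open tag through </a>

    static final Pattern PATTERN =
        Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the first anchor element without an href attribute, or null if there is none.
    static String firstMatch(String html) {
        Matcher m = PATTERN.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
                    + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
        System.out.println(firstMatch(html));
    }
}
```

Splitting the literal this way keeps each piece next to its explanation without changing how the engine sees the pattern.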

Java Code Example:

Input text

<abbr>RADIO</abbr> text <a class="aClass" href="#">link1</a> text <a bogus='href=""' class="aClass" target="_blank">link2</a> text

Code

If you're looking to use this in a replace function to remove non-href-anchor tags then just replace all matches with nothing.

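import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
  public static void main(String[] args) {
    // Sample input from above; substitute your own HTML string here.
    String sourcestring = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
        + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
    // Backslashes and double quotes are escaped for the Java string literal.
    Pattern re = Pattern.compile(
        "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?<\\/a>",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      // Group 0 is the whole match; group 1 sits inside the negative lookahead and carries no useful value.
      for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}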

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => <a bogus='href=""' class="aClass" target="_blank">link2</a>
        )

    [1] => Array
        (
            [0] => 
        )

)
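As for the replace use case mentioned in this answer (removing anchors that have no href), a minimal sketch could look like the following. The class name StripHreflessAnchors and the method strip are assumptions for illustration; note that replacing the whole match drops the link text as well as the tags.

```java
import java.util.regex.Pattern;

public class StripHreflessAnchors {
    // Same expression as in the answer, escaped for a Java string literal.
    static final Pattern HREFLESS_ANCHOR = Pattern.compile(
        "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?<\\/a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Removes every anchor element that lacks an href attribute.
    static String strip(String html) {
        return HREFLESS_ANCHOR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
                    + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
        // The href-less anchor disappears; the surrounding spaces on both sides remain.
        System.out.println(strip(html));
    }
}
```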

Answer 2

Answered by Explosion Pills

I find it odd that you would need to do it with regex, but you can use a negative lookahead.

<a(?![^>]+href).*?>(.*?)</a>
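A short sketch of how this simpler pattern behaves, under the assumption that it is used as-is with Matcher.find(). The class name SimpleLookahead and the helper firstLinkText are hypothetical. Because the lookahead searches for the bare substring href, it also rejects tags where href only appears inside another attribute name or value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleLookahead {
    static final Pattern PATTERN = Pattern.compile("<a(?![^>]+href).*?>(.*?)</a>");

    // Returns the text of the first anchor the pattern accepts, or null if nothing matches.
    static String firstLinkText(String html) {
        Matcher m = PATTERN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "text <a class=\"aClass\" href=\"#\">link1</a> text "
                    + "<a class=\"aClass\" target=\"_blank\">link2</a> text";
        System.out.println(firstLinkText(html));                          // link2
        // No href attribute here, but the substring "href" is enough to reject the tag.
        System.out.println(firstLinkText("<a hreflang=\"en\">x</a>"));    // null
    }
}
```

For input where href can only ever appear as a real attribute, this shorter pattern may be sufficient.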

Answer 3

Answered by Casimir et Hippolyte

I am not a Java expert, but you can try something like this:

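// \s must be written as \\s inside a Java string literal.
String regex = "(?i)<a(?>[^h>]++|(?<! )h++|h++(?!ref\\s*+=))*>((?>[^<]++|<(?!/a>))*)</a>";
String replacement = "";
// Strings are immutable, so keep the result of replaceAll.
str = str.replaceAll(regex, replacement);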
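To try this snippet end to end, it can be wrapped in a small class. This is a sketch under the assumption that the goal is to delete the href-less anchors from the question's sample string; the class name PossessiveRemoval and the method removeHreflessAnchors are made up for illustration.

```java
public class PossessiveRemoval {
    // Pattern from the answer, with \s escaped as \\s for the Java string literal.
    static final String REGEX =
        "(?i)<a(?>[^h>]++|(?<! )h++|h++(?!ref\\s*+=))*>((?>[^<]++|<(?!/a>))*)</a>";

    // Deletes every anchor element the pattern accepts (those without a space-preceded href=).
    static String removeHreflessAnchors(String html) {
        return html.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        String html = "text <a class=\"aClass\" href=\"#\">link1</a> text "
                    + "<a class=\"aClass\" target=\"_blank\">link2</a> text";
        // The second anchor is removed; the first, which has href, is kept.
        System.out.println(removeHreflessAnchors(html));
    }
}
```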

Answer 4

Answered by David says Reinstate Monica

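One option you have is to first match all <a> tags and then use a second regex to detect the ones that have href so that you can ignore them. So your code would look roughly like this (assuming html holds the input and java.util.regex.* is imported; the two patterns are illustrative, not exhaustive):

Pattern anyAnchor = Pattern.compile("<a(?=\\s|>)[^>]*>.*?</a>", Pattern.DOTALL);
Pattern hasHref = Pattern.compile("^<a[^>]*\\shref\\s*=");
Matcher tags = anyAnchor.matcher(html);
while (tags.find()) {
    String tag = tags.group();
    if (hasHref.matcher(tag).find()) continue; // skip anchors that carry href
    // do processing
}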
