Note: this page is a side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/1670593/

Date: 2020-10-29 17:31:52  Source: igfitidea

Java: I have a big string of html and need to extract the href="..." text

Tags: java, html, regex, html-parsing

Asked by Legend

I have a string containing a large chunk of HTML and am trying to extract the link from the href="..." portion of the string. The href could be in one of the following forms:

<a href="..." />
<a class="..." href="..." />

I don't really have a problem with regex but for some reason when I use the following code:

String innerHTML = getHTML();
Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
if (m.find()) {
    // Get all groups for this match
    for (int i = 0; i <= m.groupCount(); i++) {
        String groupStr = m.group(i);
        System.out.println(groupStr);
    }
}

Can someone tell me what is wrong with my code? I did this kind of thing in PHP, but in Java I am somehow doing something wrong... What happens is that it prints the whole HTML string whenever I try to print it...

EDIT: Just so that everyone knows what kind of a string I am dealing with:

<a class="Wrap" href="item.php?id=43241"><input type="button">
    <span class="chevron"></span>
  </a>
  <div class="menu"></div>

Every time I run the code, it prints the whole string... That's the problem...

And about using jTidy... I'm on it but it would be interesting to know what went wrong in this case as well...

Answered by Kugel

.* 

This is a greedy operation that will consume any character, including the quotes.

Try something like:

"href=\"([^\"]*)\""
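
To see the difference concretely, here is a minimal sketch that runs both patterns over a string like the one in the question. The class name GreedyHrefDemo and the helper captures are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original question or answers.
public class GreedyHrefDemo {

    // Returns capture group 1 for every match of the regex in the input.
    public static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\">\n"
                + "  <span class=\"chevron\"></span>\n"
                + "</a>\n"
                + "<div class=\"menu\"></div>";

        // Greedy: .* runs to the end of the input and backtracks only as far
        // as the LAST quote, so the capture swallows most of the markup.
        System.out.println(captures("href=\"(.*)\"", html));

        // Negated class: [^\"]* cannot cross a quote, so the capture stops
        // at the quote that closes the attribute value.
        System.out.println(captures("href=\"([^\"]*)\"", html));
    }
}
```

With the greedy pattern, the capture runs from item.php all the way to the quote before the final closing quote of class="menu"; with the negated class it is just item.php?id=43241.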

Answered by Phil Ross

There are two problems with the code you've posted:

Firstly, the .* in your regular expression is greedy. This causes it to match all characters up to the last " character that can be found. You can make the match non-greedy by changing it to .*?.

Secondly, to pick up all the matches, you need to keep iterating with Matcher.find rather than looping over groups. Groups give you access to each parenthesized section of the regex. You, however, want every place where the whole regular expression matches.

Putting these together gives you the following code which should do what you need:

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);

while (m.find()) 
{
    System.out.println(m.group(1));
}

Answered by BalusC

Regex is great, but it is not the right tool for this particular purpose. Normally you want to use a stack-based parser for this. Have a look at Java HTML parser APIs like jTidy.

Answered by camickr

Use a built-in parser. Something like:

    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    kit.read(reader, doc, 0);

    HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);

    while (it.isValid())
    {
        SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();
        String href = (String)s.getAttribute(HTML.Attribute.HREF);
        System.out.println( href );
        it.next();
    }

Or use the ParserCallback:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;

public class ParserCallbackText extends HTMLEditorKit.ParserCallback
{
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        if (tag.equals(HTML.Tag.A))
        {
            String href = (String)a.getAttribute(HTML.Attribute.HREF);
            System.out.println(href);
        }
    }

    public static void main(String[] args)
        throws Exception
    {
        Reader reader = getReader(args[0]);
        ParserCallbackText parser = new ParserCallbackText();
        new ParserDelegator().parse(reader, parser, true);
    }

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}

The Reader could be a StringReader.
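
For the case in the question, where the HTML is already in a String, a rough sketch of the StringReader variant might look like this. The class name StringHrefCollector and the collect helper are invented for the example, and it gathers the links into a list instead of printing them; the Swing parser classes are the same ones used above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; assumes the HTML is already held in a String.
public class StringHrefCollector extends HTMLEditorKit.ParserCallback {

    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                // Skip anchors that have no href attribute.
                hrefs.add(href.toString());
            }
        }
    }

    // Parses the given HTML string and returns the href of each <a> start tag.
    public static List<String> collect(String html) {
        StringHrefCollector collector = new StringHrefCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            // Not expected when reading from memory; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return collector.hrefs;
    }

    public static void main(String[] args) {
        String html = "<a class=\"Wrap\" href=\"item.php?id=43241\">"
                + "<span class=\"chevron\"></span></a><div class=\"menu\"></div>";
        System.out.println(collect(html));
    }
}
```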

Answered by surajz

Another easy and reliable way to do it is by using Jsoup:

Document doc = Jsoup.connect("http://example.com/").get();
Elements links = doc.select("a[href]");
for (Element link : links){
  System.out.println(link.attr("abs:href"));
}

Answered by Lorenzo Boccaccia

You may use an HTML parser library. jTidy, for example, gives you a DOM model of the HTML, from which you can extract all "a" elements and read their "href" attribute.

Answered by Denis Tulskiy

"href=\"(.*?)\"" should also work, but I think Kugel's answer will run faster.