Java: extracting anchor tags from HTML using Java

Question

Asked by Ebbu Abraham

I have several anchor tags in a text,

Input: <a href="http://stackoverflow.com" >Take me to StackOverflow</a>

Output: http://stackoverflow.com

How can I find all those input strings and convert them to the output strings in Java, without using a third-party API?

Answer 1

Answered by Bart Kiers

There are classes in the core API that you can use to get all href attributes from anchor tags (if present!):

import java.io.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
   public static void main(String [] args) throws Exception {

       String html =
           "<a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> " +
           "<!--                                                               " +
           "<a href=\"http://ignoreme.com\" >...</a>                           " +
           "-->                                                                " +
           "<a href=\"http://www.google.com\" >Take me to Google</a>           " +
           "<a>NOOOoooo!</a>                                                   ";

       Reader reader = new StringReader(html);
       HTMLEditorKit.Parser parser = new ParserDelegator();
       final List<String> links = new ArrayList<String>();

       parser.parse(reader, new HTMLEditorKit.ParserCallback(){
           public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
               if(t == HTML.Tag.A) {
                   Object link = a.getAttribute(HTML.Attribute.HREF);
                   if(link != null) {
                       links.add(String.valueOf(link));
                   }
               }
           }
       }, true);

       reader.close();
       System.out.println(links);
   }
}

which will print:

[http://stackoverflow.com, http://www.google.com]

Answer 2

Answered by Op De Cirkel

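import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorRegexDemo {
    public static void main(String[] args) {
        String test = "qazwsx<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>fdgfdhgfd"
                + "<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>dcgdf";

        // Group 1 captures the quoted href value, including the quotes.
        String regex = "<a href=(\"[^\"]*\")[^<]*</a>";

        Pattern p = Pattern.compile(regex);

        Matcher m = p.matcher(test);
        // Replace each whole anchor with its captured href ("X").
        System.out.println(m.replaceAll("$1"));
    }
}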

NOTE: All of Andrzej Doyle's points are valid. If your input contains more than simple <a href="X">Y</a> anchors, and you are sure it is parsable HTML, then you are better off with an HTML parser.

To summarize:

The regex I posted doesn't work if you have an <a> inside a comment (you can treat that as a special case).
It doesn't work if you have other attributes in the <a> tag (again, you can treat that as a special case).
There are many other cases where a regex won't work, and you cannot cover all of them with a regex, since HTML is not a regular language.

However, if your requirement is always to replace <a href="X">Y</a> with "X" without considering the context, then the code I've posted will work.
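
One of the limitations listed above (other attributes in the <a> tag) can be relaxed somewhat while staying with java.util.regex. The following is only a sketch under stated assumptions: href values are double-quoted, and the class name AnchorHrefRegex and method extractHrefs are made up for illustration. It collects the href values into a list with Matcher.find() instead of rewriting the input, and it still does not handle anchors inside comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class AnchorHrefRegex {
    // Assumption: href values are double-quoted; other attributes may
    // appear before or after href inside the <a ...> start tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the href value of every matching <a> start tag, in input order.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the value without the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String test = "qazwsx<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>fdgfdhgfd"
                + "<a class=\"x\" href=\"http://stackoverflow2.com\" >Take me to StackOverflow2</a>dcgdf"
                + "<a>no href</a>";
        System.out.println(extractHrefs(test));
    }
}
```

Anchors without an href are skipped simply because the pattern does not match them. Anything beyond this (single-quoted or unquoted values, comments, scripts) is where an HTML parser becomes the safer choice.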

Answer 3

Answered by Jigar Joshi

You can use JSoup

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<p>An <a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String linkHref = link.attr("href"); // "http://stackoverflow.com"

Answer 4

Answered by Kristen Gillard

The above example works perfectly. If you want to parse an HTML document, say, instead of concatenated strings, write something like this to complement the code above.

The existing code above, modified for this purpose, is shown as HtmlParser.java (originally HtmlParseDemo.java); HtmlPage.java below complements it. The content of the HtmlPage.properties file is given below.

The main.url property in the HtmlPage.properties file is: main.url=http://www.whatever.com/

That way you can just parse the URL that you're after. :-) Happy coding :-D

import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlParser
{
    public static void main(String[] args) throws Exception
    {
        String html = HtmlPage.getPage();

        Reader reader = new StringReader(html);
        HTMLEditorKit.Parser parser = new ParserDelegator();
        final List<String> links = new ArrayList<String>();

        parser.parse(reader, new HTMLEditorKit.ParserCallback()
        {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
            {
                if (t == HTML.Tag.A)
                {
                    Object link = a.getAttribute(HTML.Attribute.HREF);
                    if (link != null)
                    {
                        links.add(String.valueOf(link));
                    }
                }
            }
        }, true);

        reader.close();

        // create the header
        System.out.println("<html>\n<head>\n   <title>Link City</title>\n</head>\n<body>");

        // spit out the links and create href
        for (String l : links)
        {
            System.out.print("   <a href=\"" + l + "\">" + l + "</a>\n");
        }

        // create footer
        System.out.println("</body>\n</html>");
    }
}

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ResourceBundle;

public class HtmlPage
{
    public static String getPage()
    {
        StringWriter sw = new StringWriter();
        // Loads HtmlPage.properties from the classpath (same base name as this class).
        ResourceBundle bundle = ResourceBundle.getBundle(HtmlPage.class.getName());

        try
        {
            URL url = new URL(bundle.getString("main.url"));

            // A plain GET needs no request body, so setDoOutput(true) is not called.
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");

            // try-with-resources closes the reader and the underlying stream.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream())))
            {
                String line;

                while ((line = in.readLine()) != null)
                {
                    sw.append(line).append("\n");
                }
            }

        } catch (Exception e)
        {
            e.printStackTrace();
        }

        return sw.toString();
    }
}

For example, for http://ebay.com.au/ this will output the links as an HTML page that can be viewed in a browser. This is a subset, as there are a lot of links:

    <html>
    <head>
       <title>Link City</title>
    </head>
    <body>
       <a href="#mainContent">#mainContent</a>
       <a href="http://realestate.ebay.com.au/">http://realestate.ebay.com.au/</a>

Answer 5

Answered by pap

The most robust way (as has been suggested already) is to use regular expressions (java.util.regex), if you are required to build this without using third-party libraries.

The alternative is to parse the HTML as XML, either using a SAX parser to capture and handle each instance of an "a" element, or as a DOM Document that you then search using XPath (see http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/package-summary.html). This is problematic, though, since it requires the HTML page to be fully XML-compliant in its markup. That is a very dangerous assumption and not an approach I would recommend, since most "real" HTML pages are not XML-compliant.
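
To make the DOM plus XPath variant concrete, here is a minimal sketch using only javax.xml classes from the JDK. It assumes the input is well-formed XML without a namespace declaration; the class name XhtmlLinkExtractor and method extractHrefs are hypothetical names chosen for illustration. On markup that is not well-formed, the parse call throws, which is exactly the weakness described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer.
final class XhtmlLinkExtractor {
    // Parses the input as XML and returns every href attribute of an <a> element.
    // Throws an exception if the markup is not well-formed XML.
    static List<String> extractHrefs(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Selects the href attribute nodes of all <a> elements in the document.
        NodeList attrs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

        List<String> links = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            links.add(attrs.item(i).getNodeValue());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>"
                + "<!-- <a href=\"http://ignoreme.com\">...</a> -->"
                + "<a>NOOOoooo!</a>"
                + "</body></html>";
        System.out.println(extractHrefs(xhtml));
    }
}
```

Commented-out anchors are ignored because the XML parser treats them as comment nodes, and anchors without an href simply contribute no attribute node.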

Still, I would recommend also looking at existing frameworks out there built for this purpose (like JSoup, also mentioned above). No need to reinvent the wheel.
