Java: how to get HTML links from a URL

Question

Asked by careless_monkey

I'm just starting out on my networking assignment and I'm already stuck. The assignment asks me to check a user-provided website for links and to determine whether each link is active or inactive by reading the header info. So far, after googling, I only have this code, which retrieves the website. I don't understand how to go over this information and look for HTML links. Here's the code:

import java.net.*;
import java.io.*;

public class url_checker {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://yahoo.com");
        URLConnection yc = yahoo.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(yc.getInputStream()));
        String inputLine;
        int count = 0;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
        }
        in.close();
    }
}

Please help. Thanks!

Answer 1

Answered by Impiastro

You can also try jsoup, an HTML retriever and parser.

Document doc = Jsoup.parse(new URL("<url>"), 2000);

Elements resultLinks = doc.select("div.post-title > a");
for (Element link : resultLinks) {
    String href = link.attr("href");
    System.out.println("title: " + link.text());
    System.out.println("href: " + href);
}

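With this code you can list and analyze every link (a element) that is a direct child of a div with class "post-title" in the page at the given URL. To pick up every link on the page, use a broader selector such as "a[href]".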

Answer 2

Answered by Pooja Akshantal

You can try this:

URL url = new URL(link);
Reader reader= new InputStreamReader((InputStream) url.getContent());
new ParserDelegator().parse(reader, new Page(), true);

Then create a class called Page:

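import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class Page extends HTMLEditorKit.ParserCallback {

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            // Look the href up directly. Checking only the first attribute name
            // misses links whose href is not listed first, and fails on tags
            // that have no attributes at all.
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                String link = href.toString();
                // save link somewhere
            }
        }
    }
}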
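
For reference, here is a minimal self-contained sketch of the same callback idea that stores the links in a list and resolves relative ones against the page address. The names LinkCollector and collect, and the example.com addresses, are made up for illustration and are not part of the answer above. It uses only JDK classes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the href of every <a> tag as an absolute URL.
public class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final URL base;                       // address of the page being parsed
    private final List<String> links = new ArrayList<>();

    LinkCollector(URL base) {
        this.base = base;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t != HTML.Tag.A) {
            return;
        }
        Object href = a.getAttribute(HTML.Attribute.HREF);
        if (href == null) {
            return;                               // e.g. <a name="top"> has no link
        }
        try {
            // Resolve relative links such as "/news" against the page URL.
            links.add(new URL(base, href.toString()).toString());
        } catch (MalformedURLException e) {
            // Skip hrefs that are not valid URLs (e.g. unknown protocols).
        }
    }

    // Parses the HTML from the reader and returns the links in document order.
    static List<String> collect(Reader reader, URL base) throws IOException {
        LinkCollector collector = new LinkCollector(base);
        new ParserDelegator().parse(reader, collector, true);
        return collector.links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"/news\">News</a> "
                + "<a href=\"http://example.org/x\">X</a> "
                + "<a name=\"top\">no href</a></body></html>";
        URL base = new URL("http://example.com/index.html");
        System.out.println(collect(new StringReader(html), base));
    }
}
```

To read a live page instead of the test string, pass new InputStreamReader(url.openStream()) as the reader and the same url as the base.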

Answer 3

Answered by mtk

HtmlParser is what you need here. A lot of things can be done with it.

Answer 4

Answered by camickr

I don't get how to go over this information and look for HTML links
I cannot use any external library on my Assignment

You have a couple of options:

1) You can read the web page into an HTMLDocument. Then you can get an iterator from the document to find all the HTML.Tag.A tags. Once you find the anchor tags, you can get the HTML.Attribute.HREF value from the attribute set of each tag.

2) You can extend HTMLEditorKit.ParserCallback and implement the handleStartTag(...) method. Then, whenever you find an A tag, you can get the href attribute, which again contains the link. The basic code for invoking the parser callback is:

MyParserCallback parser = new MyParserCallback();

// simple test
String file = "<html><head><here>abc<div>def</div></here></head></html>";
StringReader reader = new StringReader(file);

// read a page from the internet
//URLConnection conn = new URL("http://yahoo.com").openConnection();
//Reader reader = new InputStreamReader(conn.getInputStream());

try
{
    new ParserDelegator().parse(reader, parser, true);
}
catch (IOException e)
{
    System.out.println(e);
}
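
Option 1 above can be sketched roughly as follows. This is only an illustration under a few assumptions: the class name LinkLister and the method extractLinks are made up, the input is a test string rather than a live page, and the "IgnoreCharsetDirective" property is set so that a charset declaration inside the page does not abort the read.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper: loads HTML into an HTMLDocument and walks its A tags.
public class LinkLister {

    static List<String> extractLinks(Reader reader)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Do not fail if the page declares its own charset in a meta tag.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(reader, doc, 0);

        List<String> links = new ArrayList<>();
        // The iterator visits each run of content that sits inside an <a> tag.
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
                it.isValid(); it.next()) {
            AttributeSet attrs = it.getAttributes();
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href=\"http://example.com/a\">A</a>"
                + "<p>text</p><a href=\"/b\">B</a></body></html>";
        System.out.println(extractLinks(new StringReader(html)));
    }
}
```

Compared with the callback in option 2, this builds the whole document in memory first, which is simpler to query but heavier for large pages.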

Answer 5

Answered by SammoSammo

You need to get the HTTP status code that the server returned with the response. A server will return a 404 if the page does not exist.

Check out this: http://download.oracle.com/javase/1.4.2/docs/api/java/net/HttpURLConnection.html

Most specifically, look at the getResponseCode method.
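
A minimal sketch of such a check, assuming a HEAD request is acceptable (so only the headers are fetched) and that any 2xx or 3xx status counts as "active". Both choices, along with the names LinkStatus, fetchStatus and isActiveCode and the 5-second timeouts, are assumptions made for illustration rather than requirements from the question.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: reports whether a link responds with a success status.
public class LinkStatus {

    // Assumption: 2xx and 3xx mean the link is active; 4xx and 5xx do not.
    static boolean isActiveCode(int code) {
        return code >= 200 && code < 400;
    }

    // Sends a HEAD request and returns the HTTP status code of the response.
    static int fetchStatus(String link) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(link).openConnection();
        conn.setRequestMethod("HEAD");   // headers only, no response body
        conn.setConnectTimeout(5000);    // milliseconds
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        String link = args.length > 0 ? args[0] : "http://yahoo.com";
        try {
            int code = fetchStatus(link);
            System.out.println(link + " -> " + code
                    + (isActiveCode(code) ? " (active)" : " (inactive)"));
        } catch (IOException e) {
            // No response at all (DNS failure, timeout, refused connection).
            System.out.println(link + " -> unreachable: " + e);
        }
    }
}
```

Some servers reject HEAD requests; in that case, switching the request method to GET is a reasonable fallback.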

Answer 6

Answered by Fabian Steeg

I would parse the HTML with a tool like NekoHTML. It basically fixes malformed HTML for you and lets you access it like XML. Then you can process the link elements and try to follow them, just as you did for the original page.

You can check out some sample code that does this.
