Java Jsoup 404 error

Question

Asked by mawus

I am new to Jsoup and can't understand why I receive a 404 error when trying to fetch a page, even though the page is accessible from a browser and I don't use any proxies. I have tried the following code:

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url).get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}

I receive the following exception:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at ro.pago.ucl2015.UCLWebParser.connect(UCLWebParser.java:27)
at ro.pago.ucl2015.UCLWebParser.main(UCLWebParser.java:16)

Answer 1

Accepted answer by Alkis Kalogeris

It seems that the site doesn't allow bots and responds with a 404 error when the request doesn't carry a User-Agent header it accepts. The code below works because it sets the User-Agent header:

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
               .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
               .referrer("http://www.google.com")              
               .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}

User-Agent
The Hypertext Transfer Protocol (HTTP) identifies the client software originating the request using a "User-Agent" header, even when the client is not operated by a user.

Referrer (I don't think this one is necessary)
HTTP referer (originally a misspelling of referrer) is an HTTP header field that identifies the address of the webpage (i.e. the URI or IRI) that linked to the resource being requested.
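
To see why the header matters, here is a minimal sketch that uses only the JDK rather than Jsoup. It assumes a server that answers 404 unless the User-Agent starts with "Mozilla/", which is only a guess at how the real site behaves. The local server and the names UserAgentCheck, startPickyServer and fetchStatus are hypothetical and exist purely for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

class UserAgentCheck {

    // Hypothetical stand-in for a site that hides its pages from clients
    // that do not send a browser-like User-Agent header.
    static HttpServer startPickyServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String agent = exchange.getRequestHeaders().getFirst("User-Agent");
                boolean looksLikeBrowser = agent != null && agent.startsWith("Mozilla/");
                // -1 as the length means the response has no body
                exchange.sendResponseHeaders(looksLikeBrowser ? 200 : 404, -1);
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the HTTP status code for url. Pass null as userAgent to keep
    // the JDK's default User-Agent ("Java/<version>").
    static int fetchStatus(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection(Proxy.NO_PROXY);
            if (userAgent != null) {
                conn.setRequestProperty("User-Agent", userAgent);
            }
            int status = conn.getResponseCode(); // does not throw on 404
            conn.disconnect();
            return status;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startPickyServer();
        String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
        String browserAgent =
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0";
        System.out.println("default agent: " + fetchStatus(url, null));
        System.out.println("browser agent: " + fetchStatus(url, browserAgent));
        server.stop(0);
    }
}
```

Comparing the two status codes against the real site is a quick way to confirm that the 404 comes from User-Agent filtering rather than from a wrong URL.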

For completeness, I would also advise you to set a timeout for your requests. In older jsoup releases the default is 3 seconds (current releases default to 30 seconds); if the server takes longer than the timeout, you will receive a SocketTimeoutException. Below is your code with the timeout set. A timeout of zero is treated as infinite.

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
               .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
               .referrer("http://www.google.com") 
               .timeout(1000 * 5) // in milliseconds, so this means 5 seconds
               .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}
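
The timeout behavior described above can also be observed without Jsoup. The following is a rough sketch using the JDK's HttpURLConnection, whose timeouts are likewise given in milliseconds, with zero meaning infinite. It assumes a deliberately slow local server as a stand-in for a slow site; the names TimeoutDemo, startSlowServer and respondsWithin are hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.SocketTimeoutException;
import java.net.URI;

class TimeoutDemo {

    // Hypothetical slow server: waits delayMillis before answering 200.
    static HttpServer startSlowServer(long delayMillis) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                exchange.sendResponseHeaders(200, -1); // -1: no response body
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if the server answered within timeoutMillis,
    // false if the connect or read timeout expired first.
    static boolean respondsWithin(String url, int timeoutMillis) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection(Proxy.NO_PROXY);
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            conn.getResponseCode(); // blocks until the status line arrives or the timeout hits
            conn.disconnect();
            return true;
        } catch (SocketTimeoutException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startSlowServer(500);
        String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
        System.out.println("2000 ms timeout, answered in time: " + respondsWithin(url, 2000));
        System.out.println("100 ms timeout, answered in time: " + respondsWithin(url, 100));
        server.stop(0);
    }
}
```

With Jsoup the equivalent knob is the timeout(...) call shown in the answer's code; the sketch only illustrates why a value shorter than the server's response time ends in an exception.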

Answer 2

Answer by Udit Kapahi

If you get a 404 response code for a URL, you can skip that URL.

Use ignoreHttpErrors(true) so that Jsoup does not throw an exception on an HTTP error response and returns whatever the server sent instead:

Document doc3 = null;
try {
    // url is the page address, assumed to be defined elsewhere
    doc3 = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
            .referrer("http://www.google.com")
            .ignoreHttpErrors(true) // a 4xx/5xx response no longer throws HttpStatusException
            .get();
} catch (IOException e) {
    // get() can still fail for other reasons, such as a timeout
    e.printStackTrace();
}
