Java Jsoup 404错误

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/24475816/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-14 12:25:09  来源:igfitidea点击:

Jsoup 404 error

javahtmlconnectionhttp-status-code-404jsoup

提问by mawus

I am new with Jsoup but I can't understand why I receive a 404 error when trying to obtain a page, even if the page is accessible from browser and I don't use any proxys. I have tried with the following code:

我是 Jsoup 的新手,但我不明白为什么在尝试获取页面时会收到 404 错误,即使该页面可以从浏览器访问并且我不使用任何代理。我曾尝试使用以下代码:

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url).get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}

and I receive the exception message:

我收到异常消息:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at ro.pago.ucl2015.UCLWebParser.connect(UCLWebParser.java:27)
at ro.pago.ucl2015.UCLWebParser.main(UCLWebParser.java:16)

采纳答案by Alkis Kalogeris

It seems that the site doesn't allow bots and it will throw a 404 error response in case it doesn't locate the User-Agent headers. The below works as it sets the user agent headers

该站点似乎不允许机器人,如果它没有找到 User-Agent 标头,它将抛出 404 错误响应。以下内容在设置用户代理标头时起作用

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
               .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
               .referrer("http://www.google.com")              
               .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}

User Agent

The Hypertext Transfer Protocol (HTTP) identifies the client software originating the request, using a "User-Agent" header, even when the client is not operated by a user.


Referrer(I don't think this is necessary)

HTTP referer (originally a misspelling of referrer) is an HTTP header field that identifies the address of the webpage (i.e. the URI or IRI) that linked to the resource being requested.

用户代理

超文本传输​​协议 (HTTP) 使用“User-Agent”标头识别发起请求的客户端软件,即使客户端不是由用户操作也是如此。


推荐人(我不认为这是必要的)

HTTP referer(最初是referrer 的拼写错误)是一个HTTP 标头字段,用于标识链接到所请求资源的网页地址(即URI 或IRI)。

Just to provide full service I would advise you to set the timeout period for your requests. The default is 3 seconds, if the server takes longer than that you will receive an exception. Bellow follows your code with timeout setter. Set it to zero for the longest possible period.

为了提供全面的服务,我建议您为您的请求设置超时期限。默认为 3 秒,如果服务器花费的时间超过此时间,您将收到异常。Bellow 使用超时设置器跟随您的代码。在尽可能长的时间内将其设置为零。

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
               .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
               .referrer("http://www.google.com") 
               .timeout(1000*5) //it's in milliseconds, so this means 5 seconds.              
               .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
} 

回答by Udit Kapahi

If in case you are getting response code 404 , you can skip that url

如果您收到响应代码 404 ,则可以跳过该网址

Use ignoreHttpErrors(true), will surely solve your problem

使用ignoreHttpErrors(true),一定能解决你的问题

Document doc3 = null;
    try {
        doc3 = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                .referrer("http://www.google.com").ignoreHttpErrors(true).get();

    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }