Java Jsoup 404 error
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/24475816/
Asked by mawus
I am new to Jsoup and can't understand why I receive a 404 error when trying to fetch a page, even though the page is accessible from a browser and I don't use any proxies. I have tried the following code:
private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url).get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}
and I receive the following exception:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
    at ro.pago.ucl2015.UCLWebParser.connect(UCLWebParser.java:27)
    at ro.pago.ucl2015.UCLWebParser.main(UCLWebParser.java:16)
Accepted answer by Alkis Kalogeris
It seems that the site doesn't allow bots and responds with a 404 error when it does not find a User-Agent header it accepts. The code below works because it sets the User-Agent header:
private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
                // Browser-like User-Agent; the site answered 404 without it
                .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                // Optional; probably not required
                .referrer("http://www.google.com")
                .get();
    } catch (NullPointerException e) {
        e.printStackTrace();
    } catch (HttpStatusException e) {
        // Thrown for HTTP error statuses such as 404
        e.printStackTrace();
    } catch (IOException e) {
        // Any other network or I/O failure
        e.printStackTrace();
    }
    return doc;
}
User Agent
The Hypertext Transfer Protocol (HTTP) identifies the client software originating the request, using a "User-Agent" header, even when the client is not operated by a user.
Referrer (I don't think this is necessary)
HTTP referer (originally a misspelling of referrer) is an HTTP header field that identifies the address of the webpage (i.e. the URI or IRI) that linked to the resource being requested.
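To see the effect of the User-Agent header without touching the real site, here is a minimal sketch that uses only the JDK. It is not Jsoup and not the site's actual logic: `UserAgentDemo`, `startStubServer` and `statusFor` are hypothetical names, and the rule "return 404 unless the User-Agent starts with Mozilla/" is an assumption made purely for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class UserAgentDemo {

    // Hypothetical stand-in for the real site: answers 200 only when the
    // User-Agent starts with "Mozilla/", and 404 otherwise.
    static HttpServer startStubServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String ua = exchange.getRequestHeaders().getFirst("User-Agent");
                boolean browserLike = ua != null && ua.startsWith("Mozilla/");
                byte[] body = (browserLike ? "ok" : "not found").getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(browserLike ? 200 : 404, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends a GET and returns the HTTP status code. Passing null for userAgent
    // leaves the JDK default in place, which identifies itself as "Java/<version>".
    static int statusFor(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            if (userAgent != null) {
                conn.setRequestProperty("User-Agent", userAgent);
            }
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startStubServer();
        String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
        try {
            System.out.println("default User-Agent: " + statusFor(url, null));
            System.out.println("browser User-Agent: " + statusFor(url,
                    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0"));
        } finally {
            server.stop(0);
        }
    }
}
```

Against this stub, the request with the default Java User-Agent is answered with 404 and the one with the browser string with 200, which mirrors what the question describes.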
For completeness, I would also advise you to set a timeout for your requests. The default was 3 seconds when this answer was written (recent Jsoup releases default to 30 seconds); if the server takes longer than the timeout, you will receive an exception. Your code with a timeout set follows below. A timeout of zero means no limit, so the request waits indefinitely.
private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                .referrer("http://www.google.com")
                .timeout(1000 * 5) // milliseconds, so this is 5 seconds; 0 disables the timeout
                .get();
    } catch (NullPointerException e) {
        e.printStackTrace();
    } catch (HttpStatusException e) {
        // Thrown for HTTP error statuses such as 404
        e.printStackTrace();
    } catch (IOException e) {
        // Includes the timeout exception raised when the server is too slow
        e.printStackTrace();
    }
    return doc;
}
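If you want to watch a timeout fire without depending on a slow remote server, the following is a rough JDK-only sketch. It swaps Jsoup for `HttpURLConnection`, whose connect and read timeouts stand in for Jsoup's `timeout(int)`. `TimeoutDemo`, `startSlowServer` and `fetchStatus` are hypothetical names, and the 500 ms server delay is an arbitrary assumption.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class TimeoutDemo {

    // Hypothetical slow endpoint: waits delayMillis before answering 200.
    static HttpServer startSlowServer(long delayMillis) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the HTTP status code, or -1 if the request timed out.
    // A timeoutMillis of 0 means "wait indefinitely".
    static int fetchStatus(String url, int timeoutMillis) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            try {
                return conn.getResponseCode();
            } catch (SocketTimeoutException e) {
                return -1;
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startSlowServer(500);
        String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
        try {
            System.out.println("100 ms timeout: " + fetchStatus(url, 100));
            System.out.println("no timeout:     " + fetchStatus(url, 0));
        } finally {
            server.stop(0);
        }
    }
}
```

With a 100 ms limit the request gives up before the stub answers, while a timeout of zero waits until the response arrives. In Jsoup the same situation surfaces as an `IOException` from `get()`, so it lands in the last catch block of the code above.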
Answer by Udit Kapahi
If you are getting response code 404, you can skip that URL.
Use ignoreHttpErrors(true) so that Jsoup does not throw an exception on HTTP error statuses. Note that the document returned by get() is then the parsed error page, not the page you wanted.
// url is the address to fetch, defined elsewhere
Document doc3 = null;
try {
    doc3 = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
            .referrer("http://www.google.com")
            .ignoreHttpErrors(true) // do not throw on 404 and other HTTP error statuses
            .get();
} catch (IOException e) {
    // get() declares IOException, so that is what has to be handled here
    e.printStackTrace();
}
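The "skip that URL" idea can be sketched as a loop that checks the status code before using the body. The example below is a JDK-only approximation, not Jsoup: `java.net.http.HttpClient` returns error responses without throwing, much like `ignoreHttpErrors(true)`. `SkipOnErrorDemo`, `startStubServer`, `fetchSuccessfulBodies` and the `/found` path are hypothetical. With Jsoup itself, the equivalent check would go through `Connection.execute()` and `Response.statusCode()` before calling `Response.parse()`.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SkipOnErrorDemo {

    // Hypothetical stub site: "/found" answers 200, every other path answers 404.
    static HttpServer startStubServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                boolean found = exchange.getRequestURI().getPath().equals("/found");
                byte[] body = (found ? "<p>found</p>" : "<p>missing</p>").getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(found ? 200 : 404, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fetches each URL in order and keeps only the bodies that came back with status 200.
    static List<String> fetchSuccessfulBodies(List<String> urls) {
        HttpClient client = HttpClient.newBuilder().version(HttpClient.Version.HTTP_1_1).build();
        List<String> bodies = new ArrayList<>();
        for (String url : urls) {
            try {
                HttpResponse<String> response = client.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() != 200) {
                    continue; // error status: skip this URL instead of using the error page
                }
                bodies.add(response.body());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return bodies;
    }

    public static void main(String[] args) {
        HttpServer server = startStubServer();
        String base = "http://127.0.0.1:" + server.getAddress().getPort();
        try {
            System.out.println(fetchSuccessfulBodies(List.of(base + "/missing", base + "/found")));
        } finally {
            server.stop(0);
        }
    }
}
```

Checking the status explicitly keeps error pages out of whatever parsing comes next, which is easy to overlook when the exception is merely suppressed.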