Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/21858701/

Date: 2020-08-13 11:06:31  Source: igfitidea

How to fix "HTTP error fetching URL. Status=500" in Java while crawling?

Tags: java, web-crawler, jsoup, http-error

Asked by mOna

I am trying to crawl user ratings of movies from the IMDb review pages (my database contains about 600,000 movies). I used jsoup to parse the pages as shown below (sorry, I have not included the whole code because it is too long):

try {
    // connecting to the MySQL db
    ResultSet res = st.executeQuery(
            "SELECT id, title, production_year " +
            "FROM title " +
            "WHERE kind_id = 1 " +
            "LIMIT 0, 100000");
    while (res.next()) {
        .......
        .......
        String baseUrl = "http://www.imdb.com/search/title?release_date="
                + year + "," + year + "&title=" + movieName
                + "&title_type=feature,short,documentary,unknown";
        Document doc = Jsoup.connect(baseUrl)
                .userAgent("Mozilla")
                .timeout(0).get();
        .....
        .....
        // insert ratings into database
        ...

I tested it with the first 100, then the first 500, and then the first 2,000 movies in my db, and it worked well. But when I ran it for 100,000 movies I got this error:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500,   URL=http://www.imdb.com/search/title?release_date=1899,1899&title='Columbia'%20Close%20to%20the%20Wind&title_type=feature,short,documentary,unknown
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at imdb.main(imdb.java:47)

I searched a lot for this error and found that it is a server-side error (a 5xx status code).
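The failing URL in the stack trace above still contains raw single quotes from the movie title. Whether that is what makes the server answer with 500 is an assumption, not something this post confirms, but encoding the title before concatenating it into the query string is a cheap thing to rule out. The sketch below uses the standard `java.net.URLEncoder`; the class name `SearchUrlSketch`, the method `buildSearchUrl`, and the `int` type for `year` are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the search URL from the question with an encoded title.
final class SearchUrlSketch {

    static String buildSearchUrl(String movieName, int year) {
        String encodedTitle;
        try {
            // Form-encodes the title: spaces become '+', single quotes become %27.
            encodedTitle = URLEncoder.encode(movieName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
        return "http://www.imdb.com/search/title?release_date="
                + year + "," + year
                + "&title=" + encodedTitle
                + "&title_type=feature,short,documentary,unknown";
    }
}
```

The returned string could then be passed to `Jsoup.connect(...)` in place of the hand-built `baseUrl`.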

Then I decided to add a condition: when a connection fails, the crawler tries 2 more times, and if it still cannot connect, it does not stop but moves on to the next URL. Since I am new to Java, I searched for similar questions and read these answers on StackOverflow:
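The retry-then-skip idea described above can be sketched independently of jsoup. This is a minimal sketch, not the asker's code: `RetrySketch`, `fetchWithRetry`, and `maxAttempts` are hypothetical names, and the `Callable` stands in for whatever performs the request.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of jsoup: runs a fetch up to maxAttempts times.
final class RetrySketch {

    // Returns the fetch result, or null if every attempt threw an exception,
    // so the caller can skip this URL and continue with the next one.
    static <T> T fetchWithRetry(Callable<T> fetch, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (Exception e) {
                System.err.println("Attempt " + attempt + " of " + maxAttempts
                        + " failed: " + e.getMessage());
            }
        }
        return null; // give up on this URL
    }
}
```

Inside the `while (res.next())` loop, a call such as `Document doc = RetrySketch.fetchWithRetry(() -> Jsoup.connect(baseUrl).get(), 3);` followed by `if (doc == null) continue;` would give one initial try plus two retries before moving to the next row.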

Exceptions while I am extracting data from a Web site

Jsoup error handling when couldn't connect to website

Handling connection errors and JSoup

However, when I try to use "Connection.Response" as they suggest, the compiler tells me that "Connection.Response cannot be resolved to a type".

I would appreciate it if someone could help me. I am a newbie, and I know the fix might be simple, but I don't know how to do it.



Update: I was able to fix the HTTP status 500 error by adding "ignoreHttpErrors(true)" as below:

org.jsoup.Connection con = Jsoup.connect(baseUrl)
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
        .timeout(180000)          // 180 seconds
        .ignoreHttpErrors(true)   // do not throw on 4xx/5xx responses
        .followRedirects(true);
org.jsoup.Connection.Response resp = con.execute();
Document doc = null;

if (resp.statusCode() == 200) {
    // Parse the response that was already fetched; con.get() would send a second request.
    doc = resp.parse();
......

I hope this helps anyone who hits the same error.

However, after crawling the review pages of 22,907 movies (about 12 hours), I got another error: "READ TIMED OUT".

I would appreciate any suggestion for fixing this error.
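One common way to cope with occasional read timeouts in a long crawl is to pause and retry with a growing delay instead of aborting. The sketch below is only an illustration under assumptions: jsoup reports a read timeout as `java.net.SocketTimeoutException`, while `TimeoutRetrySketch`, `backoffMillis`, `callWithBackoff`, and the delay values are hypothetical and would need tuning.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper: retries a fetch on timeouts, waiting longer each time.
final class TimeoutRetrySketch {

    // Delay before retry number `attempt` (1-based): base, 2*base, 4*base, ...
    // capped at maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt - 1, 20);
        return Math.min(delay, maxMillis);
    }

    // Retries only on timeouts; any other exception is passed on to the caller.
    static <T> T callWithBackoff(Callable<T> fetch, int maxAttempts,
                                 long baseMillis, long maxMillis) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return fetch.call();
            } catch (SocketTimeoutException e) {
                if (attempt >= maxAttempts) {
                    throw e; // still timing out after the last allowed attempt
                }
                Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
            }
        }
    }
}
```

With `baseMillis = 1000` and `maxMillis = 30000`, the waits would be 1 s, 2 s, 4 s, and so on, never exceeding 30 s.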

Accepted answer by PopoFibo

Upgrading my comments to an answer:

Connection.Response is org.jsoup.Connection.Response, so it resolves once org.jsoup.Connection is imported.

To create a Document instance only when the HTTP status code is valid (200), break your call into 3 parts: Connection, Response, and Document.

Hence, the relevant part of your code above becomes:

while (res.next()) {
    .......
    .......
    String baseUrl = "http://www.imdb.com/search/title?release_date="
            + year + "," + year + "&title=" + movieName
            + "&title_type=feature,short,documentary,unknown";
    Connection con = Jsoup.connect(baseUrl)
            .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
            .timeout(10000)
            // Without this, execute() throws HttpStatusException on a 500
            // before the status check below can run.
            .ignoreHttpErrors(true);
    Connection.Response resp = con.execute();
    Document doc = null;
    if (resp.statusCode() == 200) {
        // Parse the response already fetched; con.get() would send a second request.
        doc = resp.parse();
        ....
    }
}