Python: urllib.error.HTTPError: HTTP Error 404: Not Found

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/42441211/


Python: urllib.error.HTTPError: HTTP Error 404: Not Found

Tags: python, python-3.x, urllib

Asked by jophab

I wrote a script to find spelling mistakes in SO questions' titles. I used it for about a month, and it was working fine.


But now, when I try to run it, I am getting this.


Traceback (most recent call last):
  File "copyeditor.py", line 32, in <module>
    find_bad_qn(i)
  File "copyeditor.py", line 15, in find_bad_qn
    html = urlopen(url)
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

This is my code


import json
from urllib.request import urlopen
from bs4 import BeautifulSoup
from enchant import DictWithPWL
from enchant.checker import SpellChecker

my_dict = DictWithPWL("en_US", pwl="terms.dict")
chkr = SpellChecker(lang=my_dict)
result = []


def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html5lib")
    que = bsObj.find_all("div", class_="question-summary")
    for div in que:
        link = div.a.get('href')
        name = div.a.text
        chkr.set_text(name.lower())
        list1 = []
        for err in chkr:
            list1.append(chkr.word)
        if (len(list1) > 1):
            str1 = ' '.join(list1)
            result.append({'link': link, 'name': name, 'words': str1})


print("Please Wait.. it will take some time")
for i in range(298314,298346):
    find_bad_qn(i)
for qn in result:
    qn['link'] = "https://stackoverflow.com" + qn['link']
for qn in result:
    print(qn['link'], " Error Words:", qn['words'])
    url = qn['link']
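As a side note, the query string above is built by plain string concatenation. A small sketch using the standard library's `urllib.parse.urlencode` produces the same URL while handling escaping automatically (the helper name `question_page_url` is just for illustration):

```python
from urllib.parse import urlencode

def question_page_url(page):
    # urlencode escapes each key/value pair and joins them with "&"
    return "https://stackoverflow.com/questions?" + urlencode({"page": page, "sort": "active"})

print(question_page_url(298314))
# https://stackoverflow.com/questions?page=298314&sort=active
```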

UPDATE


This is the URL causing the problem, even though the URL exists:


https://stackoverflow.com/questions?page=298314&sort=active


I tried changing the range to some lower values. It works fine now.


Why does this happen with the above URL?


Answered by Atirag

Apparently the default number of questions displayed per page is 50, so the range you defined in the loop goes beyond the number of pages available at 50 questions per page. The range should be adapted to stay within the total number of pages.

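To make the page arithmetic concrete, here is a hedged sketch: assuming some total question count (the 14,900,000 figure below is purely hypothetical) and 50 questions per page, the last valid page number is the ceiling of the total divided by the page size, and any page number above it returns a 404:

```python
import math

def last_valid_page(total_questions, per_page=50):
    # highest page number that still shows at least one question
    return math.ceil(total_questions / per_page)

# with a hypothetical total of 14,900,000 questions:
print(last_valid_page(14_900_000))  # 298000 -> requesting page 298314 would 404
```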

The following code catches the 404 error (the cause of the exception you saw) and ignores it, in case the page number goes out of range.


from urllib.request import urlopen
from urllib.error import HTTPError

def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    try:
        urlopen(url)
    except HTTPError:
        # page number is past the last available page; skip it
        pass

print("Please Wait.. it will take some time")
for i in range(298314, 298346):
    find_bad_qn(i)

Answered by przemaz

I have exactly the same problem. The URL that I want to fetch with urllib exists and is accessible in a normal browser, but urllib is telling me 404.


The solution for me was to not use urllib:


import requests
requests.get(url)

This works for me.


Answered by Stevo

Try importing Request and passing headers={'User-Agent': 'Mozilla/5.0'} when building the request, i.e.:

from urllib.request import Request, urlopen

url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req)