Original source: http://stackoverflow.com/questions/42441211/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Python: urllib.error.HTTPError: HTTP Error 404: Not Found
Asked by jophab
I wrote a script to find spelling mistakes in the titles of SO questions. I used it for about a month, and it was working fine.
But now, when I try to run it, I get this:
Traceback (most recent call last):
  File "copyeditor.py", line 32, in <module>
    find_bad_qn(i)
  File "copyeditor.py", line 15, in find_bad_qn
    html = urlopen(url)
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
This is my code
import json
from urllib.request import urlopen
from bs4 import BeautifulSoup
from enchant import DictWithPWL
from enchant.checker import SpellChecker

my_dict = DictWithPWL("en_US", pwl="terms.dict")
chkr = SpellChecker(lang=my_dict)
result = []


def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html5lib")
    que = bsObj.find_all("div", class_="question-summary")
    for div in que:
        link = div.a.get('href')
        name = div.a.text
        chkr.set_text(name.lower())
        list1 = []
        for err in chkr:
            list1.append(chkr.word)
        if (len(list1) > 1):
            str1 = ' '.join(list1)
            result.append({'link': link, 'name': name, 'words': str1})


print("Please Wait.. it will take some time")
for i in range(298314,298346):
    find_bad_qn(i)
for qn in result:
    qn['link'] = "https://stackoverflow.com" + qn['link']
for qn in result:
    print(qn['link'], " Error Words:", qn['words'])
    url = qn['link']
UPDATE
This is the URL causing the problem, even though the URL exists.
https://stackoverflow.com/questions?page=298314&sort=active
I tried changing the range to some lower values. It works fine now.
Why did this happen with the above URL?
Answered by Atirag
Apparently the default number of questions displayed per page is 50, so the range you defined in the loop goes past the number of pages available at 50 questions per page. The range should be adjusted to stay within the total number of pages at 50 questions each.
This code catches the 404 error, which is what caused your traceback, and ignores it in case you go out of range.
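from urllib.error import HTTPError
from urllib.request import urlopen


def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    try:
        urlopen(url)
    except HTTPError as err:
        # Ignore pages that do not exist; re-raise any other HTTP error.
        if err.code != 404:
            raise


print("Please Wait.. it will take some time")
for i in range(298314,298346):
    find_bad_qn(i)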
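The page limit described in this answer comes down to simple arithmetic, which can be sketched as follows. This is only an illustration under assumptions: the helper names `last_page` and `clamp_page_range` are hypothetical, the total question count is a made-up number, and the page size of 50 is taken from the answer rather than verified against the site.

```python
import math

PAGE_SIZE = 50  # questions per page, as stated in the answer (an assumption)


def last_page(total_questions, page_size=PAGE_SIZE):
    """Return the number of the last page that still contains questions."""
    return math.ceil(total_questions / page_size)


def clamp_page_range(start, stop, total_questions, page_size=PAGE_SIZE):
    """Trim range(start, stop) so it never goes past the last available page."""
    end = min(stop, last_page(total_questions, page_size) + 1)
    # If start is already past the last page, return an empty range.
    return range(start, max(start, end))


# Hypothetical total, chosen only to illustrate the arithmetic.
example_total = 1000020
print(last_page(example_total))
print(list(clamp_page_range(19999, 20010, example_total)))
```

With a range trimmed this way, the loop would not request pages beyond the end of the listing, so the 404 handler only has to cover the case where the total changes while the script runs.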
Answered by przemaz
I have exactly the same problem. The URL that I want to fetch using urllib exists and is accessible using a normal browser, but urllib is telling me 404.
The solution for me was not to use urllib:
import requests
requests.get(url)
This works for me.
Answered by Stevo
Try importing Request and passing headers={'User-Agent': 'Mozilla/5.0'} as an argument after your URL.
For example:
from urllib.request import Request, urlopen
url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req)
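The snippet above can be combined with explicit 404 handling into a small helper. This is a minimal sketch, not a definitive fix: the names `build_request` and `fetch_page` are hypothetical, and whether a browser-like User-Agent actually avoids the 404 depends on how the server treats the request.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://stackoverflow.com/questions"


def build_request(page):
    """Build a request for one listing page with a browser-like User-Agent."""
    query = urlencode({"page": page, "sort": "active"})
    return Request(BASE_URL + "?" + query, headers={"User-Agent": "Mozilla/5.0"})


def fetch_page(page):
    """Return the page body as bytes, or None if the server answers 404."""
    try:
        with urlopen(build_request(page)) as response:
            return response.read()
    except HTTPError as err:
        if err.code == 404:
            return None
        raise  # any other HTTP error is still reported


# Building the request does not touch the network, so it is safe to inspect.
req = build_request(1)
print(req.full_url)
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Returning None for a 404 lets the calling loop skip or stop on missing pages, while other errors still surface instead of being silently swallowed.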