Python: urllib.error.HTTPError: HTTP Error 404: Not Found

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/42441211/


Python: urllib.error.HTTPError: HTTP Error 404: Not Found

Tags: python, python-3.x, urllib

Asked by jophab

I wrote a script to find spelling mistakes in SO questions' titles. I used it for about a month, and it was working fine.


But now, when I try to run it, I am getting this.


Traceback (most recent call last):
  File "copyeditor.py", line 32, in <module>
    find_bad_qn(i)
  File "copyeditor.py", line 15, in find_bad_qn
    html = urlopen(url)
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

This is my code


import json
from urllib.request import urlopen
from bs4 import BeautifulSoup
from enchant import DictWithPWL
from enchant.checker import SpellChecker

my_dict = DictWithPWL("en_US", pwl="terms.dict")
chkr = SpellChecker(lang=my_dict)
result = []


def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html5lib")
    que = bsObj.find_all("div", class_="question-summary")
    for div in que:
        link = div.a.get('href')
        name = div.a.text
        chkr.set_text(name.lower())
        list1 = []
        for err in chkr:
            list1.append(chkr.word)
        if (len(list1) > 1):
            str1 = ' '.join(list1)
            result.append({'link': link, 'name': name, 'words': str1})


print("Please Wait.. it will take some time")
for i in range(298314,298346):
    find_bad_qn(i)
for qn in result:
    qn['link'] = "https://stackoverflow.com" + qn['link']
for qn in result:
    print(qn['link'], " Error Words:", qn['words'])
    url = qn['link']
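As a side note, the query string above is built by plain string concatenation. A small sketch using the standard library's `urllib.parse.urlencode` produces the same URL while handling escaping automatically (the helper name `question_page_url` is just for illustration):

```python
from urllib.parse import urlencode

def question_page_url(page):
    # urlencode escapes each key/value pair and joins them with "&"
    return "https://stackoverflow.com/questions?" + urlencode({"page": page, "sort": "active"})

print(question_page_url(298314))
# https://stackoverflow.com/questions?page=298314&sort=active
```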

UPDATE


This is the URL causing the problem, even though the URL exists:


https://stackoverflow.com/questions?page=298314&sort=active


I tried changing the range to some lower values. It works fine now.


Why does this happen with the above URL?


Answered by Atirag

Apparently the default number of questions displayed per page is 50, so the range you defined in the loop goes beyond the number of pages available at 50 questions per page. The range should be adapted to stay within the total number of pages.

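To make the page arithmetic concrete, here is a hedged sketch: assuming some total question count (the 14,900,000 figure below is purely hypothetical) and 50 questions per page, the last valid page number is the ceiling of the total divided by the page size, and any page number above it returns a 404:

```python
import math

def last_valid_page(total_questions, per_page=50):
    # highest page number that still shows at least one question
    return math.ceil(total_questions / per_page)

# with a hypothetical total of 14,900,000 questions:
print(last_valid_page(14_900_000))  # 298000 -> requesting page 298314 would 404
```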

The following code catches the 404 error (the cause of the exception you saw) and ignores it, in case the page number goes out of range.


from urllib.request import urlopen
from urllib.error import HTTPError

def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    try:
        urlopen(url)
    except HTTPError:
        # page number is past the last available page; skip it
        pass

print("Please Wait.. it will take some time")
for i in range(298314, 298346):
    find_bad_qn(i)

Answered by przemaz

I have exactly the same problem. The URL that I want to fetch with urllib exists and is accessible in a normal browser, but urllib is telling me 404.


The solution for me was to not use urllib:


import requests
requests.get(url)

This works for me.


Answered by Stevo

Try importing Request and passing headers={'User-Agent': 'Mozilla/5.0'} when building the request, i.e.:

from urllib.request import Request, urlopen

url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req)