HTTP error 403 in Python 3 Web Scraping

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.

Original source: http://stackoverflow.com/questions/16627227/
Asked by Josh
I was trying to scrape a website for practice, but I kept getting HTTP Error 403 (does it think I'm a bot)?
Here is my code:
#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re
webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')
row_array = re.findall(findrows, webpage)
links = re.finall(findlink, webpate)
print(len(row_array))
iterator = []
The error I get is:
 File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\lib\urllib\request.py", line 479, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 517, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Accepted answer by Stefano Sanfilippo
This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected). Try setting a known browser user agent with:
from urllib.request import Request, urlopen
req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
This works for me.
By the way, in your code you are missing the () after .read in the urlopen line, but I think that it's a typo.
TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason...
Answered by Robert Lujo
Since the page works in a browser but not when called from within a Python program, it seems that the web app serving that URL recognizes that the content is not being requested by a browser.
Demonstration:
curl --dump-header r.txt 'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1'
...
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access ...
</HTML>
and the content in r.txt has the status line:
HTTP/1.1 403 Forbidden
Try sending a 'User-Agent' header that fakes a web client.
NOTE: The page contains an Ajax call that creates the table you probably want to parse. You'll need to check the JavaScript logic of the page, or simply use a browser debugger (like Firebug's Net tab), to see which URL you need to call to get the table's content.
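Once you have found that URL, a minimal sketch of calling it directly; the endpoint below is hypothetical, and the real URL and response format depend entirely on the site's JavaScript:

import json
import urllib.request

# Hypothetical JSON endpoint discovered in the browser's network tab.
api_url = 'http://www.example.com/api/products?page=1'
req = urllib.request.Request(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(urllib.request.urlopen(req).read().decode('utf-8'))
print(data)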
Answered by zeta
It's definitely blocking because of your use of urllib, based on the user agent. The same thing happened to me with OfferUp. You can create a new class called AppURLopener which overrides the user agent with Mozilla.
import urllib.request

# Subclass FancyURLopener so requests go out with a browser-like
# user agent instead of the default Python-urllib string.
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('http://httpbin.org/user-agent')
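As a hedged usage note: the response can then be read and decoded like any other urllib response (httpbin echoes the received user agent back as JSON). Keep in mind that FancyURLopener is deprecated in Python 3 in favor of the Request approach from the accepted answer.

# Print the body; httpbin reports the user agent it received.
print(response.read().decode('utf-8'))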
Answered by royatirek
"This is probably because of mod_securityor some similar server security feature which blocks known
“这可能是因为mod_security或某些类似的服务器安全功能阻止了已知的
spider/bot
蜘蛛/机器人
user agents (urllib uses something like python urllib/3.3.0, it's easily detected)" - as already mentioned by Stefano Sanfilippo
用户代理(urllib 使用类似 python urllib/3.3.0 的东西,很容易被检测到)”——正如 Stefano Sanfilippo 已经提到的
from urllib.request import Request, urlopen
url="https://stackoverflow.com/search?q=html+error+403"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
web_byte is a bytes object returned by the server, and the content type of the page is mostly UTF-8, so you need to decode web_byte using the decode method.
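If you would rather not assume UTF-8, a small sketch that reads the charset from the server's Content-Type header instead, falling back to UTF-8 when none is declared:

from urllib.request import Request, urlopen

req = Request('https://stackoverflow.com/search?q=html+error+403',
              headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req)
# Use the charset declared in the Content-Type header, defaulting to UTF-8.
charset = response.headers.get_content_charset() or 'utf-8'
webpage = response.read().decode(charset)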
This solved the whole problem I was having while trying to scrape a website using PyCharm.
P.S. I use Python 3.4.
Answered by Johnson
You can try two things. The details are in this link.
1) Via pip

pip install --upgrade certifi
2) If that doesn't work, try running the Certificates.command that comes bundled with Python 3.* for Mac (go to your Python installation location and double-click the file):

open /Applications/Python\ 3.*/Install\ Certificates.command
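If the system certificates still aren't picked up, a hedged sketch of wiring certifi's CA bundle in explicitly (assuming the certifi package is installed; the URL is a placeholder):

import ssl
import urllib.request

import certifi

# Build an SSL context that explicitly trusts certifi's CA bundle.
context = ssl.create_default_context(cafile=certifi.where())
response = urllib.request.urlopen('https://www.example.com', context=context)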
Answered by user8316087
If you feel guilty about faking the user agent as Mozilla (see the comment on the top answer from Stefano), it could work with a non-urllib User-Agent as well. This worked for the sites I reference:
import urllib.request as urlrequest

# link holds the URL being checked.
req = urlrequest.Request(link, headers={'User-Agent': 'XYZ/3.0'})
urlrequest.urlopen(req, timeout=10).read()
My application tests validity by scraping the specific links that I refer to in my articles; it is not a generic scraper.
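A minimal sketch of that kind of link check; the check_link helper and its return convention are illustrative assumptions, not from the original answer:

import socket
import urllib.error
import urllib.request as urlrequest

def check_link(link):
    # Return True if the link answers with HTTP 200, False otherwise.
    req = urlrequest.Request(link, headers={'User-Agent': 'XYZ/3.0'})
    try:
        with urlrequest.urlopen(req, timeout=10) as response:
            return response.getcode() == 200
    except (urllib.error.URLError, socket.timeout):
        return False

print(check_link('https://example.com/'))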
Answered by Jonny_P
Based on previous answers, this has worked for me with Python 3.7:
from urllib.request import Request, urlopen

# Replace 'Url_Link' with the URL you want to fetch.
req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()

print(webpage)
Answered by VICTOR IWUOHA
Based on the previous answer,
from urllib.request import Request, urlopen

# specify the URL to fetch
url = 'https://xyz/xyz'
req = Request(url, headers={'User-Agent': 'XYZ/3.0'})
response = urlopen(req, timeout=20).read()
This worked for me; extending the timeout is what fixed it.
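A hedged sketch of making the timeout failure explicit, so a slow site raises a clear error instead of appearing to hang; the exception handling shown is an assumption, not part of the original answer:

import socket
import urllib.error
from urllib.request import Request, urlopen

url = 'https://xyz/xyz'  # placeholder URL from the answer above
req = Request(url, headers={'User-Agent': 'XYZ/3.0'})
try:
    response = urlopen(req, timeout=20).read()
except (socket.timeout, urllib.error.URLError) as exc:
    print('Request failed or timed out:', exc)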

