HTTP error 403 in Python 3 Web Scraping

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.

Original source: http://stackoverflow.com/questions/16627227/
Asked by Josh
I was trying to scrape a website for practice, but I kept getting HTTP Error 403 (does it think I'm a bot)?
Here is my code:
#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re
webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')
row_array = re.findall(findrows, webpage)
links = re.finall(findlink, webpate)
print(len(row_array))
iterator = []
The error I get is:
 File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\lib\urllib\request.py", line 479, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 517, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Accepted answer by Stefano Sanfilippo
This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected). Try setting a known browser user agent with:
from urllib.request import Request, urlopen
req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
This works for me.
By the way, in your code you are missing the () after .read in the urlopen line, but I think that it's a typo.
TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason...
Answered by Robert Lujo
Since the page works in a browser but not when called from within a Python program, it seems that the web app serving that URL recognizes that the content is not being requested by a browser.
Demonstration:
curl --dump-header r.txt 'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1'
...
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access ...
</HTML>
and the content in r.txt has the status line:
HTTP/1.1 403 Forbidden
Try sending a 'User-Agent' header that fakes a web client.
NOTE: The page contains an Ajax call that creates the table you probably want to parse. You'll need to check the JavaScript logic of the page, or simply use a browser debugger (like Firebug's Net tab), to see which URL you need to call to get the table's content.
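Once you have found that URL, a minimal sketch of calling it directly; the endpoint below is hypothetical, and the real URL and response format depend entirely on the site's JavaScript:

import json
import urllib.request

# Hypothetical JSON endpoint discovered in the browser's network tab.
api_url = 'http://www.example.com/api/products?page=1'
req = urllib.request.Request(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(urllib.request.urlopen(req).read().decode('utf-8'))
print(data)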
Answered by zeta
It's definitely blocking because of your use of urllib, based on the user agent. The same thing happened to me with OfferUp. You can create a new class called AppURLopener which overrides the user agent with Mozilla.
import urllib.request

# Subclass FancyURLopener so requests go out with a browser-like
# user agent instead of the default Python-urllib string.
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('http://httpbin.org/user-agent')
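As a hedged usage note: the response can then be read and decoded like any other urllib response (httpbin echoes the received user agent back as JSON). Keep in mind that FancyURLopener is deprecated in Python 3 in favor of the Request approach from the accepted answer.

# Print the body; httpbin reports the user agent it received.
print(response.read().decode('utf-8'))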
Answered by royatirek
"This is probably because of mod_securityor some similar server security feature which blocks known
“这可能是因为mod_security或某些类似的服务器安全功能阻止了已知的
spider/bot
蜘蛛/机器人
user agents (urllib uses something like python urllib/3.3.0, it's easily detected)" - as already mentioned by Stefano Sanfilippo
用户代理(urllib 使用类似 python urllib/3.3.0 的东西,很容易被检测到)”——正如 Stefano Sanfilippo 已经提到的
from urllib.request import Request, urlopen
url="https://stackoverflow.com/search?q=html+error+403"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
web_byte is a bytes object returned by the server, and the content type of the page is mostly UTF-8, so you need to decode web_byte using the decode method.
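If you would rather not assume UTF-8, a small sketch that reads the charset from the server's Content-Type header instead, falling back to UTF-8 when none is declared:

from urllib.request import Request, urlopen

req = Request('https://stackoverflow.com/search?q=html+error+403',
              headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req)
# Use the charset declared in the Content-Type header, defaulting to UTF-8.
charset = response.headers.get_content_charset() or 'utf-8'
webpage = response.read().decode(charset)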
This solved the whole problem I was having while trying to scrape a website using PyCharm.
P.S. I use Python 3.4.
Answered by Johnson
You can try two things. The details are in this link.
1) Via pip

pip install --upgrade certifi
2) If that doesn't work, try running the Certificates.command that comes bundled with Python 3.* for Mac (go to your Python installation location and double-click the file):

open /Applications/Python\ 3.*/Install\ Certificates.command
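If the system certificates still aren't picked up, a hedged sketch of wiring certifi's CA bundle in explicitly (assuming the certifi package is installed; the URL is a placeholder):

import ssl
import urllib.request

import certifi

# Build an SSL context that explicitly trusts certifi's CA bundle.
context = ssl.create_default_context(cafile=certifi.where())
response = urllib.request.urlopen('https://www.example.com', context=context)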
Answered by user8316087
If you feel guilty about faking the user agent as Mozilla (see the comment on the top answer from Stefano), it could work with a non-urllib User-Agent as well. This worked for the sites I reference:
import urllib.request as urlrequest

# link holds the URL being checked.
req = urlrequest.Request(link, headers={'User-Agent': 'XYZ/3.0'})
urlrequest.urlopen(req, timeout=10).read()
My application tests validity by scraping the specific links that I refer to in my articles; it is not a generic scraper.
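A minimal sketch of that kind of link check; the check_link helper and its return convention are illustrative assumptions, not from the original answer:

import socket
import urllib.error
import urllib.request as urlrequest

def check_link(link):
    # Return True if the link answers with HTTP 200, False otherwise.
    req = urlrequest.Request(link, headers={'User-Agent': 'XYZ/3.0'})
    try:
        with urlrequest.urlopen(req, timeout=10) as response:
            return response.getcode() == 200
    except (urllib.error.URLError, socket.timeout):
        return False

print(check_link('https://example.com/'))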
Answered by Jonny_P
Based on previous answers, this has worked for me with Python 3.7:
from urllib.request import Request, urlopen

# Replace 'Url_Link' with the URL you want to fetch.
req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()

print(webpage)
Answered by VICTOR IWUOHA
Based on the previous answer,
from urllib.request import Request, urlopen

# specify the URL to fetch
url = 'https://xyz/xyz'
req = Request(url, headers={'User-Agent': 'XYZ/3.0'})
response = urlopen(req, timeout=20).read()
This worked for me; extending the timeout is what fixed it.
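A hedged sketch of making the timeout failure explicit, so a slow site raises a clear error instead of appearing to hang; the exception handling shown is an assumption, not part of the original answer:

import socket
import urllib.error
from urllib.request import Request, urlopen

url = 'https://xyz/xyz'  # placeholder URL from the answer above
req = Request(url, headers={'User-Agent': 'XYZ/3.0'})
try:
    response = urlopen(req, timeout=20).read()
except (socket.timeout, urllib.error.URLError) as exc:
    print('Request failed or timed out:', exc)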

