How to avoid HTTP Error 429 (Too Many Requests) in Python

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/22786068/

Date: 2020-08-19 01:40:15 · Source: igfitidea

How to avoid HTTP error 429 (Too Many Requests) python

python, http, mechanize, http-status-code-429

Asked by Aous1000

I am trying to use Python to log in to a website and gather information from several web pages, and I get the following error:

Traceback (most recent call last):
  File "extract_test.py", line 43, in <module>
    response=br.open(v)
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 429: Unknown Response Code

I used time.sleep() and it works, but it seems unintelligent and unreliable. Is there any other way to dodge this error?

Here's my code:

import mechanize
import cookielib
import re
first=("example.com/page1")
second=("example.com/page2")
third=("example.com/page3")
fourth=("example.com/page4")
## I have seven URLs I want to open

urls_list=[first,second,third,fourth]

br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options 
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Log in credentials
br.open("example.com")
br.select_form(nr=0)
br["username"] = "username"
br["password"] = "password"
br.submit()

for url in urls_list:
    response = br.open(url)
    print re.findall("Some String", response.read())

Answered by Gaurav Agarwal

Another workaround would be to spoof your IP using some sort of public VPN or the Tor network. This assumes the rate limiting is applied by the server at the IP level.

There is a brief blog post demonstrating a way to use Tor along with urllib2:

http://blog.flip-edesign.com/?p=119

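For a more current variant of the same idea, here is a minimal sketch that routes the requests library through a local Tor SOCKS proxy instead of urllib2. It assumes a Tor daemon is listening on 127.0.0.1:9050, that the requests[socks] extra (PySocks) is installed, and the URL is only a placeholder:

import requests

# Assumption: a Tor daemon is running locally on port 9050 and
# PySocks is installed (pip install "requests[socks]").
# The "socks5h" scheme makes DNS resolution go through Tor as well.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

response = requests.get("http://example.com/page1", proxies=proxies, timeout=30)
print(response.status_code)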

Answered by MRA

Receiving a status 429 is not an error, it is the other server "kindly" asking you to please stop spamming requests. Obviously, your rate of requests has been too high and the server is not willing to accept this.

You should not seek to "dodge" this, or even try to circumvent server security settings by trying to spoof your IP; you should simply respect the server's answer by not sending too many requests.

If everything is set up properly, you will also have received a "Retry-After" header along with the 429 response. This header specifies the number of seconds you should wait before making another call. The proper way to deal with this "problem" is to read this header and to sleep your process for that many seconds.

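As an illustration, here is a minimal sketch of that approach using the requests library (the question uses mechanize, but the idea is the same). The URL is only a placeholder, and the 60-second fallback is an assumption for servers that omit the header; note that Retry-After may also contain an HTTP date, which this sketch does not handle:

import time
import requests

def get_respecting_retry_after(url):
    """Fetch a URL; on a 429, sleep for as long as Retry-After asks, then retry once."""
    response = requests.get(url)
    if response.status_code == 429:
        # Fall back to 60 seconds if the server did not send Retry-After (assumed default).
        delay = int(response.headers.get("Retry-After", 60))
        time.sleep(delay)
        response = requests.get(url)
    return response

response = get_respecting_retry_after("http://example.com/page1")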

You can find more information on status 429 here: http://tools.ietf.org/html/rfc6585#page-3

Answered by tadm123

Writing this piece of code fixed my problem:

requests.get(link, headers = {'User-agent': 'your bot 0.1'})

Answered by psaniko

As MRA said, you shouldn't try to dodge a 429 Too Many Requests but instead handle it accordingly. You have several options depending on your use-case:

1) Sleep your process. The server usually includes a Retry-After header in the response with the number of seconds you are supposed to wait before retrying. Keep in mind that sleeping a process might cause problems, e.g. in a task queue, where you should instead retry the task at a later time to free up the worker for other things.

2) Exponential backoff. If the server does not tell you how long to wait, you can retry your request using increasing pauses in between. The popular task queue Celery has this feature built right in.

3) Token bucket. This technique is useful if you know in advance how many requests you are able to make in a given time. Each time you access the API you first fetch a token from the bucket. The bucket is refilled at a constant rate. If the bucket is empty, you know you'll have to wait before hitting the API again. Token buckets are usually implemented on the other end (the API) but you can also use them as a proxy to avoid ever getting a 429 Too Many Requests. Celery's rate_limit feature uses a token bucket algorithm; a minimal client-side sketch follows this list.

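To illustrate the token bucket option on the client side, here is a minimal sketch; the rate and capacity values are arbitrary assumptions you would tune to the API's documented limits, and br/urls_list refer to the names used in the question's code:

import time

class TokenBucket(object):
    """Minimal client-side token bucket: refills at `rate` tokens per second,
    holds at most `capacity`, and wait_for_token() blocks until a request may be sent."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.time()

    def wait_for_token(self):
        while True:
            now = time.time()
            # Refill in proportion to the time elapsed since the last check.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to become available.
            time.sleep((1 - self.tokens) / self.rate)

# Assumed limits: at most ~10 requests per second with bursts of up to 10.
bucket = TokenBucket(rate=10, capacity=10)
for url in urls_list:
    bucket.wait_for_token()
    br.open(url)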

Here is an example of a Python/Celery app using exponential backoff and rate-limiting/token bucket:

import requests
from celery import shared_task as task  # assumption: or use your Celery app's @app.task decorator
from requests.exceptions import ConnectTimeout


class TooManyRequests(Exception):
    """Too many requests"""


@task(
    rate_limit='10/s',
    autoretry_for=(ConnectTimeout, TooManyRequests,),
    retry_backoff=True)
def api(*args, **kwargs):
    r = requests.get('placeholder-external-api')

    if r.status_code == 429:
        raise TooManyRequests()