Original question: http://stackoverflow.com/questions/2422922/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Python urllib3 and how to handle cookie support?
Asked by bigredbob
So I'm looking into urllib3 because it has connection pooling and is thread safe (so performance is better, especially for crawling), but the documentation is... minimal, to say the least. urllib2 has build_opener, so something like:
#!/usr/bin/python
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
But urllib3 has no build_opener method, so the only way I have figured out so far is to manually put it in the header:
#!/usr/bin/python
import urllib3
http_pool = urllib3.connection_from_url("http://example.com")
myheaders = {'Cookie':'some cookie data'}
r = http_pool.get_url("http://example.org/", headers=myheaders)
But I am hoping there is a better way and that one of you can tell me what it is. Also can someone tag this with "urllib3" please.
Answered by shazow
You're correct, there's no immediately better way to do this right now. I would be more than happy to accept a patch if you have a congruent improvement.
One thing to keep in mind, urllib3's HTTPConnectionPool is intended to be a "pool of connections" to a specific host, as opposed to a stateful client. In that context, it makes sense to keep the tracking of cookies outside of the actual pool.
- shazow (the author of urllib3)
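One way to keep cookie tracking outside the pool, as suggested above, can be sketched as follows. This is only a minimal illustration, not part of urllib3: `CookieStore` and `request_with_cookies` are hypothetical names, the `http` argument is assumed to behave like `urllib3.PoolManager` (a `request()` method whose response exposes `headers.getlist()`), and cookie attributes such as Domain, Path, and Expires are deliberately ignored.

```python
from http.cookies import SimpleCookie


class CookieStore:
    """Tiny name -> value cookie store kept outside the connection pool.

    Simplification: Domain, Path, Expires and Secure attributes are ignored.
    """

    def __init__(self):
        self.cookies = {}

    def update(self, set_cookie_values):
        # Each item is the value of one Set-Cookie response header.
        for raw in set_cookie_values:
            parsed = SimpleCookie()
            parsed.load(raw)
            for name, morsel in parsed.items():
                self.cookies[name] = morsel.value

    def header(self):
        # Value for the Cookie request header, e.g. "a=1; b=2".
        return "; ".join(f"{k}={v}" for k, v in self.cookies.items())


def request_with_cookies(http, method, url, store):
    """Send one request through `http` (assumed to act like urllib3.PoolManager)."""
    headers = {"Cookie": store.header()} if store.cookies else {}
    response = http.request(method, url, headers=headers)
    # getlist() is expected to return every value of a repeated header.
    store.update(response.headers.getlist("Set-Cookie"))
    return response
```

With urllib3 installed, this could be used by creating `http = urllib3.PoolManager()` and one `CookieStore()`, then routing every call through `request_with_cookies(http, "GET", url, store)`. The pool stays stateless and the store carries the session.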
Answered by Rod Montgomery
Is there not a problem with multiple cookies?
Some servers return multiple Set-Cookie headers, but urllib3 stores the headers in a dict and a dict does not allow multiple entries with the same key.
httplib2 has a similar problem.
Or maybe not: it turns out that the readheaders method of the HTTPMessage class in the httplib package -- which both urllib3 and httplib2 use -- has the following comment:
If multiple header fields with the same name occur, they are combined according to the rules in RFC 2616 sec 4.2:
Appending each subsequent field-value to the first, each separated
by a comma. The order in which header fields with the same field-name
are received is significant to the interpretation of the combined
field value.
So no headers are lost.
There is, however, a problem if there are commas within a header value. I have not yet figured out what is going on here, but from skimming RFC 2616 ("Hypertext Transfer Protocol -- HTTP/1.1") and RFC 2965 ("HTTP State Management Mechanism") I get the impression that any commas within a header value are supposed to be quoted.
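The comma problem can be reproduced with the standard library alone. The sketch below uses two made-up Set-Cookie values (the cookie names and date are assumptions for illustration) and shows that comma-joining followed by naive splitting cuts an Expires date in half, whereas parsing each header value separately does not. Set-Cookie is a known exception to the comma-joining rule, which RFC 7230 section 3.2.2 calls out.

```python
from http.cookies import SimpleCookie

# Two hypothetical Set-Cookie values; the first has a comma in its Expires date.
set_cookie_values = [
    "session=abc123; Expires=Wed, 09 Jun 2021 10:18:14 GMT; Path=/",
    "theme=dark; Path=/",
]

# What comma-joining in the style of RFC 2616 sec 4.2 produces.
combined = ", ".join(set_cookie_values)

# Naive splitting on commas yields 3 pieces instead of 2,
# because the Expires date itself contains a comma.
naive_parts = [part.strip() for part in combined.split(",")]
print(len(naive_parts))

# Parsing each header value on its own keeps both cookies intact.
names = []
for raw in set_cookie_values:
    cookie = SimpleCookie()
    cookie.load(raw)
    names.extend(cookie.keys())
print(names)
```

In practice this suggests reading repeated Set-Cookie headers as a list of separate values where the HTTP library allows it, rather than relying on a single combined string.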
Answered by rd108
You should use the requests library. It uses urllib3 but makes things like adding cookies trivial.
https://github.com/kennethreitz/requests
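import requests
url = "http://example.com/"  # placeholder: replace with the page you want to fetch
# Cookies are passed as a plain dict; requests builds the Cookie header.
r1 = requests.get(url, cookies={'somename': 'somevalue'})
print(r1.content)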
Answered by YOU
You need to set 'Cookie', not 'Set-Cookie'; 'Set-Cookie' is set by the web server.
Cookies are just one of the headers, so there is nothing wrong with doing it that way.
Answered by Adrian B
You can use code like this:
import urllib3

def getHtml(url):
    http = urllib3.PoolManager()
    # The cookie is sent as a plain 'Cookie' request header.
    r = http.request('GET', url, headers={'User-agent':'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.16 Safari/537.36','Cookie':'cookie_name=cookie_value'})
    return r.data  # response body (HTML) as bytes
You should replace cookie_name and cookie_value with your own cookie's name and value.