How do you get default headers in a urllib2 Request?
Original question: http://stackoverflow.com/questions/603856/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Corey Goldberg
I have a Python web client that uses urllib2. It is easy enough to add HTTP headers to my outgoing requests. I just create a dictionary of the headers I want to add, and pass it to the Request initializer.
However, other "standard" HTTP headers get added to the request alongside the custom ones I explicitly add. When I sniff the request using Wireshark, I see headers besides the ones I added myself. My question is: how do I get access to these headers? I want to log every request (including the full set of HTTP headers), and can't figure out how.
Any pointers?
In a nutshell: how do I get all the outgoing headers from an HTTP request created by urllib2?
Accepted answer by Brandon Rhodes
If you want to see the literal HTTP request that is sent out, and therefore see every last header exactly as it is represented on the wire, then you can tell urllib2 to use your own version of an HTTPHandler that prints out (or saves, or whatever) the outgoing HTTP request.
import httplib, urllib2

class MyHTTPConnection(httplib.HTTPConnection):
    def send(self, s):
        print s  # or save them, or whatever!
        httplib.HTTPConnection.send(self, s)

class MyHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(MyHTTPConnection, req)

opener = urllib2.build_opener(MyHTTPHandler)
response = opener.open('http://www.google.com/')
The result of running this code is:
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.google.com
Connection: close
User-Agent: Python-urllib/2.6
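On Python 3 the same idea should carry over, with urllib2 and httplib replaced by urllib.request and http.client. The sketch below is an adaptation, not part of the original answer. The names LoggingHTTPConnection, LoggingHTTPHandler and sent_chunks are made up for illustration, and a throwaway local server stands in for www.google.com so the example runs without outside network access.

```python
import threading
import http.client
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class OkHandler(BaseHTTPRequestHandler):
    """Throwaway local server so the example needs no outside network."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # silence per-request logging


sent_chunks = []  # every bytes object handed to send() ends up here


class LoggingHTTPConnection(http.client.HTTPConnection):
    def send(self, data):
        sent_chunks.append(data)  # or log it, or whatever
        super().send(data)


class LoggingHTTPHandler(urllib.request.HTTPHandler):
    def http_open(self, req):
        return self.do_open(LoggingHTTPConnection, req)


server = HTTPServer(("127.0.0.1", 0), OkHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# ProxyHandler({}) ignores any proxy settings from the environment.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}), LoggingHTTPHandler
)
with opener.open("http://127.0.0.1:%d/" % server.server_port) as response:
    body = response.read()

server.shutdown()
server.server_close()

# The request line and headers go out in a single send() call.
print(sent_chunks[0].decode("iso-8859-1"))
```

For a request without a body, the first captured chunk holds the request line plus every header as it went on the wire, including the ones added by urllib.request and http.client.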
Answer by Jarret Hardie
The urllib2 library uses OpenerDirector objects to handle the actual opening. Fortunately, the Python library provides default openers so you don't have to build them yourself. It is, however, these OpenerDirector objects that are adding the extra headers.
To see what they are after the request has been sent (so that you can log it, for example):
req = urllib2.Request(url='http://google.com')
response = urllib2.urlopen(req)
print req.unredirected_hdrs
(produces {'Host': 'google.com', 'User-agent': 'Python-urllib/2.5'} etc)
The unredirected_hdrs attribute is where the OpenerDirectors dump their extra headers. Simply looking at req.headers will show only your own headers - the library leaves those unmolested for you.
If you need to see the headers before you send the request, you'll need to subclass the OpenerDirector in order to intercept the transmission.
Hope that helps.
EDIT: I forgot to mention that, once the request has been sent, req.header_items() will give you a list of tuples of ALL the headers, both your own and the ones added by the OpenerDirector. I should have mentioned this first since it's the most straightforward :-) Sorry.
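A Python 3 sketch of the three views on a Request described above, assuming urllib.request in place of urllib2. The X-Custom header and the local test server are illustrative additions, not part of the original answer. One caveat worth checking for yourself: headers that are added later in the chain, such as Connection: close (set inside do_open) and Accept-Encoding (set by the HTTP connection class), are not written back to the Request object, so header_items() does not report them.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class OkHandler(BaseHTTPRequestHandler):
    """Throwaway local server so the example needs no outside network."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # silence per-request logging


server = HTTPServer(("127.0.0.1", 0), OkHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# ProxyHandler({}) ignores any proxy settings from the environment.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

req = urllib.request.Request(
    "http://127.0.0.1:%d/" % server.server_port,
    headers={"X-Custom": "1"},  # hypothetical header, for illustration only
)
with opener.open(req) as resp:
    resp.read()

server.shutdown()
server.server_close()

print(req.headers)            # only the header passed in explicitly
print(req.unredirected_hdrs)  # headers added while the request was opened
print(req.header_items())     # the two dictionaries above merged into a list
```

If the exact bytes on the wire matter, intercepting send() or putheader() on the connection class is the more complete option.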
EDIT 2: After your question about an example for defining your own handler, here's the sample I came up with. The concern in any monkeying with the request chain is that we need to be sure that the handler is safe for multiple requests, which is why I'm uncomfortable just replacing the definition of putheader on the HTTPConnection class directly.
Sadly, because the relevant parts of HTTPConnection and AbstractHTTPHandler are internal, we have to reproduce much of the code from the Python library to inject our custom behaviour. Assuming I've not goofed below and this works as well as it did in my 5 minutes of testing, please be careful to revisit this override whenever you update your Python version, even across a minor revision (i.e. 2.5.x to 2.5.y, or 2.5 to 2.6, etc.).
I should therefore mention that I am on Python 2.5.1. If you have 2.6 or, particularly, 3.0, you may need to adjust this accordingly.
Please let me know if this doesn't work. I'm having waaaayyyy too much fun with this question:
import urllib2
import httplib
import socket


class CustomHTTPConnection(httplib.HTTPConnection):

    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.stored_headers = []

    def putheader(self, header, value):
        self.stored_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)


class HTTPCaptureHeaderHandler(urllib2.AbstractHTTPHandler):

    def http_open(self, req):
        return self.do_open(CustomHTTPConnection, req)

    http_request = urllib2.AbstractHTTPHandler.do_request_

    def do_open(self, http_class, req):
        # All code here lifted directly from the python library
        host = req.get_host()
        if not host:
            # URLError lives in urllib2; the bare name would be a NameError
            raise urllib2.URLError('no host given')

        h = http_class(host)  # will parse host:port
        h.set_debuglevel(self._debuglevel)

        headers = dict(req.headers)
        headers.update(req.unredirected_hdrs)
        headers["Connection"] = "close"
        headers = dict(
            (name.title(), val) for name, val in headers.items())
        try:
            h.request(req.get_method(), req.get_selector(), req.data, headers)
            r = h.getresponse()
        except socket.error, err:  # XXX what error?
            raise urllib2.URLError(err)
        r.recv = r.read
        fp = socket._fileobject(r, close=True)

        resp = urllib2.addinfourl(fp, r.msg, req.get_full_url())
        resp.code = r.status
        resp.msg = r.reason

        # This is the line we're adding
        req.all_sent_headers = h.stored_headers
        return resp


my_handler = HTTPCaptureHeaderHandler()
opener = urllib2.OpenerDirector()
opener.add_handler(my_handler)
req = urllib2.Request(url='http://www.google.com')

resp = opener.open(req)

print req.all_sent_headers
shows: [('Accept-Encoding', 'identity'), ('Host', 'www.google.com'), ('Connection', 'close'), ('User-Agent', 'Python-urllib/2.5')]
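On Python 3, a similar capture seems possible without copying do_open, because do_open only calls whatever it is given as http_class. The sketch below passes a small factory function instead of a class. It is an adaptation under that assumption rather than the original author's code; RecordingHTTPConnection, HeaderCapturingHandler and make_connection are illustrative names, and the local server exists only so the example runs offline.

```python
import threading
import http.client
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class OkHandler(BaseHTTPRequestHandler):
    """Throwaway local server so the example needs no outside network."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # silence per-request logging


class RecordingHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that remembers every header passed to putheader()."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stored_headers = []

    def putheader(self, header, *values):
        self.stored_headers.append((header,) + values)
        super().putheader(header, *values)


class HeaderCapturingHandler(urllib.request.HTTPHandler):
    def http_open(self, req):
        connections = []  # local to this call, so requests share no state

        def make_connection(host, **kwargs):
            conn = RecordingHTTPConnection(host, **kwargs)
            connections.append(conn)
            return conn

        response = self.do_open(make_connection, req)
        # Same attribute name as in the Python 2 sample above.
        req.all_sent_headers = connections[0].stored_headers
        return response


server = HTTPServer(("127.0.0.1", 0), OkHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# ProxyHandler({}) ignores any proxy settings from the environment.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}), HeaderCapturingHandler
)
req = urllib.request.Request("http://127.0.0.1:%d/" % server.server_port)
with opener.open(req) as resp:
    resp.read()

server.shutdown()
server.server_close()

print(req.all_sent_headers)
```

Keeping the list of connections inside http_open is what addresses the multiple-request concern: each call gets its own connection object and its own header list.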
Answer by Justus
How about something like this:
import urllib2
import httplib

old_putheader = httplib.HTTPConnection.putheader

def putheader(self, header, value):
    print header, value
    old_putheader(self, header, value)

httplib.HTTPConnection.putheader = putheader

urllib2.urlopen('http://www.google.com')
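Because this patch changes HTTPConnection for the whole process, it may be worth undoing it once the logging is no longer needed. A minimal Python 3 sketch of that idea follows, using http.client in place of httplib. The helper name logged_putheader and the host www.example.com are assumptions for illustration. Nothing is sent: putrequest() only fills an in-memory buffer, which is enough to trigger the default Host and Accept-Encoding headers.

```python
import contextlib
import http.client

original_putheader = http.client.HTTPConnection.putheader


@contextlib.contextmanager
def logged_putheader(log):
    """Temporarily patch HTTPConnection.putheader to record headers in log."""

    def putheader(self, header, *values):
        log.append((header, values))
        return original_putheader(self, header, *values)

    http.client.HTTPConnection.putheader = putheader
    try:
        yield log
    finally:
        # Always undo the patch so other code sees the stock method again.
        http.client.HTTPConnection.putheader = original_putheader


seen = []
with logged_putheader(seen):
    # No socket is opened until endheaders() or send() is called,
    # so this part runs without any network access.
    conn = http.client.HTTPConnection("www.example.com")
    conn.putrequest("GET", "/")

print(seen)
```

The same context manager can wrap a urllib.request.urlopen call, in which case the log also picks up the headers urllib adds.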
Answer by jedie
A low-level solution:
import httplib


class HTTPConnection2(httplib.HTTPConnection):
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self._request_headers = []
        self._request_header = None

    def putheader(self, header, value):
        self._request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self._request_header = s
        httplib.HTTPConnection.send(self, s)

    def getresponse(self, *args, **kwargs):
        response = httplib.HTTPConnection.getresponse(self, *args, **kwargs)
        response.request_headers = self._request_headers
        response.request_header = self._request_header
        return response
Example:
conn = HTTPConnection2("www.python.org")
conn.request("GET", "/index.html", headers={
    "User-agent": "test",
    "Referer": "/",
})
response = conn.getresponse()
response.status, response.reason:
1: 200 OK
response.request_headers:
[('Host', 'www.python.org'), ('Accept-Encoding', 'identity'), ('Referer', '/'), ('User-agent', 'test')]
response.request_header:
GET /index.html HTTP/1.1
Host: www.python.org
Accept-Encoding: identity
Referer: /
User-agent: test
Answer by jedie
Another solution, which uses the idea from "How do you get default headers in a urllib2 Request?" but doesn't copy code from the standard library:
import httplib
import urllib2


class HTTPConnection2(httplib.HTTPConnection):
    """
    Like httplib.HTTPConnection but stores the request headers.
    Used in HTTPConnection3(), see below.
    """
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.request_headers = []
        self.request_header = ""

    def putheader(self, header, value):
        self.request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self.request_header = s
        httplib.HTTPConnection.send(self, s)


class HTTPConnection3(object):
    """
    Wrapper around HTTPConnection2
    Used in HTTPHandler2(), see below.
    """
    def __call__(self, *args, **kwargs):
        """
        instance made in urllib2.HTTPHandler.do_open()
        """
        self._conn = HTTPConnection2(*args, **kwargs)
        self.request_headers = self._conn.request_headers
        self.request_header = self._conn.request_header
        return self

    def __getattribute__(self, name):
        """
        Redirect attribute access to the local HTTPConnection() instance.
        """
        if name == "_conn":
            return object.__getattribute__(self, name)
        else:
            return getattr(self._conn, name)


class HTTPHandler2(urllib2.HTTPHandler):
    """
    A HTTPHandler which stores the request headers.
    Used HTTPConnection3, see above.

    >>> opener = urllib2.build_opener(HTTPHandler2)
    >>> opener.addheaders = [("User-agent", "Python test")]
    >>> response = opener.open('http://www.python.org/')

    Get the request headers as a list build with HTTPConnection.putheader():
    >>> response.request_headers
    [('Accept-Encoding', 'identity'), ('Host', 'www.python.org'), ('Connection', 'close'), ('User-Agent', 'Python test')]

    >>> response.request_header
    'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.python.org\r\nConnection: close\r\nUser-Agent: Python test\r\n\r\n'
    """
    def http_open(self, req):
        conn_instance = HTTPConnection3()
        response = self.do_open(conn_instance, req)
        response.request_headers = conn_instance.request_headers
        response.request_header = conn_instance.request_header
        return response
EDIT: Update the source
Answer by dcolish
It sounds to me like you're looking for the headers of the response object, which include Connection: close, etc. These headers live in the object returned by urlopen. Getting at them is easy enough:
from urllib2 import urlopen
req = urlopen("http://www.google.com")
print req.headers.headers
req.headers is an instance of httplib.HTTPMessage
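For comparison, a Python 3 sketch of reading the response headers, assuming urllib.request and http.client as the modern counterparts. Note that these are the headers the server sent back, which is a different set from the outgoing request headers the question asks about. The X-Demo header and the local server are made up for the example. Python 3's HTTPMessage is built on email.message.Message, so items() is used here in place of the Python 2 headers.headers list.

```python
import threading
import http.client
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class OkHandler(BaseHTTPRequestHandler):
    """Throwaway local server so the example needs no outside network."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.send_header("X-Demo", "response-side")  # hypothetical header
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # silence per-request logging


server = HTTPServer(("127.0.0.1", 0), OkHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# ProxyHandler({}) ignores any proxy settings from the environment.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open("http://127.0.0.1:%d/" % server.server_port) as resp:
    response_headers = resp.headers  # an http.client.HTTPMessage
    resp.read()

server.shutdown()
server.server_close()

print(type(response_headers))
print(response_headers.items())  # (name, value) pairs sent by the server
```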
Answer by Mykola Kharechko
See urllib2.py: do_request (line 1044 (1067)) and do_open (line 1073). At line 293, self.addheaders = [('User-agent', client_version)] (only 'User-agent' is added).
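The addheaders default can be inspected without sending anything. A small Python 3 sketch, assuming urllib.request.build_opener as the counterpart of the urllib2 opener referred to above; the replacement value my-client/1.0 is a made-up example.

```python
import urllib.request

opener = urllib.request.build_opener()

# Copy the default list before changing it; it holds (name, value) tuples.
default_headers = list(opener.addheaders)
print(default_headers)

# Replacing the list changes what requests through this opener will send.
opener.addheaders = [("User-agent", "my-client/1.0")]
print(opener.addheaders)
```

Headers such as Host, Connection and Accept-Encoding do not appear in this list because they are added later, when the request is actually opened.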
Answer by John T
It should send the default HTTP headers (as specified by w3.org) alongside the ones you specify. You can use a tool like Wireshark if you would like to see them in their entirety.
Edit:
If you would like to log them, you can use WinPcap to capture packets sent by specific applications (in your case, Python). You can also specify the type of packets and many other details.
-John