Python: parsing raw HTTP headers

Question

Asked by Cev

I have a string of raw HTTP and I would like to represent the fields in an object. Is there any way to parse the individual headers from an HTTP string?

'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n
[...]'

Answer 1

Accepted answer by Brandon Rhodes

Update: It's 2019, so I have rewritten this answer for Python 3, following a confused comment from a programmer trying to use the code. The original Python 2 code is now down at the bottom of the answer.

There are excellent tools in the Standard Library both for parsing RFC 822-style headers and for parsing entire HTTP requests. Here is an example request string (note that Python treats it as one big string, even though we are breaking it across several lines for readability) that we can feed to my examples:

request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)

As @TryPyPy points out, you can use Python's email message library to parse the headers, though we should add that the resulting Message object acts like a dictionary of headers once you are done creating it:

from email.parser import BytesParser
request_line, headers_alone = request_text.split(b'\r\n', 1)
headers = BytesParser().parsebytes(headers_alone)

print(len(headers))     # -> "3"
print(headers.keys())   # -> ['Host', 'Accept-Charset', 'Accept']
print(headers['Host'])  # -> "cm.bell-labs.com"
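
To make the "acts like a dictionary" remark concrete, here is a small sketch of the lookups a Message supports. The header name X-Not-There and the variable names are made up for illustration:

```python
from email.parser import BytesParser

# Same sample request as above, repeated so this sketch runs on its own.
request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)

request_line, headers_alone = request_text.split(b'\r\n', 1)
headers = BytesParser().parsebytes(headers_alone)

# Lookups ignore case, and a missing header gives None instead of raising.
host = headers['HOST']
missing = headers['X-Not-There']            # hypothetical header name
fallback = headers.get('X-Not-There', 'default')

# A plain dict snapshot; a dict keeps only one value per repeated name.
as_dict = dict(headers.items())
print(host, missing, fallback)
print(as_dict)
```

Converting to a dict is convenient, but it gives up the case-insensitive lookup that the Message object provides.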

But this, of course, ignores the request line, or makes you parse it yourself. It turns out that there is a much better solution.

The Standard Library will parse HTTP for you if you use its BaseHTTPRequestHandler. Though its documentation is a bit obscure (a problem with the whole suite of HTTP and URL tools in the Standard Library), all you have to do to make it parse a string is (a) wrap your string in a BytesIO(), (b) read the raw_requestline so that it stands ready to be parsed, and (c) capture any error codes that occur during parsing instead of letting it try to write them back to the client (since we do not have one!).

So here is our specialization of the Standard Library class:

from http.server import BaseHTTPRequestHandler
from io import BytesIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

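    def send_error(self, code, message=None, explain=None):
        # Python 3's parse_request() may pass a third "explain" argument,
        # so accept it here; we only record the code and message.
        self.error_code = code
        self.error_message = message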

Again, I wish the Standard Library folks had realized that HTTP parsing should be broken out in a way that did not require us to write nine lines of code to properly call it, but what can you do? Here is how you would use this simple class:

# Using this new class is really easy!

request = HTTPRequest(request_text)

print(request.error_code)       # None  (check this first)
print(request.command)          # "GET"
print(request.path)             # "/who/ken/trust.html"
print(request.request_version)  # "HTTP/1.1"
print(len(request.headers))     # 3
print(request.headers.keys())   # ['Host', 'Accept-Charset', 'Accept']
print(request.headers['host'])  # "cm.bell-labs.com"
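
The request in the question carries a query string, which request.path keeps verbatim. As a possible follow-up step (not part of the original answer), the sketch below splits it with urllib.parse. The class is repeated so the example runs on its own, and the names raw, parts and query are illustrative:

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO
from urllib.parse import urlsplit, parse_qs


class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message


# A shortened version of the request from the question.
raw = (
    b'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    b'Host: www.google.com\r\n'
    b'Connection: keep-alive\r\n'
    b'\r\n'
)

request = HTTPRequest(raw)
parts = urlsplit(request.path)   # separates the path from the query string
query = parse_qs(parts.query)    # maps each parameter to a list of values

print(parts.path)
print(query)
```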

If there is an error during parsing, the error_code will not be None:

# Parsing can result in an error code and message

request = HTTPRequest(b'GET\r\nHeader: Value\r\n\r\n')

print(request.error_code)     # 400
print(request.error_message)  # "Bad request syntax ('GET')"
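
Since error_code has to be checked before anything else is trusted, one option is to wrap the check in a small helper. This is only a sketch: parse_or_raise is a hypothetical name, and raising ValueError is a design choice rather than anything the Standard Library prescribes:

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class HTTPRequest(BaseHTTPRequestHandler):
    """Same idea as the class above, repeated so this sketch runs alone."""

    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message


def parse_or_raise(request_text):
    """Hypothetical helper: return the parsed request or raise ValueError."""
    request = HTTPRequest(request_text)
    if request.error_code is not None:
        raise ValueError(f'{int(request.error_code)}: {request.error_message}')
    return request


good = parse_or_raise(b'GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n')
print(good.command, good.path)

try:
    parse_or_raise(b'GET\r\nHeader: Value\r\n\r\n')
    failure = None
except ValueError as exc:
    failure = str(exc)
print(failure)
```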

I prefer using the Standard Library like this because I suspect that they have already encountered and resolved any edge cases that might bite me if I try re-implementing an Internet specification myself with regular expressions.

Old Python 2 code

Here's the original code for this answer, back when I first wrote it:

request_text = (
    'GET /who/ken/trust.html HTTP/1.1\r\n'
    'Host: cm.bell-labs.com\r\n'
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    'Accept: text/html;q=0.9,text/plain\r\n'
    '\r\n'
    )

And:

# Ignore the request line and parse only the headers

from mimetools import Message
from StringIO import StringIO
request_line, headers_alone = request_text.split('\r\n', 1)
headers = Message(StringIO(headers_alone))

print len(headers)     # -> "3"
print headers.keys()   # -> ['accept-charset', 'host', 'accept']
print headers['Host']  # -> "cm.bell-labs.com"

And:

from BaseHTTPServer import BaseHTTPRequestHandler
from StringIO import StringIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = StringIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message

And:

# Using this new class is really easy!

request = HTTPRequest(request_text)

print request.error_code       # None  (check this first)
print request.command          # "GET"
print request.path             # "/who/ken/trust.html"
print request.request_version  # "HTTP/1.1"
print len(request.headers)     # 3
print request.headers.keys()   # ['accept-charset', 'host', 'accept']
print request.headers['host']  # "cm.bell-labs.com"

And:

# Parsing can result in an error code and message

request = HTTPRequest('GET\r\nHeader: Value\r\n\r\n')

print request.error_code     # 400
print request.error_message  # "Bad request syntax ('GET')"

Answer 2

Answered by TryPyPy

This seems to work fine if you strip the GET line:

import mimetools
from StringIO import StringIO

he = "Host: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n"

m = mimetools.Message(StringIO(he))

print m.headers

A way to parse your example and add information from the first line to the object would be:

import mimetools
from StringIO import StringIO

he = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\n'

# Pop the first line for further processing
request, he = he.split('\r\n', 1)    

# Get the headers
m = mimetools.Message(StringIO(he))

# Add request information
m.dict['method'], m.dict['path'], m.dict['http-version'] = request.split()    

print m['method'], m['path'], m['http-version']
print m['Connection']
print m.headers
print m.dict
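
The mimetools module used above exists only in Python 2. A rough Python 3 counterpart of the same idea can be sketched with email.message_from_string. The parsed dict and its key names simply mirror the ones used above and are not a standard structure:

```python
from email import message_from_string

raw = (
    'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    'Host: www.google.com\r\n'
    'Connection: keep-alive\r\n'
)

# Pop the first line for further processing.
request_line, header_text = raw.split('\r\n', 1)
method, path, version = request_line.split()

# Parse the remaining lines as headers.
headers = message_from_string(header_text)

# Combine the request information and the headers in one plain dict.
parsed = {'method': method, 'path': path, 'http-version': version}
parsed.update(headers.items())  # header names keep the case they arrived in

print(parsed['method'], parsed['path'], parsed['http-version'])
print(headers['Connection'])
```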

Answer 3

Answered by Gowtham

mimetools has been deprecated since Python 2.3 and was removed entirely in Python 3.

Here is how you should do it in Python 3:

import email
import io
import pprint

# […]

request_line, headers_alone = request_text.split('\r\n', 1)
message = email.message_from_file(io.StringIO(headers_alone))
headers = dict(message.items())
pprint.pprint(headers, width=160)

Answer 4

Answered by jmunsch

Using Python 3.7, urllib3.HTTPResponse, and http.client.parse_headers, with curl supplying the raw response (-i includes the response headers in the output and -L follows redirects):

curl -i -L -X GET "http://httpbin.org/relative-redirect/3" |  python -c '
import sys
from io import BytesIO
from urllib3 import HTTPResponse
from http.client import parse_headers

rawresponse = sys.stdin.read().encode("utf8")
redirects = []

while True:
    header, body = rawresponse.split(b"\r\n\r\n", 1)
    if body[:4] == b"HTTP":
        redirects.append(header)
        rawresponse = body
    else:
        break

f = BytesIO(header)
# read one line for HTTP/2 STATUSCODE MESSAGE
requestline = f.readline().split(b" ")
protocol, status = requestline[:2]
headers = parse_headers(f)

resp = HTTPResponse(body, headers=headers)
resp.status = int(status)

print("headers")
print(resp.headers)

print("redirects")
print(redirects)
'

Output:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   215  100   215    0     0    435      0 --:--:-- --:--:-- --:--:--   435

headers
HTTPHeaderDict({'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Date': 'Thu, 20 Sep 2018 05:39:25 GMT', 'Content-Type': 'application/json', 'Content-Length': '215', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'})
redirects
[b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /relative-redirect/2\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur',
 b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /relative-redirect/1\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur',
 b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /get\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur']
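
If urllib3 is not available, the status line and headers of a single response can still be pulled apart with the Standard Library alone. This is a minimal sketch on a shortened copy of one of the redirect responses; it handles one response only and does not follow the redirect chain:

```python
from io import BytesIO
from http.client import parse_headers

raw = (
    b'HTTP/1.1 302 FOUND\r\n'
    b'Content-Length: 0\r\n'
    b'Location: /relative-redirect/2\r\n'
    b'\r\n'
)

fp = BytesIO(raw)

# The first line is the status line: protocol, status code, reason phrase.
status_line = fp.readline().decode('iso-8859-1').rstrip('\r\n')
protocol, status, reason = status_line.split(' ', 2)

# parse_headers() reads from the file object up to the blank line.
headers = parse_headers(fp)

print(protocol, int(status), reason)
print(headers['Location'])
```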

Answer 5

Answered by Misha Shaygu

In Python 3:

from email import message_from_string    
# "sock" is assumed to be an already-connected socket.socket instance
data = sock.recv(4096)
headers = message_from_string(str(data, 'ASCII').split('\r\n', 1)[1])
print(headers['Host'])
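
A single recv() call is not guaranteed to return the whole header block, and the data may contain a body after the blank line. The sketch below cuts the data at the first blank line before parsing. It uses socket.socketpair() only to stay self-contained; that setup and the variable names are assumptions for illustration, not part of the original answer:

```python
import socket
from email import message_from_string

# A connected pair of sockets stands in for a real client connection.
client, server = socket.socketpair()
client.sendall(b'GET / HTTP/1.1\r\nHost: example.com\r\n\r\nbody bytes')

data = server.recv(4096)

# Keep only what comes before the first blank line (request line + headers).
head = data.split(b'\r\n\r\n', 1)[0]
request_line, _, header_block = head.partition(b'\r\n')

# ISO-8859-1 maps every byte to a character, so decoding cannot fail.
headers = message_from_string(header_block.decode('iso-8859-1'))
print(request_line)
print(headers['Host'])

client.close()
server.close()
```

In real code the receive step would loop until the blank line has arrived.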

Answer 6

Answered by Wellington Rats

In a Pythonic way:

request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)

print({ k:v.strip() for k,v in [line.split(":",1) 
        for line in request_text.decode().splitlines() if ":" in line]})
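
One caveat with a plain dict is that keys are case-sensitive and a repeated header keeps only its last value. The sketch below, using made-up sample headers, compares the comprehension with the email parser from the accepted answer:

```python
from email.parser import BytesParser

# Sample headers with a repeated name, made up for this comparison.
raw_headers = (
    b'Host: example.com\r\n'
    b'Accept: text/html\r\n'
    b'Accept: application/json\r\n'
    b'\r\n'
)

# The dict comprehension: later duplicates overwrite earlier ones.
naive = {k: v.strip() for k, v in
         [line.split(':', 1)
          for line in raw_headers.decode().splitlines() if ':' in line]}

# The email parser keeps every occurrence and ignores case on lookup.
message = BytesParser().parsebytes(raw_headers)
all_accept = message.get_all('accept')

print(naive)
print(naive.get('host'))   # the key is 'Host', so this lookup finds nothing
print(all_accept)
```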

Answer 7

Answered by Ousret

There is another, simpler and safer way to handle headers: more object-oriented, with no need for manual parsing.

Short demo.

1. Parse them

From str, bytes, fp, dict, requests.Response, email.Message, httpx.Response, urllib3.HTTPResponse.

from requests import get
from kiss_headers import parse_it

response = get('https://www.google.fr')
headers = parse_it(response)

headers.content_type.charset  # output: ISO-8859-1
# It's the same as
headers["content-type"]["charset"]  # output: ISO-8859-1

2. Build them

This

from kiss_headers import *

headers = (
    Host("developer.mozilla.org")
    + UserAgent(
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0"
    )
    + Accept("text/html")
    + Accept("application/xhtml+xml")
    + Accept("application/xml", qualifier=0.9)
    + Accept(qualifier=0.8)
    + AcceptLanguage("en-US")
    + AcceptLanguage("en", qualifier=0.5)
    + AcceptEncoding("gzip")
    + AcceptEncoding("deflate")
    + AcceptEncoding("br")
    + Referer("https://developer.mozilla.org/testpage.html")
    + Connection(should_keep_alive=True)
    + UpgradeInsecureRequests()
    + IfModifiedSince("Mon, 18 Jul 2016 02:36:04 GMT")
    + IfNoneMatch("c561c68d0ba92bbeb8b0fff2a9199f722e3a621a")
    + CacheControl(max_age=0)
)

raw_headers = str(headers)

Will become

Host: developer.mozilla.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: text/html, application/xhtml+xml, application/xml; q="0.9", */*; q="0.8"
Accept-Language: en-US, en; q="0.5"
Accept-Encoding: gzip, deflate, br
Referer: https://developer.mozilla.org/testpage.html
Connection: keep-alive
Upgrade-Insecure-Requests: 1
If-Modified-Since: Mon, 18 Jul 2016 02:36:04 GMT
If-None-Match: "c561c68d0ba92bbeb8b0fff2a9199f722e3a621a"
Cache-Control: max-age="0"

Documentation for the kiss-headers library.
