How to parse HTTP headers using Bash?

Disclaimer: this page is a rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA terms and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/24943170/

Date: 2020-09-18 10:56:50 · Source: igfitidea


Tags: linux, bash, curl

Asked by jpshook

I need to get 2 values from the headers of a web page that I am fetching using curl. I have been able to get the values individually using:

response1=$(curl -I -s http://www.example.com | grep 'HTTP/1.1' | awk '{print $2}')
response2=$(curl -I -s http://www.example.com | grep 'Server:' | awk '{print $2}')

But I cannot figure out how to grep the values separately using a single curl request like:

response=$(curl -I -s http://www.example.com)
http_status=$response | grep 'HTTP/1.1' | awk '{print $2}'
server=$response | grep 'Server:' | awk '{print $2}'

Every attempt either leads to an error message or empty values. I am sure it is just a syntax issue.

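For what it's worth, the minimal fix for the snippet above is that a shell variable must be printed with echo before it can be piped; `$response` on its own is executed as a command. A sketch, with sample text standing in for the curl output:

```shell
# Sketch of the syntax fix: echo the stored response into the pipeline.
# Sample text stands in for: curl -I -s http://www.example.com
# (real curl -I output has CRLF line endings, which can leave a trailing CR).
response=$'HTTP/1.1 200 OK\nServer: ExampleServer\nContent-Type: text/html'

http_status=$(echo "$response" | grep 'HTTP/1.1' | awk '{print $2}')
server=$(echo "$response" | grep 'Server:' | awk '{print $2}')

echo "$http_status"   # 200
echo "$server"        # ExampleServer
```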
Answered by Sylvain Leroux

A full bash solution. It demonstrates how to easily parse other headers without requiring awk:

shopt -s extglob # Required to trim whitespace; see below

while IFS=':' read key value; do
    # trim whitespace in "value"
    value=${value##+([[:space:]])}; value=${value%%+([[:space:]])}

    case "$key" in
        Server) SERVER="$value"
                ;;
        Content-Type) CT="$value"
                ;;
        HTTP*) read PROTO STATUS MSG <<< "$key${value:+:$value}"
                ;;
     esac
done < <(curl -sI http://www.google.com)
echo $STATUS
echo $SERVER
echo $CT

Producing:

302
GFE/2.0
text/html; charset=UTF-8
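The extglob-based whitespace trimming used in the loop above can be checked in isolation (a minimal sketch):

```shell
# Requires extglob so that +([[:space:]]) matches a run of whitespace chars.
shopt -s extglob

value='   padded value   '
value=${value##+([[:space:]])}   # strip the longest run of leading whitespace
value=${value%%+([[:space:]])}   # strip the longest run of trailing whitespace
echo "[$value]"   # [padded value]
```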


According to RFC-2616, HTTP headers are modeled as described in the "Standard for the Format of ARPA Internet Text Messages" (RFC822), which states clearly in section 3.1.2:

The field-name must be composed of printable ASCII characters (i.e., characters that have values between 33. and 126., decimal, except colon). The field-body may be composed of any ASCII characters, except CR or LF. (While CR and/or LF may be present in the actual text, they are removed by the action of unfolding the field.)

So the above script should catch any RFC-[2]822-compliant header, with the notable exception of folded headers.

Answered by rici

If you wanted to extract more than a couple of headers, you could stuff all the headers into a bash associative array. Here's a simple-minded function which assumes that any given header only occurs once. (Don't use it for Set-Cookie; see below.)

# Call this as: headers ARRAY URL
headers () {
  {
    # (Re)define the specified variable as an associative array.
    unset $1;
    declare -gA $1;
    local line rest

    # Get the first line, assuming HTTP/1.0 or above. Note that these fields
    # have Capitalized names.
    IFS=$' \t\n\r' read $1[Proto] $1[Status] rest
    # Drop the CR from the message, if there was one.
    declare -gA $1[Message]="${rest%$'\r'}"
    # Now read the rest of the headers. 
    while true; do
      # Get rid of the trailing CR if there is one.
      IFS=$'\r' read line rest;
      # Stop when we hit an empty line
      if [[ -z $line ]]; then break; fi
      # Make sure it looks like a header
      # This regex also strips leading and trailing spaces from the value
      if [[ $line =~ ^([[:alnum:]_-]+):\ *(( *[^ ]+)*)\ *$ ]]; then
        # Force the header to lower case, since headers are case-insensitive,
        # and store it into the array
        declare -gA $1[${BASH_REMATCH[1],,}]="${BASH_REMATCH[2]}"
      else
        printf "Ignoring non-header line: %q\n" "$line" >> /dev/stderr
      fi
    done
  } < <(curl -Is "$2")
}

Example:

例子:

$ headers so http://stackoverflow.com/
$ for h in ${!so[@]}; do printf "%s=%s\n" $h "${so[$h]}"; done | sort
Message=OK
Proto=HTTP/1.1
Status=200
cache-control=public, no-cache="Set-Cookie", max-age=43
content-length=224904
content-type=text/html; charset=utf-8
date=Fri, 25 Jul 2014 17:35:16 GMT
expires=Fri, 25 Jul 2014 17:36:00 GMT
last-modified=Fri, 25 Jul 2014 17:35:00 GMT
set-cookie=prov=205fd7f3-10d4-4197-b03a-252b60df7653; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
vary=*
x-frame-options=SAMEORIGIN

Note that the SO response includes one or more cookies, in Set-Cookie headers, but we can only see the last one because the naive script overwrites entries with the same header name. (As it happens, there was only one, but we can't know that.) While it would be possible to augment the script to special-case Set-Cookie, a better approach would probably be to provide a cookie-jar file, and use the -b and -c curl options in order to maintain it.

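If you do want every cookie rather than just the last one, one way around the overwriting problem is to collect the repeated Set-Cookie lines into an indexed array instead of a single associative-array slot. A sketch, with sample header text standing in for the curl -Is output:

```shell
# Collect every Set-Cookie header into an array instead of keeping only the last.
# Sample header text stands in for: curl -Is http://stackoverflow.com/
sample=$'HTTP/1.1 200 OK\r\nSet-Cookie: a=1; path=/\r\nSet-Cookie: b=2; path=/\r\nVary: *\r\n'

cookies=()
while IFS= read -r line; do
    line=${line%$'\r'}                          # drop the trailing CR, if any
    case $line in
        Set-Cookie:*) cookies+=("${line#Set-Cookie: }") ;;
    esac
done <<< "$sample"

printf '%s\n' "${cookies[@]}"
# a=1; path=/
# b=2; path=/
```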
Answered by Sylvain Leroux

Using process substitution (<( ... )), you are able to read into shell variables:

sh$ read STATUS SERVER < <(
      curl -sI http://www.google.com | 
      awk '/^HTTP/ { STATUS = $2 } 
           /^Server:/ { SERVER = $2 } 
           END { printf("%s %s\n",STATUS, SERVER) }'
    )

sh$ echo $STATUS
302
sh$ echo $SERVER
GFE/2.0
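The same read-from-process-substitution pattern can be exercised offline, with printf standing in for curl (a sketch; note that awk's own variables need no $ when assigned, while $2 refers to the second field of the current line):

```shell
# Offline version of the read-from-process-substitution pattern;
# printf stands in for: curl -sI http://www.google.com
read STATUS SERVER < <(
    printf 'HTTP/1.1 302 Found\nServer: GFE/2.0\n' |
    awk '/^HTTP/    { STATUS = $2 }
         /^Server:/ { SERVER = $2 }
         END        { printf("%s %s\n", STATUS, SERVER) }'
)
echo "$STATUS"   # 302
echo "$SERVER"   # GFE/2.0
```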