How to parse HTTP headers using Bash?

Disclaimer: this page is a rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA terms and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/24943170/

Date: 2020-09-18 10:56:50 · Source: igfitidea


Tags: linux, bash, curl

Asked by jpshook

I need to get 2 values from the headers of a web page that I am fetching using curl. I have been able to get the values individually using:

response1=$(curl -I -s http://www.example.com | grep 'HTTP/1.1' | awk '{print $2}')
response2=$(curl -I -s http://www.example.com | grep 'Server:' | awk '{print $2}')

But I cannot figure out how to grep the values separately using a single curl request like:

response=$(curl -I -s http://www.example.com)
http_status=$response | grep 'HTTP/1.1' | awk '{print $2}'
server=$response | grep 'Server:' | awk '{print $2}'

Every attempt either leads to an error message or empty values. I am sure it is just a syntax issue.

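For what it's worth, the minimal fix for the snippet above is that a shell variable must be printed with echo before it can be piped; `$response` on its own is executed as a command. A sketch, with sample text standing in for the curl output:

```shell
# Sketch of the syntax fix: echo the stored response into the pipeline.
# Sample text stands in for: curl -I -s http://www.example.com
# (real curl -I output has CRLF line endings, which can leave a trailing CR).
response=$'HTTP/1.1 200 OK\nServer: ExampleServer\nContent-Type: text/html'

http_status=$(echo "$response" | grep 'HTTP/1.1' | awk '{print $2}')
server=$(echo "$response" | grep 'Server:' | awk '{print $2}')

echo "$http_status"   # 200
echo "$server"        # ExampleServer
```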
Answered by Sylvain Leroux

A full bash solution. It demonstrates how to easily parse other headers without requiring awk:

shopt -s extglob # Required to trim whitespace; see below

while IFS=':' read key value; do
    # trim whitespace in "value"
    value=${value##+([[:space:]])}; value=${value%%+([[:space:]])}

    case "$key" in
        Server) SERVER="$value"
                ;;
        Content-Type) CT="$value"
                ;;
        HTTP*) read PROTO STATUS MSG <<< "$key${value:+:$value}"
                ;;
     esac
done < <(curl -sI http://www.google.com)
echo $STATUS
echo $SERVER
echo $CT

Producing:

302
GFE/2.0
text/html; charset=UTF-8
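The extglob-based whitespace trimming used in the loop above can be checked in isolation (a minimal sketch):

```shell
# Requires extglob so that +([[:space:]]) matches a run of whitespace chars.
shopt -s extglob

value='   padded value   '
value=${value##+([[:space:]])}   # strip the longest run of leading whitespace
value=${value%%+([[:space:]])}   # strip the longest run of trailing whitespace
echo "[$value]"   # [padded value]
```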


According to RFC-2616, HTTP headers are modeled as described in the "Standard for the Format of ARPA Internet Text Messages" (RFC822), which states clearly in section 3.1.2:

The field-name must be composed of printable ASCII characters (i.e., characters that have values between 33. and 126., decimal, except colon). The field-body may be composed of any ASCII characters, except CR or LF. (While CR and/or LF may be present in the actual text, they are removed by the action of unfolding the field.)

So the above script should catch any RFC-[2]822-compliant header, with the notable exception of folded headers.

Answered by rici

If you wanted to extract more than a couple of headers, you could stuff all the headers into a bash associative array. Here's a simple-minded function which assumes that any given header only occurs once. (Don't use it for Set-Cookie; see below.)

# Call this as: headers ARRAY URL
headers () {
  {
    # (Re)define the specified variable as an associative array.
    unset $1;
    declare -gA $1;
    local line rest

    # Get the first line, assuming HTTP/1.0 or above. Note that these fields
    # have Capitalized names.
    IFS=$' \t\n\r' read $1[Proto] $1[Status] rest
    # Drop the CR from the message, if there was one.
    declare -gA $1[Message]="${rest%$'\r'}"
    # Now read the rest of the headers. 
    while true; do
      # Get rid of the trailing CR if there is one.
      IFS=$'\r' read line rest;
      # Stop when we hit an empty line
      if [[ -z $line ]]; then break; fi
      # Make sure it looks like a header
      # This regex also strips leading and trailing spaces from the value
      if [[ $line =~ ^([[:alnum:]_-]+):\ *(( *[^ ]+)*)\ *$ ]]; then
        # Force the header to lower case, since headers are case-insensitive,
        # and store it into the array
        declare -gA $1[${BASH_REMATCH[1],,}]="${BASH_REMATCH[2]}"
      else
        printf "Ignoring non-header line: %q\n" "$line" >> /dev/stderr
      fi
    done
  } < <(curl -Is "$2")
}

Example:

例子:

$ headers so http://stackoverflow.com/
$ for h in ${!so[@]}; do printf "%s=%s\n" $h "${so[$h]}"; done | sort
Message=OK
Proto=HTTP/1.1
Status=200
cache-control=public, no-cache="Set-Cookie", max-age=43
content-length=224904
content-type=text/html; charset=utf-8
date=Fri, 25 Jul 2014 17:35:16 GMT
expires=Fri, 25 Jul 2014 17:36:00 GMT
last-modified=Fri, 25 Jul 2014 17:35:00 GMT
set-cookie=prov=205fd7f3-10d4-4197-b03a-252b60df7653; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
vary=*
x-frame-options=SAMEORIGIN

Note that the SO response includes one or more cookies, in Set-Cookie headers, but we can only see the last one because the naive script overwrites entries with the same header name. (As it happens, there was only one, but we can't know that.) While it would be possible to augment the script to special-case Set-Cookie, a better approach would probably be to provide a cookie-jar file, and use the -b and -c curl options in order to maintain it.

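If you do want every cookie rather than just the last one, one way around the overwriting problem is to collect the repeated Set-Cookie lines into an indexed array instead of a single associative-array slot. A sketch, with sample header text standing in for the curl -Is output:

```shell
# Collect every Set-Cookie header into an array instead of keeping only the last.
# Sample header text stands in for: curl -Is http://stackoverflow.com/
sample=$'HTTP/1.1 200 OK\r\nSet-Cookie: a=1; path=/\r\nSet-Cookie: b=2; path=/\r\nVary: *\r\n'

cookies=()
while IFS= read -r line; do
    line=${line%$'\r'}                          # drop the trailing CR, if any
    case $line in
        Set-Cookie:*) cookies+=("${line#Set-Cookie: }") ;;
    esac
done <<< "$sample"

printf '%s\n' "${cookies[@]}"
# a=1; path=/
# b=2; path=/
```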
Answered by Sylvain Leroux

Using process substitution (<( ... )), you are able to read into shell variables:

sh$ read STATUS SERVER < <(
      curl -sI http://www.google.com | 
      awk '/^HTTP/ { STATUS = $2 } 
           /^Server:/ { SERVER = $2 } 
           END { printf("%s %s\n",STATUS, SERVER) }'
    )

sh$ echo $STATUS
302
sh$ echo $SERVER
GFE/2.0
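The same read-from-process-substitution pattern can be exercised offline, with printf standing in for curl (a sketch; note that awk's own variables need no $ when assigned, while $2 refers to the second field of the current line):

```shell
# Offline version of the read-from-process-substitution pattern;
# printf stands in for: curl -sI http://www.google.com
read STATUS SERVER < <(
    printf 'HTTP/1.1 302 Found\nServer: GFE/2.0\n' |
    awk '/^HTTP/    { STATUS = $2 }
         /^Server:/ { SERVER = $2 }
         END        { printf("%s %s\n", STATUS, SERVER) }'
)
echo "$STATUS"   # 302
echo "$SERVER"   # GFE/2.0
```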