Bash script to get the HTTP status code of a list of URLs?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/6136022/

Date: 2020-09-09 20:33:40 - Source: igfitidea

Script to get the HTTP status code of a list of urls?

Tags: bash, curl, http-status-codes

Asked by Manu

I have a list of URLs that I need to check, to see if they still work or not. I would like to write a bash script that does that for me.

I only need the returned HTTP status code, i.e. 200, 404, 500 and so forth. Nothing more.

EDIT: Note that there is an issue if the page says "404 not found" but returns a 200 OK message. It's a misconfigured web server, but you may have to consider this case.

For more on this, see Check if a URL goes to a page containing the text "404"

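(A rough way to cover that misconfigured-server case is to also check the response body for the "404" text. The sketch below is not part of the original question; it assumes a file named url-list.txt with one URL per line and uses the literal string "404" as a purely illustrative marker.)

#!/bin/bash
# Sketch: flag URLs that return 200 OK but whose body looks like a "not found" page.
while read -r url; do
  # Append the status code on its own line after the body, then split the two apart.
  response=$(curl --silent --write-out '\n%{http_code}' "$url")
  code=${response##*$'\n'}   # last line: the status code
  body=${response%$'\n'*}    # everything before it: the body
  if [ "$code" = "200" ] && grep -q "404" <<< "$body"; then
    echo "$url: soft 404 (200 OK but the body mentions 404)"
  else
    echo "$url: $code"
  fi
done < url-list.txt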

Answered by Phil

Curl has a specific option, --write-out, for this:

$ curl -o /dev/null --silent --head --write-out '%{http_code}\n' <url>
200
  • -o /dev/null throws away the usual output
  • --silent throws away the progress meter
  • --head makes a HEAD HTTP request, instead of GET
  • --write-out '%{http_code}\n' prints the required status code

To wrap this up in a complete Bash script:

#!/bin/bash
while read -r LINE; do
  curl -o /dev/null --silent --head --write-out "%{http_code} $LINE\n" "$LINE"
done < url-list.txt

(Eagle-eyed readers will notice that this uses one curl process per URL, which imposes fork and TCP connection penalties. It would be faster if multiple URLs were combined in a single curl, but there isn't space to write out the monstrous repetition of options that curl requires to do this.)

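(For the curious, the repetition can be generated instead of typed. The sketch below is not from the original answer: it writes a url/output pair for every line of url-list.txt into a curl config and feeds it to a single curl process via --config -; curl prints the --write-out string once per completed transfer.)

#!/bin/bash
# Sketch: one curl process for the whole list (same url-list.txt as above).
while read -r line; do
  printf 'url = "%s"\noutput = "/dev/null"\n' "$line"
done < url-list.txt |
  curl --silent --head --write-out '%{http_code} %{url_effective}\n' --config -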

Answered by user551168

wget --spider -S "http://url/to/be/checked" 2>&1 | grep "HTTP/" | awk '{print $2}'

prints only the status code for you

Answered by estani

Extending the answer already provided by Phil: adding parallelism to it is a no-brainer in bash if you use xargs for the call.

Here is the code:

xargs -n1 -P 10 curl -o /dev/null --silent --head --write-out '%{url_effective}: %{http_code}\n' < url.lst

-n1: use just one value (from the list) as argument to the curl call

-P10: Keep 10 curl processes alive at any time (i.e. 10 parallel connections)

Check the --write-out parameter in the curl manual for more data you can extract using it (times, etc.).

In case it helps someone, this is the call I'm currently using:

xargs -n1 -P 10 curl -o /dev/null --silent --head --write-out '%{url_effective};%{http_code};%{time_total};%{time_namelookup};%{time_connect};%{size_download};%{speed_download}\n' < url.lst | tee results.csv

It just outputs a bunch of data into a CSV file that can be imported into any office tool.

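If the spreadsheet needs column names, a header row can be written first. A small sketch (the column order simply mirrors the --write-out format above; tee -a is used so the header is not overwritten):

echo "url;status;time_total;time_namelookup;time_connect;size_download;speed_download" > results.csv
xargs -n1 -P 10 curl -o /dev/null --silent --head \
  --write-out '%{url_effective};%{http_code};%{time_total};%{time_namelookup};%{time_connect};%{size_download};%{speed_download}\n' \
  < url.lst | tee -a results.csv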

Answered by Salathiel Genèse

This relies on widely available wget, present almost everywhere, even on Alpine Linux.

wget --server-response --spider --quiet "${url}" 2>&1 | awk 'NR==1{print $2}'

The explanations are as follows:

--quiet

Turn off Wget's output.

Source - wget man pages

--spider

[ ... ] it will not download the pages, just check that they are there. [ ... ]

Source - wget man pages

--server-response

Print the headers sent by HTTP servers and responses sent by FTP servers.

Source - wget man pages

What they don't say about --server-response is that those headers are printed to standard error (stderr), hence the need to redirect them to standard output (2>&1) so they can be piped.

With that output now on standard output, we can pipe it to awk to extract the HTTP status code. That code is:

  • the second ($2) non-blank group of characters: {$2}
  • on the very first line of the header: NR==1

And because we want to print it... {print $2}.

wget --server-response --spider --quiet "${url}" 2>&1 | awk 'NR==1{print $2}'
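
Since the original question is about a list of URLs, the one-liner can be wrapped in a loop. A small sketch (the file name url-list.txt is an assumption):

#!/bin/bash
# Sketch: print "<status> <url>" for every URL in url-list.txt, using wget only.
while read -r url; do
  code=$(wget --server-response --spider --quiet "$url" 2>&1 | awk 'NR==1{print $2}')
  echo "$code $url"
done < url-list.txt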

Answered by dogbane

Use curl to fetch only the HTTP headers (not the whole file) and parse them:

$ curl -I --stderr /dev/null http://www.google.co.uk/index.html | head -1 | cut -d' ' -f2
200

Answered by colinross

wget -S -i *file* will get you the headers from each URL in a file.

Filter through grep for the status code specifically.

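Put together, that could look like the sketch below (wget logs the headers to stderr, hence the redirect; --spider is added here so nothing is actually downloaded, and url-list.txt is an assumed file name):

wget -S --spider -i url-list.txt 2>&1 | grep "HTTP/"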

Answered by Ole Tange

Due to https://mywiki.wooledge.org/BashPitfalls#Non-atomic_writes_with_xargs_-P (output from parallel jobs run under xargs risks being mixed), I would use GNU Parallel instead of xargs to parallelize:

cat url.lst |
  parallel -P0 -q curl -o /dev/null --silent --head --write-out '%{url_effective}: %{http_code}\n' > outfile

In this particular case it may be safe to use xargs because the output is so short; the problem with using xargs is rather that if someone later changes the code to do something bigger, it will no longer be safe. Or if someone reads this question and thinks they can replace curl with something else, then that may also not be safe.

Answered by Yura Loginov

I found a tool called "webchk", written in Python, that returns a status code for each URL in a list: https://pypi.org/project/webchk/

Output looks like this:

$ webchk -i ./dxieu.txt | grep '200'
http://salesforce-case-status.dxi.eu/login ... 200 OK (0.108)
https://support.dxi.eu/hc/en-gb ... 200 OK (0.389)
https://support.dxi.eu/hc/en-gb ... 200 OK (0.401)
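
If it is not installed yet, it can be pulled in from PyPI (per the project page linked above), with -i taking whatever file holds the URL list:

pip install webchk
webchk -i url-list.txt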

Hope that helps!
