
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow, original address: http://stackoverflow.com/questions/9899760/

Date: 2020-09-18 01:53:14  Source: igfitidea

Use sed and wget to retrieve links only

Tags: linux, bash, sed

Asked by leeman24

What I need to do is retrieve a link through a command such as:


wget --quiet -O - linkname


Then pipe it to sed to display ONLY the links on the page, not the formatting.


What I've got so far only displays the matching lines with all the HTML code alongside them.


Answered by kev

You can pipe the result to grep with the -o (match-only) option:


$ wget --quiet -O - http://stackoverflow.com | grep -o 'http://[^"]*'

To get all URLs inside href="...":


grep -oP '(?<=href=")[^"]*(?=")'
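As a sketch of the lookaround version (assuming GNU grep, which provides -P, and a hypothetical local file sample.html standing in for the wget output):

```shell
# Create a small HTML sample (a stand-in for the page wget would fetch).
printf '<a href="http://example.com/a">A</a>\n<a href="/local/b">B</a>\n' > sample.html

# -o prints only the matched text; -P enables Perl lookbehind/lookahead,
# so the surrounding href="..." quotes anchor the match but are not printed.
grep -oP '(?<=href=")[^"]*(?=")' sample.html
# http://example.com/a
# /local/b
```

Unlike the first grep, this also catches relative links, since it does not require the URL to start with http://.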

Answered by leeman24

I believe this is what I was looking for.


sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp"
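A minimal sketch of this sed approach (assuming the replacement is the captured group \1, which prints the bare URL; page.html is a hypothetical input file):

```shell
# One href per line works best: the greedy .* means that on a line with
# several links only the last URL survives.
printf '<p><a href="http://example.com/x">x</a></p>\n' > page.html

# -n suppresses normal output; the p flag prints only substituted lines,
# and \1 is the URL captured between the quotes (single or double).
sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp" page.html
# http://example.com/x
```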

Answered by kerkael

grep "<a href=" sourcepage.html \
  | sed "s/<a href/\n<a href/g" \
  | sed 's/\"/\"><\/a>\n/2' \
  | grep href \
  | sort | uniq
  1. The first grep looks for lines containing URLs. You can add more patterns afterwards if you want only local pages (relative paths rather than http URLs).
  2. The first sed adds a newline in front of each "a href" URL tag with the \n.
  3. The second sed shortens each URL by replacing the 2nd " on the line with a closing /a tag followed by a newline. The two seds leave each URL on its own line, but with garbage around it, so
  4. the 2nd grep for href cleans the mess up.
  5. The sort and uniq give you one instance of each URL present in sourcepage.html.
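Putting the steps above together on a tiny hypothetical sourcepage.html (assuming GNU sed, which interprets \n in the replacement as a newline):

```shell
# Two links on one line, in reverse alphabetical order so the sort is visible.
printf '<p><a href="http://b.example">B</a> and <a href="http://a.example">A</a></p>\n' > sourcepage.html

grep "<a href=" sourcepage.html \
  | sed "s/<a href/\n<a href/g" \
  | sed 's/\"/\"><\/a>\n/2' \
  | grep href \
  | sort | uniq
# <a href="http://a.example"></a>
# <a href="http://b.example"></a>
```

Note that the result is stripped-down <a href="url"></a> tags rather than bare URLs: one per line, sorted and deduplicated.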