
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow, original address: http://stackoverflow.com/questions/9899760/

Date: 2020-09-18 01:53:14  Source: igfitidea

Use sed and wget to retrieve links only

Tags: linux, bash, sed

Asked by leeman24

What I need to do is retrieve a link through a command such as:


wget --quiet -O - linkname


Then pipe it to sed to display ONLY the links on the page, not the formatting.


What I've got so far only displays the matching lines with all the HTML code alongside them.


Answered by kev

You can pipe the result to grep with the -o (match-only) option:


$ wget --quiet -O - http://stackoverflow.com | grep -o 'http://[^"]*'

To get all URLs inside href="...":


grep -oP '(?<=href=")[^"]*(?=")'
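As a sketch of the lookaround version (assuming GNU grep, which provides -P, and a hypothetical local file sample.html standing in for the wget output):

```shell
# Create a small HTML sample (a stand-in for the page wget would fetch).
printf '<a href="http://example.com/a">A</a>\n<a href="/local/b">B</a>\n' > sample.html

# -o prints only the matched text; -P enables Perl lookbehind/lookahead,
# so the surrounding href="..." quotes anchor the match but are not printed.
grep -oP '(?<=href=")[^"]*(?=")' sample.html
# http://example.com/a
# /local/b
```

Unlike the first grep, this also catches relative links, since it does not require the URL to start with http://.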

Answered by leeman24

I believe this is what I was looking for.


sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp"
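A minimal sketch of this sed approach (assuming the replacement is the captured group \1, which prints the bare URL; page.html is a hypothetical input file):

```shell
# One href per line works best: the greedy .* means that on a line with
# several links only the last URL survives.
printf '<p><a href="http://example.com/x">x</a></p>\n' > page.html

# -n suppresses normal output; the p flag prints only substituted lines,
# and \1 is the URL captured between the quotes (single or double).
sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp" page.html
# http://example.com/x
```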

Answered by kerkael

grep "<a href=" sourcepage.html \
  | sed "s/<a href/\n<a href/g" \
  | sed 's/\"/\"><\/a>\n/2' \
  | grep href \
  | sort | uniq
  1. The first grep looks for lines containing URLs. You can add more patterns afterwards if you want only local pages (relative paths rather than http URLs).
  2. The first sed adds a newline in front of each "a href" URL tag with the \n.
  3. The second sed shortens each URL by replacing the 2nd " on the line with a closing /a tag followed by a newline. The two seds leave each URL on its own line, but with garbage around it, so
  4. the 2nd grep for href cleans the mess up.
  5. The sort and uniq give you one instance of each URL present in sourcepage.html.
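Putting the steps above together on a tiny hypothetical sourcepage.html (assuming GNU sed, which interprets \n in the replacement as a newline):

```shell
# Two links on one line, in reverse alphabetical order so the sort is visible.
printf '<p><a href="http://b.example">B</a> and <a href="http://a.example">A</a></p>\n' > sourcepage.html

grep "<a href=" sourcepage.html \
  | sed "s/<a href/\n<a href/g" \
  | sed 's/\"/\"><\/a>\n/2' \
  | grep href \
  | sort | uniq
# <a href="http://a.example"></a>
# <a href="http://b.example"></a>
```

Note that the result is stripped-down <a href="url"></a> tags rather than bare URLs: one per line, sorted and deduplicated.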