Use sed and wget to retrieve links only

Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/9899760/
Asked by leeman24
What I need to do is retrieve a link through a command such as:
wget --quiet -O - linkname
Then pipe it to sed to display ONLY the links on the page, not the formatting.
What I have so far only displays the matching lines with all the HTML code alongside the links.
Answered by kev
You can pipe the result to grep with the -o (match-only) option:
$ wget --quiet -O - http://stackoverflow.com | grep -o 'http://[^"]*'
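Note that the pattern http://[^"]* only matches plain http links; a slightly broader pattern (a sketch, not part of the original answer) also catches https:

$ wget --quiet -O - http://stackoverflow.com | grep -oE 'https?://[^"]*'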
To get all URLs inside href="...":
grep -oP '(?<=href=")[^"]*(?=")'
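For example, piping the wget output from the question into this grep (a sketch; the StackOverflow URL is only a placeholder for the page you want to scan):

$ wget --quiet -O - http://stackoverflow.com | grep -oP '(?<=href=")[^"]*(?=")'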
Answered by leeman24
I believe this is what I was looking for.
sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp"
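Combined with the wget call from the question, the full pipeline would look like this (a sketch; linkname stands for the actual URL):

wget --quiet -O - linkname | sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp"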
Answered by kerkael
grep "<a href=" sourcepage.html
|sed "s/<a href/\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
- The first grep looks for lines containing URLs. You can add more patterns afterwards if you only want local pages, i.e. no http, just relative paths.
- The first sed adds a newline (\n) in front of each <a href tag, putting every tag at the start of its own line.
- The second sed replaces the 2nd " on each line with "></a> followed by a newline, cutting the line right after the URL. Both seds give you each URL on a single line, but there is still garbage, so
- the 2nd grep href cleans the mess up.
- The sort and uniq give you one instance of each URL present in sourcepage.html.
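To run the same pipeline directly on the wget output from the question instead of a saved sourcepage.html (a sketch; linkname stands for the actual URL):

wget --quiet -O - linkname |
  grep "<a href=" |
  sed "s/<a href/\n<a href/g" |
  sed 's/\"/\"><\/a>\n/2' |
  grep href |
  sort | uniq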

