Disclaimer: This page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1881237/
Easiest way to extract the urls from an html page using sed or awk only
Asked by codaddict
I want to extract the URL from within the anchor tags of an html file. This needs to be done in BASH using SED/AWK. No perl please.
What is the easiest way to do this?
Answered by Hardy
You could also do something like this (provided you have lynx installed)...
Lynx versions < 2.8.8
lynx -dump -listonly my.html
Lynx versions >= 2.8.8 (courtesy of @condit)
lynx -dump -hiddenlinks=listonly my.html
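If you only want the URLs themselves, a little post-processing helps. A minimal sketch, assuming lynx's usual numbered "References" dump format:
# a minimal sketch, assuming lynx's numbered "References" output; prints one URL per line
lynx -dump -listonly my.html | awk '/^[[:space:]]*[0-9]+\./ {print $2}' | sort -u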
Answered by Greg Bacon
You asked for it:
$ wget -O - http://stackoverflow.com | \
grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
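For comparison, a rough sketch of the same idea with less quoting gymnastics; it assumes GNU grep/sed, double-quoted href attributes only, and a locally saved page.html:
# a rough sketch, assuming GNU grep/sed, double-quoted hrefs, and a saved page.html
grep -Eio '<a [^>]*href="[^"]*"' page.html |
  sed -E 's/.*href="([^"]*)"/\1/I'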
Answered by kerkael
grep "<a href=" sourcepage.html
|sed "s/<a href/\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
- The first grep looks for lines containing URLs. If you only want local pages, you can add more filtering after it so that absolute http links are dropped and only relative paths remain (see the sketch after this list).
- The first sed adds a newline in front of each <a href tag.
- The second sed shortens each URL by replacing everything from the second " on the line with a closing </a> tag and a newline. Both seds leave each URL on its own line, but there is still garbage, so
- the second grep href cleans the mess up, and
- sort and uniq give you one instance of each URL present in sourcepage.html.
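As mentioned in the first bullet, restricting the result to local links only takes one extra filter. A minimal sketch, dropping absolute http(s) links at the end of the same pipeline:
# a minimal sketch: the same pipeline, with absolute http(s) links filtered out
grep "<a href=" sourcepage.html |
  sed "s/<a href/\n<a href/g" |
  sed 's/\"/\"><\/a>\n/2' |
  grep href |
  grep -v 'href="http' |
  sort | uniq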
Answered by Ingo Karkat
With the Xidel - HTML/XML data extraction tool, this can be done via:
$ xidel --extract "//a/@href" http://example.com/
With conversion to absolute URLs:
$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
Answered by ghostdog74
An example, since you didn't provide any sample
awk 'BEGIN{
  RS="</a>"      # treat each closing anchor tag as the record separator
  IGNORECASE=1
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)   # strip everything up to and including href="  (\042 is the double quote)
      gsub(/\042.*/,"",$o)        # strip everything from the closing quote onwards
      print $(o)
    }
  }
}' index.html
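A quick check against a throwaway snippet (the sample HTML below is made up). Note that IGNORECASE and a regex record separator are GNU awk extensions, so the sketch assumes gawk:
# hypothetical sample input; assumes GNU awk (gawk) for IGNORECASE and regex RS
echo '<A HREF="https://example.com/a">a</a><a href="https://example.com/b">b</a>' |
gawk 'BEGIN{RS="</a>";IGNORECASE=1}
{for(o=1;o<=NF;o++)if($o~/href/){gsub(/.*href=\042/,"",$o);gsub(/\042.*/,"",$o);print $o}}'
# prints https://example.com/a and https://example.com/b, one per line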
Answered by Crisboot
I made a few changes to Greg Bacon's solution
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
This fixes two problems:
- We now match anchors that don't have href as their first attribute
- We cover the possibility of several anchors on the same line (see the sketch after this list)
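A quick way to see both fixes at work, using a made-up one-line snippet containing an anchor whose href is not the first attribute plus a second anchor on the same line:
# hypothetical one-line sample with two anchors, piped through the command above
printf '<p><a id="x" href="https://a.example/">a</a> <a href="https://b.example/">b</a></p>\n' |
  grep -o '<a .*href=.*>' |
  sed -e 's/<a /\n<a /g' |
  sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
# expected output: https://a.example/ and https://b.example/, one per line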
Answered by Alok Singhal
I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.
OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
Answered by nes1983
You can do it quite easily with the following regex, which is quite good at finding URLs:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
I took it from John Gruber's article on how to find URLs in text.
That lets you find all URLs in a file f.html as follows:
cat f.html | grep -o \
-E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
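One caveat: the (?:...) group is Perl-style rather than POSIX ERE, so some grep builds reject it under -E; if that happens, GNU grep's -P (PCRE) mode should accept the pattern unchanged:
# fallback sketch using PCRE mode, in case -E rejects the (?: ... ) group
grep -oP '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' f.html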
Answered by Brad Parks
In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)
$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
for example:
$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
generates
//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
Answered by Brad Parks
Go over it with a first pass, replacing the start of each URL (http) with a newline (\nhttp). Then you have guaranteed that each link starts at the beginning of a line and is the only URL on that line.
The rest should be easy, here is an example:
sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"
sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"
alias lsurls='_(){ sed "s/http/\nhttp/g" "${1}" | sed -n "s/\(^http[s]*:[a-zA-Z0-9/.=?_-]*\)\(.*\)/\1/p"; }; _'
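Usage of the alias might look like this (page.html is a hypothetical saved file; process substitution works too, since the alias just treats its first argument as a file to read):
# hypothetical usage of the lsurls alias defined above
lsurls page.html
lsurls <(curl -s "http://www.cnn.com")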