Disclaimer: This page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1881237/
Easiest way to extract the urls from an html page using sed or awk only
Asked by codaddict
I want to extract the URL from within the anchor tags of an html file. This needs to be done in BASH using SED/AWK. No perl please.
What is the easiest way to do this?
Answered by Hardy
You could also do something like this (provided you have lynx installed)...
Lynx versions < 2.8.8
lynx -dump -listonly my.html
Lynx versions >= 2.8.8 (courtesy of @condit)
lynx -dump -hiddenlinks=listonly my.html
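If you only want the URLs themselves, a little post-processing helps. A minimal sketch, assuming lynx's usual numbered "References" dump format:
# a minimal sketch, assuming lynx's numbered "References" output; prints one URL per line
lynx -dump -listonly my.html | awk '/^[[:space:]]*[0-9]+\./ {print $2}' | sort -u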
Answered by Greg Bacon
You asked for it:
$ wget -O - http://stackoverflow.com | \
grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
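For comparison, a rough sketch of the same idea with less quoting gymnastics; it assumes GNU grep/sed, double-quoted href attributes only, and a locally saved page.html:
# a rough sketch, assuming GNU grep/sed, double-quoted hrefs, and a saved page.html
grep -Eio '<a [^>]*href="[^"]*"' page.html |
  sed -E 's/.*href="([^"]*)"/\1/I'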
Answered by kerkael
grep "<a href=" sourcepage.html
|sed "s/<a href/\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
- The first grep looks for lines containing URLs. If you only want local pages, you can add more filtering after it so that absolute http links are dropped and only relative paths remain (see the sketch after this list).
- The first sed adds a newline in front of each <a href tag.
- The second sed shortens each URL by replacing everything from the second " on the line with a closing </a> tag and a newline. Both seds leave each URL on its own line, but there is still garbage, so
- the second grep href cleans the mess up, and
- sort and uniq give you one instance of each URL present in sourcepage.html.
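As mentioned in the first bullet, restricting the result to local links only takes one extra filter. A minimal sketch, dropping absolute http(s) links at the end of the same pipeline:
# a minimal sketch: the same pipeline, with absolute http(s) links filtered out
grep "<a href=" sourcepage.html |
  sed "s/<a href/\n<a href/g" |
  sed 's/\"/\"><\/a>\n/2' |
  grep href |
  grep -v 'href="http' |
  sort | uniq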
Answered by Ingo Karkat
With the Xidel - HTML/XML data extraction tool, this can be done via:
$ xidel --extract "//a/@href" http://example.com/
With conversion to absolute URLs:
$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
Answered by ghostdog74
An example, since you didn't provide any sample
awk 'BEGIN{
  RS="</a>"      # treat each closing anchor tag as the record separator
  IGNORECASE=1
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)   # strip everything up to and including href="  (\042 is the double quote)
      gsub(/\042.*/,"",$o)        # strip everything from the closing quote onwards
      print $(o)
    }
  }
}' index.html
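A quick check against a throwaway snippet (the sample HTML below is made up). Note that IGNORECASE and a regex record separator are GNU awk extensions, so the sketch assumes gawk:
# hypothetical sample input; assumes GNU awk (gawk) for IGNORECASE and regex RS
echo '<A HREF="https://example.com/a">a</a><a href="https://example.com/b">b</a>' |
gawk 'BEGIN{RS="</a>";IGNORECASE=1}
{for(o=1;o<=NF;o++)if($o~/href/){gsub(/.*href=\042/,"",$o);gsub(/\042.*/,"",$o);print $o}}'
# prints https://example.com/a and https://example.com/b, one per line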
Answered by Crisboot
I made a few changes to Greg Bacon's solution
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
This fixes two problems:
- We now match anchors that don't have href as their first attribute
- We cover the possibility of several anchors on the same line (see the sketch after this list)
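A quick way to see both fixes at work, using a made-up one-line snippet containing an anchor whose href is not the first attribute plus a second anchor on the same line:
# hypothetical one-line sample with two anchors, piped through the command above
printf '<p><a id="x" href="https://a.example/">a</a> <a href="https://b.example/">b</a></p>\n' |
  grep -o '<a .*href=.*>' |
  sed -e 's/<a /\n<a /g' |
  sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
# expected output: https://a.example/ and https://b.example/, one per line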
Answered by Alok Singhal
I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.
OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
Answered by nes1983
You can do it quite easily with the following regex, which is quite good at finding URLs:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
I took it from John Gruber's article on how to find URLs in text.
That lets you find all URLs in a file f.html as follows:
cat f.html | grep -o \
-E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
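One caveat: the (?:...) group is Perl-style rather than POSIX ERE, so some grep builds reject it under -E; if that happens, GNU grep's -P (PCRE) mode should accept the pattern unchanged:
# fallback sketch using PCRE mode, in case -E rejects the (?: ... ) group
grep -oP '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' f.html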
Answered by Brad Parks
In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)
$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
for example:
$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
generates
//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
Answered by Brad Parks
Go over it with a first pass, replacing the start of each URL (http) with a newline (\nhttp). Then you have guaranteed that each link starts at the beginning of a line and is the only URL on that line.
The rest should be easy, here is an example:
sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"
sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"
alias lsurls='_(){ sed "s/http/\nhttp/g" "${1}" | sed -n "s/\(^http[s]*:[a-zA-Z0-9/.=?_-]*\)\(.*\)/\1/p"; }; _'
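Usage of the alias might look like this (page.html is a hypothetical saved file; process substitution works too, since the alias just treats its first argument as a file to read):
# hypothetical usage of the lsurls alias defined above
lsurls page.html
lsurls <(curl -s "http://www.cnn.com")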