Disclaimer: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/21264626/

Date: 2020-09-10 00:38:02  Source: igfitidea

How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file

Tags: bash, shell, awk, grep, cut

Asked by A'sa Dickens

I have a file that is HTML, and it has about 150 anchor tags. I need only the links from those tags, AKA <a href="*http://www.google.com*"></a>. I want to get only the http://www.google.com part.

When I run a grep,

cat website.htm | grep -E '<a href=".*">' > links.txt

this returns to me the entire line that the match was found on, not just the link I want, so I tried using a cut command:

cat drawspace.txt | grep -E '<a href=".*">' | cut -d'”' --output-delimiter=$'\n' > links.txt

Except that it is wrong and doesn't work; it gives me some error about wrong parameters... So I assume that the file was supposed to be passed along too. Maybe like cut -d'”' --output-delimiter=$'\n' grepedText.txt > links.txt.

But I wanted to do this in one command if possible... So I tried doing an AWK command.

cat drawspace.txt | grep '<a href=".*">' | awk '{print }'

But this wouldn't run either. It was asking me for more input, because I wasn't finished....

I tried writing a batch file, and it told me FINDSTR is not an internal or external command... So I assumed my environment variables were messed up, and rather than fix that I tried installing grep on Windows, but that gave me the same error...

The question is, what is the right way to strip out the HTTP links from HTML? With that I will make it work for my situation.

P.S. I've read so many links/Stack Overflow posts that showing my references would take too long.... If example HTML is needed to show the complexity of the process then I will add it.

I also have a Mac and a PC, and I switch back and forth between them to use their shell/batch/grep/terminal commands, so either one will help me.

I also want to point out that I'm in the correct directory.


HTML:

<tr valign="top">
    <td class="beginner">
      B03&nbsp;&nbsp;
    </td>
    <td>
        <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a>  </td>
</tr>

<tr valign="top">
  <td class="beginner">
    B04&nbsp;&nbsp;
  </td>
  <td>
      <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
      B05&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
        B06&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>

Expected output:

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
etc.

Answered by Ed Morton

$ sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
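
As a quick sanity check, the same substitution can be run on a single sample line: -n suppresses sed's default printing, \([^"]*\) captures everything between the quotes after href=", and the trailing /p prints only lines where the substitution succeeded.

```shell
# Demo of the sed approach on one sample anchor line
echo '<a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a>' \
  | sed -n 's/.*href="\([^"]*\)".*/\1/p'
# prints: http://www.drawspace.com/lessons/b03/simple-symmetry
```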

Answered by fedorqui 'SO stop harming'

You can use grep for this:

grep -Po '(?<=href=")[^"]*' file

It prints everything after href=" until a new double quote appears.

With your given input it returns:

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values

Note that it is not necessary to write cat drawspace.txt | grep '<a href=".*">'; you can get rid of the useless use of cat with grep '<a href=".*">' drawspace.txt.

Another example

另一个例子

$ cat a
hello <a href="httafasdf">asdas</a>
hello <a href="hello">asdas</a>
other things

$ grep -Po '(?<=href=")[^"]*' a
httafasdf
hello
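
One caveat: -P (Perl-compatible regex, needed for the lookbehind) is a GNU grep feature, and the BSD grep shipped with macOS may not support it. A rough portable approximation (an illustrative sketch, not part of the original answer) is to match the whole href="..." attribute with -o and then split on the double quote:

```shell
# Match href="..." attributes, then split on the double quote to keep the value
printf '%s\n' 'hello <a href="httafasdf">asdas</a>' 'hello <a href="hello">asdas</a>' \
  | grep -o 'href="[^"]*"' | cut -d'"' -f2
# prints:
# httafasdf
# hello
```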

Answered by Michael

My guess is your PC or Mac will not have the lynx command installed by default (it's available for free on the web), but lynx will let you do things like this:

$ lynx -dump -image_links -listonly /usr/share/xdiagnose/workloads/youtube-reload.html

Output: References

  1. file://localhost/usr/share/xdiagnose/workloads/youtube-reload.html
  2. http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1

It is then a simple matter to grep for the http: lines. And there may even be lynx options to print just the http: lines (lynx has many, many options).
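
For instance, filtering the numbered reference list down to just the http(s) URLs could look like the following (sketched here on simulated lynx output, since lynx may not be installed by default):

```shell
# Simulate two lines of "lynx -dump -listonly" output and keep only http(s) URLs;
# the second whitespace-separated field of each reference line is the URL itself
printf '  1. file://localhost/usr/share/xdiagnose/workloads/youtube-reload.html\n  2. http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1\n' \
  | awk '$2 ~ /^https?:/ {print $2}'
# prints: http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1
```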

Answered by kvantour

As per the comment by tripleee, using regex to parse HTML or XML files is essentially not done. Tools such as sed and awk are extremely powerful for handling text files, but when it boils down to parsing complex-structured data (such as XML, HTML, JSON, ...) they are nothing more than a sledgehammer. Yes, you can get the job done, but sometimes at a tremendous cost. For handling such delicate files, you need a bit more finesse by using a more targeted set of tools.

In case of parsing XML or HTML, one can easily use xmlstarlet.

In case of an XHTML file, you can use:

xmlstarlet sel --html  -N "x=http://www.w3.org/1999/xhtml" \
               -t -m '//x:a/@href' -v . -n

where -N gives the XHTML namespace, if any; this is recognized by

<html xmlns="http://www.w3.org/1999/xhtml">

However, as HTML pages are often not well-formed XML, it might be handy to clean them up a bit using tidy. In the example case above this then gives:

$ tidy -q -numeric -asxhtml --show-warnings no <file.html> \
  | xmlstarlet sel --html -N "x=http://www.w3.org/1999/xhtml" \
                   -t -m '//x:a/@href' -v . -n
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values

Answered by Sathish

Use grep to extract all the lines with links in them, and then use sed to pull out the URLs:

grep -o '<a href=".*">' *.html | sed 's/\(<a href="\|\">\)//g' > link.txt;
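
A quick check of this pipeline on a single sample line (note that the \| alternation used in the sed expression is a GNU sed extension of basic regular expressions, so this may not work with BSD sed on macOS):

```shell
# Run the grep+sed pipeline from the answer on one sample anchor line:
# grep -o keeps only the <a href="...">  match, sed strips the surrounding markup
echo '<td><a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a></td>' \
  | grep -o '<a href=".*">' | sed 's/\(<a href="\|\">\)//g'
# prints: http://www.drawspace.com/lessons/b04/faces-and-a-vase
```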