How can I extract URL and link text from HTML in Perl?

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/254345/

Date: 2020-08-28 22:38:00  Source: igfitidea

Tags: html, perl, parsing, url, cpan

Asked by Andy Lester

I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.

If the page contained these links:

<a href="http://www.google.com">Google</a>

<a href="http://www.apple.com">Apple</a>

The output would be:

Google, http://www.google.com
Apple, http://www.apple.com

What is the best way to do this in Perl?

Answered by Andy Lester

Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work-with lists of URLs.

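use WWW::Mechanize;

# $some_url is a placeholder for the address of the page to fetch.
my $mech = WWW::Mechanize->new();
$mech->get( $some_url );

# links() returns one WWW::Mechanize::Link object per link on the page.
my @links = $mech->links();
for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}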

Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.

Mech is basically a browser in an object.
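
To illustrate the point about navigating to other URLs, here is a minimal sketch that follows a link by its text. The two local pages (`index.html` and `apple.html`), their contents, and the temporary directory are assumptions made up for this example so that it runs without network access; against a real site you would pass an ordinary http URL to `get` instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use URI::file;
use WWW::Mechanize;

# Two hypothetical local pages stand in for a real site (assumption for this sketch).
my $dir   = tempdir( CLEANUP => 1 );
my %pages = (
    'index.html' => '<html><head><title>Start</title></head><body>'
                  . '<a href="apple.html">Apple</a></body></html>',
    'apple.html' => '<html><head><title>Apple page</title></head>'
                  . '<body>Hello</body></html>',
);
for my $name ( keys %pages ) {
    open my $fh, '>', "$dir/$name" or die "Cannot write $name: $!";
    print {$fh} $pages{$name};
    close $fh;
}

my $mech = WWW::Mechanize->new;
$mech->get( URI::file->new("$dir/index.html")->as_string );

# follow_link() picks the first link whose text is exactly "Apple" and fetches it.
$mech->follow_link( text => 'Apple' );

my $title = $mech->title;
print "Now at: $title\n";
```

`follow_link` also accepts criteria such as `text_regex` or `url_regex` when an exact text match is too strict.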

Answered by Sherm Pendley

Have a look at HTML::LinkExtractor and at HTML::LinkExtor, which is part of the HTML::Parser package.

HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.
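
As a rough sketch of what HTML::LinkExtor gives you, the example below parses the two links from the question. The inline `$html` string is an assumption for illustration. Each entry returned by `links` is an array reference holding the tag name followed by attribute/URL pairs, so the URL is available but the link text is not.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Sample input taken from the question; a real program would fetch or read it.
my $html = <<'HTML';
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
HTML

# With no callback, the parser collects links internally for links() to return.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

my @urls;
for my $link ( $parser->links ) {
    my ( $tag, %attr ) = @$link;    # e.g. ( 'a', href => 'http://...' )
    next unless $tag eq 'a' && defined $attr{href};
    push @urls, "$attr{href}";
}

print "$_\n" for @urls;
```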

Answered by Aaron Graves

If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):

#!/usr/bin/perl

# Require the URL as the first command-line argument.
if ($#ARGV < 0) {
  print "$0: Need URL argument.\n";
  exit 1;
}

# Fetch the page with wget and keep only the lines that contain an anchor tag.
my @content = split(/\n/, `wget -qO- $ARGV[0]`);
my @links = grep(/<a.*href=.*>/, @content);

foreach my $c (@links) {
  $c =~ /<a.*href="([\s\S]+?)".*>/;
  my $link = $1;
  $c =~ /<a.*href.*>([\s\S]+?)<\/a>/;
  my $title = $1;
  print "$title, $link\n";
}

There are likely a few things I did wrong here, but it works in a handful of test cases I tried after writing it (it doesn't account for things like <img> tags, etc).

Answered by draegtun

I like using pQuery for things like this...

use feature 'say';
use pQuery;

pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
    sub {
        say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
    }
);

Also check out this previous stackoverflow.com question, "Emulation of lex like functionality in Perl or Python", for similar answers.

Answered by Alexandr Ciornii

Another way to do this is to use XPath to query parsed HTML. It is needed in complex cases, such as extracting all links in a div with a specific class. Use HTML::TreeBuilder::XPath for this.

use HTML::TreeBuilder::XPath;

# $c holds the HTML content to parse.
my $tree = HTML::TreeBuilder::XPath->new_from_content($c);
my $nodes = $tree->findnodes(q{//map[@name='map1']/area});
while (my $node = $nodes->shift) {
  my $t = $node->attr('title');
}

Answered by Ashley

Previous answers were perfectly good and I know I'm late to the party but this got bumped in the [perl] feed so…

XML::LibXML is excellent for HTML parsing and unbeatable for speed. Set the recover option when parsing badly formed HTML.

use XML::LibXML;

my $doc = XML::LibXML->load_html(IO => \*DATA);
for my $anchor ( $doc->findnodes("//a[\@href]") )
{
    printf "%15s -> %s\n",
        $anchor->textContent,
        $anchor->getAttribute("href");
}

__DATA__
<html><head><title/></head><body>
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
</body></html>

–yields–

     Google -> http://www.google.com
      Apple -> http://www.apple.com

Answered by cjm

Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.

Andy recommended WWW::Mechanize. That's probably the best solution.

If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It will build a DOM-like tree out of the HTML, which you can then search for the links you want and extract any nearby content you want.

Answered by ysth

Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.

Answered by user13107

HTML::LinkExtractor is better than HTML::LinkExtor.

It can give both link text and URL.

Usage:

use HTML::LinkExtractor;

my $input = q{If <a href="http://apple.com/"> Apple </a>}; # HTML string
my $LX = HTML::LinkExtractor->new(undef, undef, 1);
$LX->parse($input);
for my $Link ( @{ $LX->links } ) {
    if ( $$Link{_TEXT} =~ m/Apple/ ) {
        print "\n LinkText $$Link{_TEXT} URL $$Link{href}\n";
    }
}

Answered by converter42

HTML is a structured markup language that has to be parsed to extract its meaning without errors. The module Sherm listed will parse the HTML and extract the links for you. Ad hoc regular expression-based solutions might be acceptable if you know that your inputs will always be formed the same way (don't forget attributes), but a parser is almost always the right answer for processing structured text.
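
A small sketch of why ad hoc patterns are fragile: the pattern below assumes double-quoted attributes, so it misses a link written with single quotes, while a parser handles both forms. The sample `$html` string and the deliberately naive pattern are assumptions for illustration, and HTML::TreeBuilder is used here only as one example of a parser.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Valid HTML, but the attributes use single quotes.
my $html = q{<p><a class='ext' href='http://www.apple.com'>Apple</a></p>};

# Naive pattern: only matches href values wrapped in double quotes.
my @regex_hits = $html =~ /<a.*href="([^"]+)"/g;

# Parser-based extraction: collect [ text, url ] for every <a> with an href.
my $tree   = HTML::TreeBuilder->new_from_content($html);
my @parsed = map  { [ $_->as_text, $_->attr('href') ] }
             grep { defined $_->attr('href') }
             $tree->look_down( _tag => 'a' );
$tree->delete;    # free the parse tree once the strings are copied out

printf "regex found %d link(s), parser found %d\n",
    scalar @regex_hits, scalar @parsed;
print "$_->[0], $_->[1]\n" for @parsed;
```

The pattern could be extended to accept single quotes, but unquoted values, attributes split across lines, and entities would each need another special case, which is the work a parser already does.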
