Note: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute it to the original authors at StackOverflow. Original: http://stackoverflow.com/questions/773340/


Can you provide examples of parsing HTML?

Tags: html, language-agnostic, html-parsing

Asked by Chas. Owens

How do you parse HTML with a variety of languages and parsing libraries?

When answering:

Individual answers here will be linked to from answers to questions about how to parse HTML with regexes, as a way of showing the right way to do things.

For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format:

Language: [language name]
Library: [library name]

[example code]

Please make the library name a link to the library's documentation. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

Answered by Ward Werbrouck

Language: JavaScript
Library: jQuery

$.each($('a[href]'), function(){
    console.debug(this.href);
});

(using firebug console.debug for output...)

And loading any html page:

$.get('http://stackoverflow.com/', function(page){
     $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});

This one uses the .each() method instead of $.each(); I think it's cleaner when chaining methods.

Answered by alexn

Language: C#
Library: HtmlAgilityPack

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");

        foreach (var node in nodes)
        {
            // Print the href value itself, not the link's inner HTML
            Console.WriteLine(node.GetAttributeValue("href", string.Empty));
        }
    }
}

Answered by Paolo Bergantino

language: Python
library: BeautifulSoup

from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
print links  

output:

输出:

[<a href="http://foo.com">foo</a>,
 <a href="http://bar.com">bar</a>,
 <a href="http://baz.com">baz</a>]

also possible:

也可能:

for link in links:
    print link['href']

output:

输出:

http://foo.com
http://bar.com
http://baz.com
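
The example above uses the old BeautifulSoup 3 API under Python 2. For reference, a minimal sketch of the same parse with the current bs4 package under Python 3 (assuming bs4 is installed) might look like this:

```python
# Sketch of the same parse with BeautifulSoup 4 (the bs4 package) on Python 3.
# Assumes bs4 is installed: pip install beautifulsoup4
from bs4 import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

# bs4 wants an explicit parser name; "html.parser" is the stdlib one
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a', href=True)  # findAll was renamed to find_all in bs4

for link in links:
    print(link['href'])
```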

Answered by draegtun

Language: Perl
Library: pQuery

use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {  
        my $at = $_->getAttribute( 'href' ); 
        print "$at\n" if defined $at;
    }
);

Answered by draegtun

language: shell
library: lynx (well, it's not a library, but in shell, every program is a kind of library)

lynx -dump -listonly http://news.google.com/

Answered by Pesto

language: Ruby
library: Hpricot

#!/usr/bin/ruby

require 'hpricot'

html = '<html><body>'
['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'

doc = Hpricot(html)
doc.search('//a').each {|elm| puts elm.attributes['href'] }

Answered by Chas. Owens

language: Python
library: HTMLParser

#!/usr/bin/python

from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']


find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
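
The HTMLParser module shown here is Python 2; in Python 3 the same class lives in html.parser. A sketch of the equivalent Python 3 version (collecting the hrefs in a list as well as printing them):

```python
from html.parser import HTMLParser  # Python 3 location of the old HTMLParser class

class FindLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []  # keep the hrefs so callers can inspect them too

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            self.found.append(at['href'])
            print(at['href'])

find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
```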

Answered by Chas. Owens

language: Perl
library: HTML::Parser

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and exists $attr->{href}) {
                print "$attr->{href}\n";
            }
        }, 
        "tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);

Answered by Chas. Owens

Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that you have modules for very specific tasks, like link extraction.

Whole program:

#!/usr/bin/perl -w
use strict;

use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p       = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;

    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };

    print "- $attr{'href'}\n";
    return;
}

Explanation:

  • use strict - turns on "strict" mode - eases potential debugging; not fully relevant to the example
  • use HTML::LinkExtor - loads the interesting module
  • use LWP::Simple - just a simple way to get some html for tests
  • my $url = 'http://www.google.com/' - which page we will be extracting urls from
  • my $content = get( $url ) - fetches the page html
  • my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function that will be used as a callback on every url, and $url to use as the BASEURL for relative urls
  • $p->parse( $content ) - pretty obvious I guess
  • exit - end of program
  • sub process_link - beginning of the function process_link
  • my ($tag, %attr) - get the arguments, which are the tag name and its attributes
  • return unless $tag eq 'a' - skip processing if the tag is not <a>
  • return unless defined $attr{'href'} - skip processing if the <a> tag doesn't have an href attribute
  • print "- $attr{'href'}\n"; - pretty obvious I guess :)
  • return; - finish the function

That's all.

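
The BASEURL behaviour described above — HTML::LinkExtor resolving relative hrefs against the page URL passed to new — can be sketched in Python with the standard library's urljoin; the page URL below is just an illustrative assumption:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect hrefs, resolving relative URLs against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            # urljoin leaves absolute URLs untouched and resolves relative ones
            self.links.append(urljoin(self.base_url, at['href']))

# Hypothetical page URL, used only for illustration
p = LinkExtractor('http://www.example.com/dir/page.html')
p.feed('<a href="other.html">x</a><a href="http://foo.com/">y</a>')
print(p.links)  # ['http://www.example.com/dir/other.html', 'http://foo.com/']
```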

Answered by Jules Glegg

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"