Note: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute it to the original authors at StackOverflow. Original: http://stackoverflow.com/questions/773340/


Can you provide examples of parsing HTML?

Tags: html, language-agnostic, html-parsing

Asked by Chas. Owens

How do you parse HTML with a variety of languages and parsing libraries?

When answering:

Individual answers here will be linked to from answers to questions about how to parse HTML with regexes, as a way of showing the right way to do things.

For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format:

Language: [language name]
Library: [library name]

[example code]

Please make the library name a link to the library's documentation. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

Answered by Ward Werbrouck

Language: JavaScript
Library: jQuery

$.each($('a[href]'), function(){
    console.debug(this.href);
});

(using firebug console.debug for output...)

And loading any html page:

$.get('http://stackoverflow.com/', function(page){
     $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});

This one uses the .each() method instead of $.each(); I think it's cleaner when chaining methods.

Answered by alexn

Language: C#
Library: HtmlAgilityPack

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");

        foreach (var node in nodes)
        {
            // Print the href value itself, not the link's inner HTML
            Console.WriteLine(node.GetAttributeValue("href", string.Empty));
        }
    }
}

Answered by Paolo Bergantino

language: Python
library: BeautifulSoup

from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
print links  

output:

输出:

[<a href="http://foo.com">foo</a>,
 <a href="http://bar.com">bar</a>,
 <a href="http://baz.com">baz</a>]

also possible:

也可能:

for link in links:
    print link['href']

output:

输出:

http://foo.com
http://bar.com
http://baz.com
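
The example above uses the old BeautifulSoup 3 API under Python 2. For reference, a minimal sketch of the same parse with the current bs4 package under Python 3 (assuming bs4 is installed) might look like this:

```python
# Sketch of the same parse with BeautifulSoup 4 (the bs4 package) on Python 3.
# Assumes bs4 is installed: pip install beautifulsoup4
from bs4 import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

# bs4 wants an explicit parser name; "html.parser" is the stdlib one
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a', href=True)  # findAll was renamed to find_all in bs4

for link in links:
    print(link['href'])
```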

Answered by draegtun

Language: Perl
Library: pQuery

use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {  
        my $at = $_->getAttribute( 'href' ); 
        print "$at\n" if defined $at;
    }
);

Answered by draegtun

language: shell
library: lynx (well, it's not a library, but in shell, every program is a kind of library)

lynx -dump -listonly http://news.google.com/

Answered by Pesto

language: Ruby
library: Hpricot

#!/usr/bin/ruby

require 'hpricot'

html = '<html><body>'
['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'

doc = Hpricot(html)
doc.search('//a').each {|elm| puts elm.attributes['href'] }

Answered by Chas. Owens

language: Python
library: HTMLParser

#!/usr/bin/python

from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']


find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
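
The HTMLParser module shown here is Python 2; in Python 3 the same class lives in html.parser. A sketch of the equivalent Python 3 version (collecting the hrefs in a list as well as printing them):

```python
from html.parser import HTMLParser  # Python 3 location of the old HTMLParser class

class FindLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []  # keep the hrefs so callers can inspect them too

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            self.found.append(at['href'])
            print(at['href'])

find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
```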

Answered by Chas. Owens

language: Perl
library: HTML::Parser

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and exists $attr->{href}) {
                print "$attr->{href}\n";
            }
        }, 
        "tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);

Answered by Chas. Owens

Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that you have modules for very specific tasks, like link extraction.

Whole program:

#!/usr/bin/perl -w
use strict;

use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p       = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;

    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };

    print "- $attr{'href'}\n";
    return;
}

Explanation:

  • use strict - turns on "strict" mode - eases potential debugging; not fully relevant to the example
  • use HTML::LinkExtor - loads the interesting module
  • use LWP::Simple - just a simple way to get some html for tests
  • my $url = 'http://www.google.com/' - which page we will be extracting urls from
  • my $content = get( $url ) - fetches the page html
  • my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function that will be used as a callback on every url, and $url to use as the BASEURL for relative urls
  • $p->parse( $content ) - pretty obvious I guess
  • exit - end of program
  • sub process_link - beginning of the function process_link
  • my ($tag, %attr) - get the arguments, which are the tag name and its attributes
  • return unless $tag eq 'a' - skip processing if the tag is not <a>
  • return unless defined $attr{'href'} - skip processing if the <a> tag doesn't have an href attribute
  • print "- $attr{'href'}\n"; - pretty obvious I guess :)
  • return; - finish the function

That's all.

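
The BASEURL behaviour described above — HTML::LinkExtor resolving relative hrefs against the page URL passed to new — can be sketched in Python with the standard library's urljoin; the page URL below is just an illustrative assumption:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect hrefs, resolving relative URLs against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            # urljoin leaves absolute URLs untouched and resolves relative ones
            self.links.append(urljoin(self.base_url, at['href']))

# Hypothetical page URL, used only for illustration
p = LinkExtractor('http://www.example.com/dir/page.html')
p.feed('<a href="other.html">x</a><a href="http://foo.com/">y</a>')
print(p.links)  # ['http://www.example.com/dir/other.html', 'http://foo.com/']
```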

Answered by Jules Glegg

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"