Note: this page is a side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not the translator): Stack Overflow
Original source: http://stackoverflow.com/questions/773340/
Can you provide examples of parsing HTML?
Asked by Chas. Owens
How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked to in answers to questions about how to parse HTML with regexes as a way of showing the right way to do things.
For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make this question easy to search, I ask that you follow this format:
Language: [language name]
Library: [library name]
[example code]
Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:
Purpose: [what the parse does]
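To illustrate why the question steers people from regexes toward parsers, here is a minimal Python 3 sketch comparing the two on the same input. The sample markup, the example.com URLs, the regex, and the LinkCollector class are all assumptions made up for this illustration; none of them come from the original question.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: one commented-out link, one link with a
# single-quoted href that is not the first attribute, one plain link.
HTML = """<html><body>
<!-- <a href="http://old.example.com">commented out</a> -->
<a class="x" href='http://foo.example.com'>foo</a>
<a href="http://bar.example.com">bar</a>
</body></html>"""

# A naive regex: it expects href to be the first attribute, expects
# double quotes, and knows nothing about comments.
regex_links = re.findall(r'<a\s+href="([^"]*)"', HTML)


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(HTML)
collector.close()
parser_links = collector.links

print("regex :", regex_links)
print("parser:", parser_links)
```

On this input the regex reports the commented-out link and misses the single-quoted one, while the parser reports exactly the two live links. A more elaborate regex could patch these two cases, but each patch tends to expose another corner of HTML syntax.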
Answered by Ward Werbrouck
Language: JavaScript
Library: jQuery
$.each($('a[href]'), function(){
    console.debug(this.href);
});
(using Firebug's console.debug for output...)
And loading any HTML page:
$.get('http://stackoverflow.com/', function(page){
    $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});
I used the other form of each for this one; I think it's cleaner when chaining methods.
Answered by alexn
Language: C#
Library: HtmlAgilityPack
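using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");
        // SelectNodes returns null when nothing matches the XPath
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null) return;
        foreach (var node in nodes)
        {
            // Print each anchor's href attribute
            Console.WriteLine(node.GetAttributeValue("href", ""));
        }
    }
}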
Answered by Paolo Bergantino
Language: Python
Library: BeautifulSoup
from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
print links
Output:
[<a href="http://foo.com">foo</a>,
<a href="http://bar.com">bar</a>,
<a href="http://baz.com">baz</a>]
Also possible:
for link in links:
    print link['href']
Output:
http://foo.com
http://bar.com
http://baz.com
Answered by draegtun
Language: Perl
Library: pQuery
use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {
        my $at = $_->getAttribute( 'href' );
        print "$at\n" if defined $at;
    }
);
Answered by draegtun
Answered by Pesto
Answered by Chas. Owens
Language: Python
Library: HTMLParser
#!/usr/bin/python
from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']

find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
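The example above targets Python 2 (the print statement and the HTMLParser module name). In Python 3 the same class lives in html.parser, so a rough port might look like the sketch below. The `found` list is an addition of this sketch, not part of the original answer; it only makes the result easier to inspect.

```python
from html.parser import HTMLParser


class FindLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []  # added in this sketch: keeps every href seen

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            self.found.append(at['href'])


# Same test document as the Python 2 example
html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find = FindLinks()
find.feed(html)
find.close()

for href in find.found:
    print(href)
```

Collecting into a list instead of printing inside the handler keeps parsing and output separate, which is a design choice of the sketch rather than something the parser requires.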
Answered by Chas. Owens
Language: Perl
Library: HTML::Parser
#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and exists $attr->{href}) {
                print "$attr->{href}\n";
            }
        },
        "tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);
Answered by Chas. Owens
Language: Perl
Library: HTML::LinkExtor
The beauty of Perl is that it has modules for very specific tasks, such as link extraction.
Whole program:
#!/usr/bin/perl -w
use strict;

use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;

    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };

    print "- $attr{'href'}\n";
    return;
}
Explanation:
- use strict - turns on "strict" mode, which eases potential debugging; not fully relevant to the example
- use HTML::LinkExtor - loads the module of interest
- use LWP::Simple - just a simple way to get some HTML for tests
- my $url = 'http://www.google.com/' - the page we will be extracting URLs from
- my $content = get( $url ) - fetches the page's HTML
- my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function that will be used as a callback for every URL, and $url to use as the base URL for relative URLs
- $p->parse( $content ) - pretty obvious, I guess
- exit - end of the program
- sub process_link - beginning of the function process_link
- my ($tag, %attr) - gets the arguments, which are the tag name and its attributes
- return unless $tag eq 'a' - skips processing if the tag is not <a>
- return unless defined $attr{'href'} - skips processing if the <a> tag doesn't have an href attribute
- print "- $attr{'href'}\n"; - pretty obvious, I guess :)
- return; - finishes the function
That's all.
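For readers who don't use Perl, the callback-plus-base-URL idea from the explanation can be sketched in Python 3 with only the standard library. This is an illustration under assumptions, not a port of HTML::LinkExtor: the LinkExtractor class, the sample markup, and the example.com URLs are invented for the sketch, and it parses a fixed string instead of fetching a page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Calls callback(tag, url) for every <a> tag that has an href.

    Relative hrefs are resolved against base_url first.
    """

    def __init__(self, callback, base_url):
        super().__init__()
        self.callback = callback
        self.base_url = base_url

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # skip tags other than <a>
        href = dict(attrs).get("href")
        if href is None:
            return  # skip <a> tags without a usable href
        self.callback(tag, urljoin(self.base_url, href))


collected = []


def process_link(tag, href):
    collected.append(href)
    print("- " + href)


# Hypothetical input standing in for a fetched page
content = (
    '<a href="/about">About</a>'
    '<a name="top">no href</a>'
    '<a href="http://example.org/x">X</a>'
)

extractor = LinkExtractor(process_link, "http://www.example.com/")
extractor.feed(content)
extractor.close()
```

The relative link /about comes back as http://www.example.com/about, the anchor without an href is skipped, and the absolute link is passed through unchanged.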

