Python HTML Parser

Date: 2020-02-23 14:42:47  Source: igfitidea

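Python's html.parser module provides the HTMLParser class, which can be subclassed to parse HTML-formatted text.
With an HTTP client, the same logic can easily be adapted to process HTML returned by an HTTP request.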
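As a rough sketch of the HTTP case, the parsing logic can live in a function that takes an HTML string, with the download kept separate. The names LinkCollector, collect_links, and fetch_links are hypothetical, and the standard-library urllib.request is only one possible HTTP client; the article does not name one.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html_text):
    """Parse an HTML string and return the links found in it."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


def fetch_links(url):
    """Download a page and parse it; assumes the URL is reachable."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html_text = response.read().decode(charset, errors="replace")
    return collect_links(html_text)
```

Keeping collect_links independent of the network makes the parsing step easy to try on a literal string before pointing fetch_links at a real page.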

The HTMLParser class is defined as follows:

class html.parser.HTMLParser(*, convert_charrefs=True)

In this article we will subclass HTMLParser to observe how it behaves and to work with what it reports.
Let's get started.

Python HTML Parser

As the class definition of HTMLParser shows, when convert_charrefs is True, all character references (except those inside script/style elements) are converted to the corresponding Unicode characters.
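The effect of convert_charrefs can be seen by recording which handlers fire in each mode. This is a minimal sketch; EventLogger and log_events are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records text and character-reference callbacks in order."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def log_events(html_text, convert_charrefs):
    """Feed html_text to a fresh parser and return the recorded events."""
    parser = EventLogger(convert_charrefs=convert_charrefs)
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return parser.events


# True: references arrive already decoded inside handle_data.
decoded = log_events("1 &lt; 2", convert_charrefs=True)
# False: references are reported through handle_entityref / handle_charref.
raw = log_events("1 &lt; 2", convert_charrefs=False)
```

With the default setting only handle_data is involved, so a subclass that relies on handle_entityref or handle_charref has to be created with convert_charrefs=False.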

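When an instance of this class encounters start tags, end tags, text, comments, and other markup elements in the HTML string passed to it, it automatically calls the corresponding handler methods (covered in the next section).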
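To make the callback model concrete, here is a small sketch that counts start tags. TagCounter and count_tags are hypothetical names; the parser calls handle_starttag on its own each time it meets an opening tag.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each start tag appears."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called by the parser itself; tag names arrive lowercased
        self.counts[tag] += 1


def count_tags(html_text):
    """Return a dict mapping tag name to number of start tags seen."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return dict(parser.counts)
```

For example, count_tags('<ul><li>a</li><li>b</li></ul>') reports one ul and two li start tags.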

To use this class, we subclass it and supply our own behavior.
Before giving an example, let's list the handler methods of the class that can be customized.
They are:

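handle_startendtag: called for XHTML-style empty tags such as <img ... />. By default it simply passes control to handle_starttag and handle_endtag, as its definition shows: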

def handle_startendtag(self, tag, attrs):
  self.handle_starttag(tag, attrs)
  self.handle_endtag(tag)
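The delegation above can be observed, and overridden, with a pair of small subclasses. This is a sketch only; DefaultDispatch, SelfClosingAware, and events_for are hypothetical names.

```python
from html.parser import HTMLParser


class DefaultDispatch(HTMLParser):
    """Relies on the default handle_startendtag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class SelfClosingAware(DefaultDispatch):
    """Overrides handle_startendtag to record empty tags separately."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


def events_for(parser_class, html_text):
    """Run html_text through a fresh parser_class instance."""
    parser = parser_class()
    parser.feed(html_text)
    parser.close()
    return parser.events
```

With the default behavior, '<br/>' produces a start event followed by an end event, while a plain '<br>' produces only the start event. The overriding subclass sees a single combined event instead.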

handle_starttag: handles the start tag of an element:

def handle_starttag(self, tag, attrs):
  pass

handle_endtag: handles the end tag of an element:

def handle_endtag(self, tag):
  pass

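handle_charref: handles numeric character references of the form &#NNN; and &#xNNN;. It is never called when convert_charrefs is True. Its definition is: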

def handle_charref(self, name):
  pass

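handle_entityref: handles named character references of the form &name;. It is likewise never called when convert_charrefs is True. Its definition is: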

def handle_entityref(self, name):
  pass

handle_data: handles the text data in the HTML string and is one of the most important methods of the class. Its definition is:

def handle_data(self, data):
  pass
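Since handle_data receives the text content, a common use is pulling readable text out of a page. The sketch below assumes script and style bodies should be skipped; TextExtractor and extract_text are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_text(html_text):
    """Return the visible text with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Accumulating chunks rather than assuming a single call matters because the parser is free to deliver one run of text across several handle_data calls.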

handle_comment: handles comments in the HTML. Its definition is:

def handle_comment(self, data):
  pass

handle_pi: handles processing instructions in the HTML. Its definition is:

def handle_pi(self, data):
  pass

handle_decl: handles declarations in the HTML, such as the doctype. Its definition is:

def handle_decl(self, decl):
  pass

Let's now write a subclass of HTMLParser to see some of these methods in action.

Subclassing HTMLParser

In this example we create a subclass of HTMLParser and see how its most common handler methods are called.
Here is a sample program that subclasses HTMLParser:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Found a start tag:", tag)

    def handle_endtag(self, tag):
        print("Found an end tag :", tag)

    def handle_data(self, data):
        print("Found some data  :", data)

parser = MyHTMLParser()
parser.feed('<title>theitroad HTMLParser</title>'
            '<h1>Python html.parse module</h1>')

Overriding HTMLParser Methods

In this example we override most of the handler methods of the HTMLParser class.
Here is the code for the class:

from html.parser import HTMLParser
from html.entities import name2codepoint

class JDParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        # attrs is a list of (name, value) tuples
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    # Only called when convert_charrefs is False
    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    # Only called when convert_charrefs is False
    def handle_charref(self, name):
        if name.startswith(('x', 'X')):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

parser = JDParser()

We will now use this class to parse various pieces of HTML.
We start with a doctype string:

parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
           '"https://www.w3.org/TR/html4/strict.dtd">')

The output of this program:

[Figure: HTMLParser doctype parsing]
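What handle_decl receives for that doctype can be checked with a small collector. DeclCollector and parse_decls are hypothetical names used only in this sketch.

```python
from html.parser import HTMLParser


class DeclCollector(HTMLParser):
    """Stores every declaration passed to handle_decl."""

    def __init__(self):
        super().__init__()
        self.decls = []

    def handle_decl(self, decl):
        # decl is the text between '<!' and '>'
        self.decls.append(decl)


def parse_decls(html_text):
    """Return the declarations found in html_text."""
    parser = DeclCollector()
    parser.feed(html_text)
    parser.close()
    return parser.decls


decls = parse_decls('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
                    '"https://www.w3.org/TR/html4/strict.dtd">')
# decls[0] starts with 'DOCTYPE HTML PUBLIC' and excludes the '<!' and '>'
```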

Next, a snippet that passes in an img tag:

parser.feed('<img src="https://cdn.theitroad.local/wp-content/uploads/2014/05/Final-JD-Logo.png" alt="The Python logo">')

When this runs, the tag name is reported first, and each of its attributes is then extracted as a (name, value) pair.
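Because attrs arrives as a list of (name, value) tuples, it converts directly into a dictionary. The following sketch collects the attributes of every img tag; ImageCollector and collect_images are hypothetical names.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Stores the attributes of each <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Attributes written without a value map to None
            self.images.append(dict(attrs))


def collect_images(html_text):
    """Return one attribute dict per <img> tag in html_text."""
    parser = ImageCollector()
    parser.feed(html_text)
    parser.close()
    return parser.images
```

A dict is convenient for lookups, but it keeps only the last value if an attribute name is repeated, which the original list of tuples would preserve.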

Let's try script/style tags, whose content is not parsed:

parser.feed('<script type="text/javascript">'
           'alert("theitroad Python");</script>')
parser.feed('<style type="text/css">#python { color: green }</style>')

In the output, the body of each element is reported as data rather than being parsed as markup.
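A quick way to confirm this is to put something that looks like a tag inside a script body and check that no start tag is reported for it. RawTextProbe and probe are hypothetical names in this sketch.

```python
from html.parser import HTMLParser


class RawTextProbe(HTMLParser):
    """Records start tags and accumulates all text data."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.data = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        # The script body may be delivered in more than one piece
        self.data.append(data)


def probe(html_text):
    """Return (list of start tags, concatenated text data)."""
    parser = RawTextProbe()
    parser.feed(html_text)
    parser.close()
    return parser.start_tags, "".join(parser.data)


tags, text = probe('<script type="text/javascript">alert("<b>hi</b>");</script>')
# Only 'script' is reported as a start tag; '<b>' stays part of the text
```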

Comments can also be parsed with this instance:

parser.feed('<!-- This marks the beginning of samples. -->'
          '<!--[if IE 9]>IE-specific content<![endif]-->')

In this way we can also pick up IE conditional comments and check whether a page carries IE-specific content.

[Figure: parsing comments]
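Picking out IE conditional comments can be sketched as a filter over the strings passed to handle_comment. CommentCollector and conditional_comments are hypothetical names, and the "[if " prefix test is a simplifying assumption rather than a full grammar for conditional comments.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Stores the text of every comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between '<!--' and '-->'
        self.comments.append(data)


def conditional_comments(html_text):
    """Return comments that look like IE conditional comments."""
    parser = CommentCollector()
    parser.feed(html_text)
    parser.close()
    return [c for c in parser.comments if c.startswith("[if ")]


found = conditional_comments('<!-- This marks the beginning of samples. -->'
                             '<!--[if IE 9]>IE-specific content<![endif]-->')
```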

Parsing Named and Numeric References

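Character references can also be parsed and converted to the correct characters at run time. The three references below all stand for '>'. Because handle_entityref and handle_charref are only called when convert_charrefs is False, this snippet uses a parser created with that setting:

ref_parser = JDParser(convert_charrefs=False)
ref_parser.feed('&gt;&#62;&#x3E;')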

The output of this program:

[Figure: parsing character references]
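The decoding step can also be written so that it returns the characters instead of printing them. This is a sketch under the assumption that convert_charrefs is set to False; RefDecoder and decode_refs are hypothetical names.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefDecoder(HTMLParser):
    """Decodes named and numeric character references one by one."""

    def __init__(self):
        # The two handlers below are only called in this mode
        super().__init__(convert_charrefs=False)
        self.chars = []

    def handle_entityref(self, name):
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Unknown names are re-emitted in &name; form, not decoded
            self.chars.append("&" + name + ";")
        else:
            self.chars.append(chr(codepoint))

    def handle_charref(self, name):
        # name is '62' for &#62; and 'x3E' for &#x3E;
        if name[0] in "xX":
            self.chars.append(chr(int(name[1:], 16)))
        else:
            self.chars.append(chr(int(name)))


def decode_refs(html_text):
    """Return the decoded character for each reference in html_text."""
    parser = RefDecoder()
    parser.feed(html_text)
    parser.close()
    return parser.chars
```

For plain strings with no markup to walk through, the standard-library function html.unescape performs the same conversion in a single call.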

Parsing Invalid HTML

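To some extent, the parser also copes with invalid HTML.
In the sample below the closing tags are in the wrong order, so the h1 and a elements are not properly nested: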

parser.feed('<h1><a class="link" href="#main">Invalid HTML</h1></a>')
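The parser does not check nesting; it reports tags in the order they appear. The sketch below records that order for the snippet above. TagEventLogger and tag_events are hypothetical names.

```python
from html.parser import HTMLParser


class TagEventLogger(HTMLParser):
    """Records start tags, end tags, and text in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tag_events(html_text):
    """Return the recorded events for html_text."""
    parser = TagEventLogger()
    parser.feed(html_text)
    parser.close()
    return parser.events


events = tag_events('<h1><a class="link" href="#main">Invalid HTML</h1></a>')
# The end tags come back as written: h1 first, then a
```

Any repair of the structure, such as matching each end tag against a stack of open elements, is left to the subclass.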
