How can I use the python HTMLParser library to extract data from a specific div tag?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3276040/

Date: 2020-08-18 09:21:52  Source: igfitidea

python html parsing html-parsing

Asked by Martin

I am trying to get a value out of an HTML page using the python HTMLParser library. The value I want to get hold of is within this html element:

...
<div id="remository">20</div>
...

This is my HTMLParser class so far:

class LinksParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.seen = {}

  def handle_starttag(self, tag, attributes):
    if tag != 'div': return
    for name, value in attributes:
      if name == 'id' and value == 'remository':
        #print value
        return

  def handle_data(self, data):
    print data


p = LinksParser()
f = urllib.urlopen("http://domain.com/somepage.html")
html = f.read()
p.feed(html)
p.close()

Can someone point me in the right direction? I want the class functionality to get the value 20.

Accepted answer by Alex Martelli

class LinksParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.recording = 0
    self.data = []

  def handle_starttag(self, tag, attributes):
    if tag != 'div':
      return
    if self.recording:
      self.recording += 1
      return
    for name, value in attributes:
      if name == 'id' and value == 'remository':
        break
    else:
      return
    self.recording = 1

  def handle_endtag(self, tag):
    if tag == 'div' and self.recording:
      self.recording -= 1

  def handle_data(self, data):
    if self.recording:
      self.data.append(data)

self.recording counts the number of nested div tags starting from a "triggering" one. When we're in the sub-tree rooted in a triggering tag, we accumulate the data in self.data.

The data at the end of the parse are left in self.data (a list of strings, possibly empty if no triggering tag was met). Your code from outside the class can access the list directly from the instance at the end of the parse, or you can add appropriate accessor methods for the purpose, depending on what exactly your goal is.
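As a rough illustration of the counter described above, here is a minimal sketch of the same class ported to Python 3, where the module lives at html.parser rather than HTMLParser. The sample markup is made up for this demonstration, and the attribute check is written as a membership test instead of the for/else loop; the rest follows the answer's logic.

```python
from html.parser import HTMLParser  # Python 3 home of the Python 2 HTMLParser module


class LinksParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.recording = 0  # nesting depth inside the triggering div; 0 means not recording
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        if self.recording:
            # A div nested inside the triggering one: go one level deeper.
            self.recording += 1
            return
        # attributes is a list of (name, value) pairs.
        if ('id', 'remository') in attributes:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)


# Hypothetical sample markup, invented for this demonstration.
sample = ('<p>before</p>'
          '<div id="remository">20<div>nested</div></div>'
          '<div>after</div>')

p = LinksParser()
p.feed(sample)
p.close()
print(p.data)           # the list of collected strings
print(''.join(p.data))  # the same strings joined into one
```

Text outside the triggering div is skipped, while text in divs nested inside it is collected, because the counter only returns to zero when the triggering div itself closes.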

The class could easily be made a bit more general by using, in lieu of the constant literal strings seen in the code above ('div', 'id', and 'remository'), instance attributes self.tag, self.attname and self.attvalue, set by __init__ from arguments passed to it -- I avoided that cheap generalization step in the code above to avoid obscuring the core points (keep track of a count of nested tags and accumulate data into a list when the recording state is active).
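One possible shape for that generalization, sketched for Python 3 (html.parser). The class name TagDataParser and the sample markup are hypothetical; the attribute names self.tag, self.attname and self.attvalue are the ones suggested above.

```python
from html.parser import HTMLParser


class TagDataParser(HTMLParser):
    """Collect the text inside <tag attname="attvalue">, including nested tags."""

    def __init__(self, tag, attname, attvalue):
        super().__init__()
        self.tag = tag
        self.attname = attname
        self.attvalue = attvalue
        self.recording = 0  # nesting depth of self.tag inside the triggering element
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != self.tag:
            return
        if self.recording:
            self.recording += 1
            return
        if (self.attname, self.attvalue) in attributes:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)


# The original question's case, expressed through the constructor arguments.
div_parser = TagDataParser('div', 'id', 'remository')
div_parser.feed('<div id="remository">20</div>')
div_parser.close()
print(div_parser.data)

# A different, made-up target to show the same class being reused.
span_parser = TagDataParser('span', 'class', 'price')
span_parser.feed('<span class="price">42</span><span class="note">ignored</span>')
span_parser.close()
print(span_parser.data)
```

Note that the parser lowercases tag and attribute names before calling the handlers, so the tag and attname arguments should be given in lowercase.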

Answer by pshirishreddy

A small correction at line 3:

HTMLParser.HTMLParser.__init__(self)

should be

HTMLParser.__init__(self)

(This applies when the class is imported with from HTMLParser import HTMLParser, as in the code below; with a plain import HTMLParser the original line is correct.)

The following worked for me though

import urllib2
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

  def __init__(self):
    HTMLParser.__init__(self)
    self.recording = 0
    self.data = []

  def handle_starttag(self, tag, attrs):
    if tag == 'required_tag':
      for name, value in attrs:
        if name == 'somename' and value == 'somevalue':
          print name, value
          print "Encountered the beginning of a %s tag" % tag
          self.recording = 1

  def handle_endtag(self, tag):
    # Only count down while recording, so the counter never goes negative
    # (a negative value is truthy and would switch recording back on).
    if tag == 'required_tag' and self.recording:
      self.recording -= 1
      print "Encountered the end of a %s tag" % tag

  def handle_data(self, data):
    if self.recording:
      self.data.append(data)

p = MyHTMLParser()
f = urllib2.urlopen('http://www.someurl.com')
html = f.read()
p.feed(html)
p.close()
print p.data

Answer by modzello86

Have you tried BeautifulSoup?

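from bs4 import BeautifulSoup
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup('<div id="remository">20</div>', 'html.parser')
tag = soup.div
print(tag.string)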

This prints 20.

Answer by helu

This works perfectly:

print(soup.find('the tag').text)