Python: Extracting specific data with html parser
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/16773583/
Asked by IssnKissn
I started using the HTMLParser in Python to extract data from a website. I get everything I wanted, except the text within two tags of HTML. Here is an example of the HTML tag:
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
There are also other tags starting with <a. They have other attributes and values, and therefore I do not want their data:
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
The tag is an embedded tag within a table. I don't know if this makes any difference between other tags. I only want the information in some of the tags called 'a' with the attribute class="Vocabulary" and I want the data within the tag, in the example it would be "Swahili". So what I did is:
class AllLanguages(HTMLParser):
    '''
    classdocs
    '''
    #counter for the languages
    #countLanguages = 0
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None
        #self.text = ""
    def handle_starttag(self, tag, attr):
        #print "Encountered a start tag:", tag      
        if tag == 'a':
            for name, value in attr:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag
                    #self.lastname = name
                    #self.lastvalue = value
                    print self.lasttag
                    #print self.lastname
                    #print self.lastvalue
                    #return tag
                    print self.countLanguages
    def handle_endtag(self, tag):
        if tag == "a":
            self.inlink = False
            #print "".join(self.data)
    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            #self.dataArray.append(data)
            #
            print data
The program prints the data of every <a> tag, but I only want the data of the tags with the right attributes. How do I get this specific data?
Accepted answer by alecxe
Looks like you forgot to set self.inLink = False in handle_starttag by default:
from HTMLParser import HTMLParser
class AllLanguages(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None
    def handle_starttag(self, tag, attrs):
        self.inLink = False
        if tag == 'a':
            for name, value in attrs:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag
    def handle_endtag(self, tag):
        if tag == "a":
            self.inLink = False
    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            print data
parser = AllLanguages()
parser.feed("""
<html>
<head><title>Test</title></head>
<body>
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="English" class="Vocabulary">English</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="Russian" class="Vocabulary">Russian</a>
</body>
</html>""")
prints:
Swahili
English
Russian
Hope that helps.
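For Python 3, the same idea could look roughly like the sketch below. It is not the answerer's code: the class name VocabularyLinks, the in_link flag and the names list are made up for illustration. It also addresses what appears to be the root cause in the question: handle_endtag there assigns self.inlink (lowercase l) while the rest of the code reads self.inLink, so the flag is never reset once the first matching link has been seen. The sketch uses a single spelling and collects the link texts in a list instead of printing them.

```python
from html.parser import HTMLParser


class VocabularyLinks(HTMLParser):
    """Collect the text of every <a class="Vocabulary"> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # True only while inside a matching <a>
        self.names = []       # collected link texts

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a" and ("class", "Vocabulary") in attrs:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            # Same attribute name as in __init__, so the flag really resets
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.names.append(data.strip())


sample = """
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="English" class="Vocabulary">English</a>
"""

parser = VocabularyLinks()
parser.feed(sample)
parser.close()
print(parser.names)
```

The exact comparison against "Vocabulary" mirrors the question's code. A tag with several classes (for example class="Vocabulary active") would not match and would need the value split on whitespace first.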
Answer by seagulf
You may try HTQL (http://htql.net). The query for:
"the tags called 'a' with the attribute class="Vocabulary" and I want the data within the tag"
is:
<a (class='Vocabulary')>:tx 
The Python code is something like this:
import htql
a=htql.query(page, "<a (class='Vocabulary')>:tx")
print(a)
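The snippet above does not show where page comes from. Presumably it is the HTML source of the page as a string. A minimal Python 3 sketch of one way to get it with the standard library follows. load_page is a hypothetical helper name, not part of htql, and the UTF-8 fallback is an assumption.

```python
from urllib.request import urlopen


def load_page(url, default_encoding="utf-8"):
    """Hypothetical helper: download a URL and return its body as text."""
    with urlopen(url) as response:
        raw = response.read()
        # Prefer the charset announced in the Content-Type header, if any
        encoding = response.headers.get_content_charset() or default_encoding
    return raw.decode(encoding, errors="replace")
```

The result of load_page(url) could then be passed as page, with url being whichever page of the site lists the languages.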

