Python: Using the HTML parser to extract specific data

Question

Asked by IssnKissn

I started using the HTMLParser in Python to extract data from a website. I get everything I wanted, except the text within two tags of HTML. Here is an example of the HTML tag:

<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>

There are also other tags starting with <a. They have other attributes and values, so I do not want their data:

<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>

The tag is embedded within a table; I don't know whether that sets it apart from other tags. I only want the information in those 'a' tags that have the attribute class="Vocabulary", and I want the data within the tag; in the example it would be "Swahili". So what I did is:

class AllLanguages(HTMLParser):
    '''
    classdocs
    '''
    #counter for the languages
    #countLanguages = 0
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None
        #self.text = ""


    def handle_starttag(self, tag, attr):
        #print "Encountered a start tag:", tag      
        if tag == 'a':
            for name, value in attr:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag
                    #self.lastname = name
                    #self.lastvalue = value
                    print self.lasttag
                    #print self.lastname
                    #print self.lastvalue
                    #return tag
                    print self.countLanguages




    def handle_endtag(self, tag):
        if tag == "a":
            self.inlink = False
            #print "".join(self.data)

    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            #self.dataArray.append(data)
            #
            print data

The program prints the data contained in every <a> tag, but I only want the data in the tags with the right attributes. How do I get this specific data?

Answer 1

Accepted answer by alecxe

Looks like you forgot to set self.inLink = False in handle_starttag by default:

from HTMLParser import HTMLParser


class AllLanguages(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None

    def handle_starttag(self, tag, attrs):
        self.inLink = False
        if tag == 'a':
            for name, value in attrs:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag

    def handle_endtag(self, tag):
        if tag == "a":
            self.inLink = False  # capital L, matching the attribute set in __init__

    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            print data


parser = AllLanguages()
parser.feed("""
<html>
<head><title>Test</title></head>
<body>
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="English" class="Vocabulary">English</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="Russian" class="Vocabulary">Russian</a>
</body>
</html>""")

prints:

Swahili
English
Russian
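
For readers on Python 3, here is a minimal sketch of the same idea, assuming the standard library's html.parser module. The names VocabularyLinkParser, extract_languages, in_link and languages are made up for this illustration and do not come from the question. It collects the link texts in a list instead of printing them, and it resets the flag in handle_endtag. Note that the question's handle_endtag assigns self.inlink (lowercase l), which silently creates a new attribute and leaves self.inLink set to True.

```python
from html.parser import HTMLParser


class VocabularyLinkParser(HTMLParser):
    """Collect the text of every <a class="Vocabulary"> element (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # True only while inside a matching <a> tag
        self.languages = []   # collected link texts

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict() makes lookup easy.
        if tag == "a" and dict(attrs).get("class") == "Vocabulary":
            self.in_link = True

    def handle_endtag(self, tag):
        # Leave "inside a matching link" state as soon as any <a> closes.
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_link and text:
            self.languages.append(text)


def extract_languages(html):
    """Return the texts of all Vocabulary links in an HTML string."""
    parser = VocabularyLinkParser()
    parser.feed(html)
    parser.close()
    return parser.languages


sample = """
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="English" class="Vocabulary">English</a>
"""
print(extract_languages(sample))  # ['Swahili', 'English']
```

The comparison against "Vocabulary" is an exact match, as in the original code; a tag carrying several classes (for example class="Vocabulary other") would need a split on whitespace instead.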

Hope that helps.

Answer 2

Answer by seagulf

You may try HTQL (http://htql.net). The query for:

"the tags called 'a' with the attribute class="Vocabulary" and I want the data within the tag"

is:

<a (class='Vocabulary')>:tx

The Python code is something like this:

import htql
a=htql.query(page, "<a (class='Vocabulary')>:tx")
print(a)
