Get all HTML tags with Beautiful Soup in Python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36108621/


Get all HTML tags with Beautiful Soup

python, html, beautifulsoup

Asked by humanbeing

I am trying to get a list of all HTML tags from Beautiful Soup.


I see find_all, but I have to know the name of the tag before I search.


If there is text like


html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

How would I get a list like


list_of_tags = ["<div>", "<div>", "<div class='magical'>", "<p>"]

I know how to do this with regex, but I am trying to learn BS4.


Answered by alecxe

You don't have to specify any arguments to find_all(); in this case, BeautifulSoup would find you every tag in the tree, recursively. Sample:


>>> from bs4 import BeautifulSoup
>>>
>>> html = """<div>something</div>
... <div>something else</div>
... <div class='magical'>hi there</div>
... <p>ok</p>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> [tag.name for tag in soup.find_all()]
[u'div', u'div', u'div', u'p']
>>> [str(tag) for tag in soup.find_all()]
['<div>something</div>', '<div>something else</div>', '<div class="magical">hi there</div>', '<p>ok</p>']
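
If you specifically want the opening tags in the question's format, one possible sketch (an addition, not part of the original answer) is to rebuild them from each tag's name and attrs; note that multi-valued attributes such as class come back as lists, and the rebuilt tags use double quotes:

>>> def opening_tag(tag):
...     # Rebuild just the opening tag from its name and attributes;
...     # multi-valued attributes (e.g. class) are lists, so join them.
...     attrs = "".join(
...         ' {}="{}"'.format(k, " ".join(v) if isinstance(v, list) else v)
...         for k, v in tag.attrs.items())
...     return "<{}{}>".format(tag.name, attrs)
...
>>> [opening_tag(tag) for tag in soup.find_all()]
['<div>', '<div>', '<div class="magical">', '<p>']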

Answered by Jason R Stevens CFA

I thought I'd share my solution to a very similar question, for those who find themselves here later.


Example


I needed to find all tags quickly but only wanted unique values. I'll use the Python calendar module to demonstrate.


We'll generate an HTML calendar, then parse it, finding all (and only) the unique tags present.


The structure below is very similar to the above, using set comprehensions:


>>> from bs4 import BeautifulSoup
>>> import calendar
>>>
>>> html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
>>> set(tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all())
{'table', 'td', 'th', 'tr'}
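
If you also want to know how many times each tag occurs, not just which tags are present, collections.Counter drops in the same way (my extension, not part of the original answer; the counts shown assume January 2020 with the default Monday-first calendar):

>>> from collections import Counter
>>> Counter(tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all())
Counter({'td': 35, 'th': 8, 'tr': 7, 'table': 1})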

Answered by Anjan

Please try the below:


for tag in soup.findAll(True):
    print(tag.name)
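
For completeness, here is a self-contained version of the same idea using the question's markup (a sketch, assuming bs4 and the built-in html.parser):

from bs4 import BeautifulSoup

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

soup = BeautifulSoup(html, "html.parser")
# find_all(True) (findAll is the older camelCase alias) matches every tag in the tree.
for tag in soup.find_all(True):
    print(tag.name)  # prints: div, div, div, p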

Answered by Belkacem Thiziri

Here is an efficient function that I use to parse different HTML and text documents:


from bs4 import BeautifulSoup
from tqdm import tqdm  # progress bar over the file loop

def parse_docs(path, format, tags):
    """
    Parse the different files in path, having html or txt format, and extract the text content.
    Returns a list of strings, where every string is a text document content.
    :param path: str
    :param format: str
    :param tags: list
    :return: list
    """

    docs = []
    if format == "html":
        for document in tqdm(get_list_of_files(path)):
            # print(document)
            soup = BeautifulSoup(open(document, encoding='utf-8').read())
            text = '\n'.join([''.join(s.findAll(text=True)) for s in
                              soup.findAll(tags)])  # parse all <p>, <div>, and <h> tags
            docs.append(text)
    else:
        for document in tqdm(get_list_of_files(path)):
            text = open(document, encoding='utf-8').read()
            docs.append(text)
    return docs

A simple call, parse_docs('/path/to/folder', 'html', ['p', 'h', 'div']), will return a list of text strings.

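Note that get_list_of_files is the author's own helper and is not shown here; a minimal stand-in (my assumption, not the original code) could simply list the files directly under the folder:

import glob
import os

def get_list_of_files(path):
    # Hypothetical stand-in for the author's helper: return every regular
    # file directly under `path`.
    return [p for p in glob.glob(os.path.join(path, '*')) if os.path.isfile(p)]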