Get all HTML tags with Beautiful Soup in Python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36108621/


Get all HTML tags with Beautiful Soup

python, html, beautifulsoup

Asked by humanbeing

I am trying to get a list of all HTML tags from Beautiful Soup.


I see find_all, but I have to know the name of the tag before I search.


If there is text like


html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

How would I get a list like


list_of_tags = ["<div>", "<div>", "<div class='magical'>", "<p>"]

I know how to do this with regex, but I am trying to learn BS4.


Answered by alecxe

You don't have to specify any arguments to find_all(); in this case, BeautifulSoup would find you every tag in the tree, recursively. Sample:


>>> from bs4 import BeautifulSoup
>>>
>>> html = """<div>something</div>
... <div>something else</div>
... <div class='magical'>hi there</div>
... <p>ok</p>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> [tag.name for tag in soup.find_all()]
[u'div', u'div', u'div', u'p']
>>> [str(tag) for tag in soup.find_all()]
['<div>something</div>', '<div>something else</div>', '<div class="magical">hi there</div>', '<p>ok</p>']
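
If you specifically want the opening tags in the question's format, one possible sketch (an addition, not part of the original answer) is to rebuild them from each tag's name and attrs; note that multi-valued attributes such as class come back as lists, and the rebuilt tags use double quotes:

>>> def opening_tag(tag):
...     # Rebuild just the opening tag from its name and attributes;
...     # multi-valued attributes (e.g. class) are lists, so join them.
...     attrs = "".join(
...         ' {}="{}"'.format(k, " ".join(v) if isinstance(v, list) else v)
...         for k, v in tag.attrs.items())
...     return "<{}{}>".format(tag.name, attrs)
...
>>> [opening_tag(tag) for tag in soup.find_all()]
['<div>', '<div>', '<div class="magical">', '<p>']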

Answered by Jason R Stevens CFA

I thought I'd share my solution to a very similar question, for those who find themselves here later.


Example


I needed to find all tags quickly but only wanted unique values. I'll use the Python calendar module to demonstrate.


We'll generate an HTML calendar, then parse it, finding all (and only) the unique tags present.


The structure below is very similar to the above, using set comprehensions:


>>> from bs4 import BeautifulSoup
>>> import calendar
>>>
>>> html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
>>> set(tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all())
{'table', 'td', 'th', 'tr'}
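
If you also want to know how many times each tag occurs, not just which tags are present, collections.Counter drops in the same way (my extension, not part of the original answer; the counts shown assume January 2020 with the default Monday-first calendar):

>>> from collections import Counter
>>> Counter(tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all())
Counter({'td': 35, 'th': 8, 'tr': 7, 'table': 1})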

Answered by Anjan

Please try the below:


for tag in soup.findAll(True):
    print(tag.name)
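
For completeness, here is a self-contained version of the same idea using the question's markup (a sketch, assuming bs4 and the built-in html.parser):

from bs4 import BeautifulSoup

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

soup = BeautifulSoup(html, "html.parser")
# find_all(True) (findAll is the older camelCase alias) matches every tag in the tree.
for tag in soup.find_all(True):
    print(tag.name)  # prints: div, div, div, p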

Answered by Belkacem Thiziri

Here is an efficient function that I use to parse different HTML and text documents:


from bs4 import BeautifulSoup
from tqdm import tqdm  # progress bar over the file loop

def parse_docs(path, format, tags):
    """
    Parse the different files in path, having html or txt format, and extract the text content.
    Returns a list of strings, where every string is a text document content.
    :param path: str
    :param format: str
    :param tags: list
    :return: list
    """

    docs = []
    if format == "html":
        for document in tqdm(get_list_of_files(path)):
            # print(document)
            soup = BeautifulSoup(open(document, encoding='utf-8').read())
            text = '\n'.join([''.join(s.findAll(text=True)) for s in
                              soup.findAll(tags)])  # parse all <p>, <div>, and <h> tags
            docs.append(text)
    else:
        for document in tqdm(get_list_of_files(path)):
            text = open(document, encoding='utf-8').read()
            docs.append(text)
    return docs

A simple call, parse_docs('/path/to/folder', 'html', ['p', 'h', 'div']), will return a list of text strings.

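Note that get_list_of_files is the author's own helper and is not shown here; a minimal stand-in (my assumption, not the original code) could simply list the files directly under the folder:

import glob
import os

def get_list_of_files(path):
    # Hypothetical stand-in for the author's helper: return every regular
    # file directly under `path`.
    return [p for p in glob.glob(os.path.join(path, '*')) if os.path.isfile(p)]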