Python/BeautifulSoup - How to remove all tags from an element?

Question

Asked by Daniele B

How can I simply strip all tags from an element I find in BeautifulSoup?

Answer 1

Answered by Daniele B

It looks like this is the way to do it, and it is as simple as that.

This line joins together all the text parts within the current element:

''.join(htmlelement.find_all(text=True))
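The joining approach above can be sketched as follows. This is a minimal example, assuming a made-up HTML snippet and the built-in html.parser; the names html and element are illustrative only. Recent Beautiful Soup releases spell the text=True argument as string=True, which is what the sketch uses.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question
html = '<p>Hello <b>bold</b> and <i>italic</i> world</p>'
soup = BeautifulSoup(html, 'html.parser')
element = soup.find('p')

# find_all(string=True) returns every text node under the element,
# in document order; joining them yields the text without any tags
text = ''.join(element.find_all(string=True))
print(text)
```

Note that find (rather than find_all) would return only the first text node, so the join would give back just that one piece of text.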

Answer 2

Answered by danblack

You can use the decompose method in bs4. Note that decompose removes each child tag together with its contents:

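import bs4

soup = bs4.BeautifulSoup('<body><a href="http://example.com/">I linked to <i>example.com</i></a></body>', 'html.parser')

# Copy the children into a list first, because decompose() changes the tree while we loop
for child in list(soup.find('a').children):
    if isinstance(child, bs4.element.Tag):
        child.decompose()  # removes the tag and everything inside it

print(soup)

Out: <body><a href="http://example.com/">I linked to </a></body>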

Answer 3

Answered by Bobby

Why has no answer I've seen mentioned anything about the unwrap method? Or, even easier, the get_text method?

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unwrap http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
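To illustrate the difference between the two methods, here is a small sketch, assuming a made-up snippet and the built-in html.parser. get_text returns a plain string and leaves the tree alone, while unwrap edits the tree in place by replacing a tag with its contents.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
html = '<p>I linked to <a href="http://example.com/"><i>example.com</i></a></p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')

# get_text() returns the text beneath the element as one string
plain = p.get_text()
print(plain)

# unwrap() replaces each nested tag with its own contents,
# so the element itself stays in the tree but its inner tags are gone
for tag in p.find_all(True):
    tag.unwrap()
print(p)
```

Use get_text when you only need the string, and unwrap when you want to keep working with the element in the parsed tree.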

Answer 4

Answered by shawnl

With BeautifulStoneSoup gone in bs4, it's even simpler in Python 3:

from bs4 import BeautifulSoup

# html is a string containing the markup to strip
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()
print(text)

Answer 5

Answered by SparkAndShine

Use get_text(). It returns all the text in a document or beneath a tag as a single Unicode string.

For instance, suppose you want to remove all the tags from the following markup:

<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>

The expected result is:

Signal et Communication
Ingénierie Réseaux et Télécommunications

Here is the source code:

#!/usr/bin/env python3
from bs4 import BeautifulSoup

text = '''
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
'''
soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly to avoid a warning

print(soup.get_text())
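The plain get_text() call keeps the newlines that surround the tags in the input. If you want exactly the two lines shown in the expected result, one option is to pass the separator and strip arguments of get_text. The sketch below assumes the built-in html.parser; the variable names markup and clean are illustrative.

```python
from bs4 import BeautifulSoup

markup = '''
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
'''
soup = BeautifulSoup(markup, 'html.parser')

# strip=True trims each text fragment and drops whitespace-only ones;
# separator='\n' then joins the remaining fragments with a newline
clean = soup.get_text(separator='\n', strip=True)
print(clean)
```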

Answer 6

Answered by Chaitanya Mallepudi

Here is the source code. It fetches the page at URL and prints the text that the page contains:

import bs4
import requests

URL = ''  # set this to the address of the page to fetch
page = requests.get(URL)
text = bs4.BeautifulSoup(page.content, 'html.parser').get_text()
print(text)

Answer 7

Answered by Shaurya Sheth

The following code gets the contents as text instead of HTML. html_text is the HTML string to extract the text from:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'lxml')
text = soup.get_text()
print(text)
