Python/BeautifulSoup - how to remove all tags from an element?
Note: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original address, and attribute it to the original authors: StackOverflow
Original: http://stackoverflow.com/questions/16206380/
Asked by Daniele B
How can I simply strip all tags from an element I find in BeautifulSoup?
Answered by Daniele B
It looks like this is the way to do it, as simple as that.
This line joins together all the text parts within the current element:
''.join(htmlelement.find_all(text=True))
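The idea of joining an element's text nodes can be sketched as follows. The sample markup and variable names are made up for illustration, and `string=True` is assumed to be available (it is the newer spelling of `text=True` in Beautiful Soup 4.4 and later).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original answer.
html = '<p>Hello <b>bold</b> and <i>italic</i> world</p>'
soup = BeautifulSoup(html, 'html.parser')
htmlelement = soup.find('p')

# find_all(string=True) returns every text node under the element,
# in document order, without the surrounding tags.
parts = htmlelement.find_all(string=True)

# Joining the text nodes gives the element's text with all tags stripped.
joined = ''.join(parts)
print(joined)
```

For this input the joined string should match what `htmlelement.get_text()` returns.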
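Answered by danblack
You can use the decompose method in bs4. Note that it removes each child tag together with its contents:
import bs4
# The output below assumes the lxml parser, which adds the <html> wrapper.
soup = bs4.BeautifulSoup('<body><a href="http://example.com/">I linked to <i>example.com</i></a></body>', 'lxml')
# Copy the children into a list first, because decompose() modifies the tree.
for a in list(soup.find('a').children):
    if isinstance(a, bs4.element.Tag):
        a.decompose()
print(soup)
Out: <html><body><a href="http://example.com/">I linked to </a></body></html>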
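Because decompose() discards a tag along with everything inside it, it seems best suited to dropping subtrees whose text you do not want at all. The sketch below assumes a made-up snippet containing `script` and `style` elements; it is an illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with elements whose contents should not appear in the text.
html = '<div>Visible text<script>var hidden = 1;</script><style>p {color: red}</style></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all() collects the matches up front, so destroying them in the loop is safe.
for tag in soup.find_all(['script', 'style']):
    tag.decompose()

remaining = str(soup)
print(remaining)
```

After the loop, only the `div` and its visible text should be left in the tree.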
Answered by Bobby
Why has no answer I've seen mentioned anything about the unwrap method? Or, even easier, the get_text method:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unwrap
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
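A minimal sketch of the two methods side by side, assuming a made-up paragraph as input: get_text() returns the text without touching the tree, while unwrap() replaces a tag with its own contents in place.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = '<p>I linked to <a href="http://example.com/"><i>example.com</i></a> today</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')

# get_text() leaves the tree alone and returns only the text.
text = p.get_text()

# unwrap() removes a tag but keeps its children, modifying the tree in place.
# find_all(True) matches every tag nested inside the paragraph.
for tag in p.find_all(True):
    tag.unwrap()

print(text)
print(p)
```

Use get_text() when you only need a string, and unwrap() when you want to keep working with the element after its inner tags are gone.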
Answered by shawnl
With BeautifulStoneSoup gone in bs4, it's even simpler in Python 3:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.get_text()
print(text)
Answered by SparkAndShine
Use get_text(). It returns all the text in a document or beneath a tag, as a single Unicode string.
For instance, to remove all the markup tags from the following text:
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
The expected result is:
Signal et Communication
Ingénierie Réseaux et Télécommunications
Here is the source code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
text = '''
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
'''
soup = BeautifulSoup(text)
print(soup.get_text())
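A plain get_text() call keeps the whitespace that sits between the tags. As a hedged variation on the same input, get_text() also accepts `separator` and `strip` arguments, which can produce one clean line per text fragment. The choice of `'html.parser'` here is an assumption made so the sketch does not depend on an extra parser being installed.

```python
from bs4 import BeautifulSoup

text = '''
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
'''
soup = BeautifulSoup(text, 'html.parser')

# strip=True trims each text fragment and skips whitespace-only ones;
# separator='\n' puts the remaining fragments on separate lines.
lines = soup.get_text(separator='\n', strip=True)
print(lines)
```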
Answered by Chaitanya Mallepudi
Here is the source code; it gets the text of the page at the given URL:
import bs4
import requests
URL = ''  # left empty in the original answer; put the page address here
page = requests.get(URL)
text = bs4.BeautifulSoup(page.content, 'html.parser').get_text()
print(text)
Answered by Shaurya Sheth
Code to simply get the contents as text instead of HTML.
The 'html_text' variable is the HTML string you pass to BeautifulSoup to get the text:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text, 'lxml')
text = soup.get_text()
print(text)

