Python, remove all html tags from string

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/37018475/

Time: 2020-08-19 18:42:42  Source: igfitidea

Tags: python, html, string, parsing, beautifulsoup

Asked by Mustard Tiger

I am trying to access the article content from a website, using BeautifulSoup with the code below:

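import urllib2
from bs4 import BeautifulSoup

site = 'http://www.example.com'
req = urllib2.Request(site)  # build the request that urlopen expects
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, 'html.parser')
content = soup.find_all('p')
content = str(content)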

The content object contains all of the main text from the page that is within the 'p' tags. However, other tags are still present in the output, as can be seen in the image below. I would like to remove the tags themselves, that is, the < > delimiters and every character enclosed in them, so that only the text remains.

I have tried the following method, but it does not seem to work.

' '.join(item for item in content.split() if not (item.startswith('<') and item.endswith('>')))

What is the best way to remove substrings of a string that begin and end with a certain pattern, such as < >?

enter image description here

Answered by Ani Menon

Using a regex:

import re
re.sub('<[^<]+?>', '', text)
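
As a minimal sketch of the regex approach (not part of the original answer; the name strip_tags_regex is made up for illustration), the pattern can be wrapped in a small Python 3 function. It assumes reasonably well-formed markup: a regex does not understand HTML, so a literal '<' in the text or a '>' inside an attribute value can confuse it.

```python
import re

# '<', then one or more characters that are not '<' (lazily), up to the next '>'.
TAG_RE = re.compile(r'<[^<]+?>')


def strip_tags_regex(text):
    """Remove anything that looks like an HTML tag from text."""
    return TAG_RE.sub('', text)
```

For example, strip_tags_regex('<p>Hello <b>world</b></p>') leaves only the text between the tags. HTML entities such as &nbsp; are not touched by this pattern.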

Using BeautifulSoup (solution from here):

import urllib
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
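
The whitespace cleanup at the end of the snippet above does not depend on BeautifulSoup, so it can be pulled out and tried on any string. The following is a sketch in Python 3 under that assumption; the function name tidy_text is hypothetical and does not come from the original answer.

```python
def tidy_text(text):
    """Turn loosely formatted text into non-empty, stripped lines."""
    # Strip leading and trailing whitespace on each line.
    lines = (line.strip() for line in text.splitlines())
    # Split each line on double spaces, so side-by-side headlines
    # become separate chunks.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and join the rest with newlines.
    return "\n".join(chunk for chunk in chunks if chunk)
```

Calling tidy_text on the result of soup.get_text() should give the same cleanup as the three generator lines in the answer.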

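Using NLTK (this relies on nltk.clean_html, which only works in NLTK 2.x; in NLTK 3 and later it raises NotImplementedError and recommends BeautifulSoup's get_text() instead):

import nltk
from urllib import urlopen
url = "https://stackoverflow.com/questions/tagged/python"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)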

Answered by JRodDynamite

You could use get_text() on each element returned by find_all (before converting the result to a string):

for i in content:
    print(i.get_text())

The example below is from the docs:

>>> markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
>>> soup = BeautifulSoup(markup)
>>> soup.get_text()
u'\nI linked to example.com\n'
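
If BeautifulSoup is not available, a similar effect can be approximated with the standard library's html.parser. The sketch below is an illustration in Python 3, not BeautifulSoup's actual implementation; TextExtractor and html_to_text are made-up names, and skipping script and style content is an assumption about what is wanted.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns references such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        return "".join(self._parts)


def html_to_text(markup):
    """Return the text content of markup with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Passing the markup string from the docs example to html_to_text should return the same text that get_text() shows there.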

Answered by Burhan Khalid

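You need to use the strings generator. Since find_all returns a list of tags, iterate over the tags and read .strings on each one:

for p in soup.find_all('p'):
    for text in p.strings:
        print(text)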

Answered by PaulMcG

Pyparsing makes it easy to write an HTML stripper by defining a pattern matching all opening and closing HTML tags, and then transforming the input using that pattern as a suppressor. This still leaves the &xxx; HTML entities to be converted; you can use xml.sax.saxutils.unescape to do that:

source = """
<p><strong>Editors' Pick: Originally published March 22.<br /> <br /> Apple</strong> <span class=" TICKERFLAT">(<a href="/quote/AAPL.html">AAPL</a> - <a href="http://secure2.thestreet.com/cap/prm.do?OID=028198&amp;ticker=AAPL">Get Report</a><a class=" arrow" href="/quote/AAPL.html"><span class=" tickerChange" id="story_AAPL"></span></a>)</span> is waking up the echoes with the reintroduction of a&nbsp;4-inch iPhone, a model&nbsp;its creators hope will lead the company to victory not just in emerging markets, but at home as well.</p> 
<p>&quot;There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features,&quot; Hymandaw Research Chief Analyst Jan Dawson said in e-mailed comments.</p> 
<p>The new model, dubbed the iPhone SE, &quot;should unleash a decent upgrade cycle over the coming months,&quot; Dawson said.&nbsp;Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.</p>
<div class=" butonTextPromoAd">
 <div class=" ym" id="ym_44444440"></div>"""

from pyparsing import anyOpenTag, anyCloseTag
from xml.sax.saxutils import unescape

def unescape_xml_entities(s):
    return unescape(s, {"&apos;": "'", "&quot;": '"', "&nbsp;": " "})

stripper = (anyOpenTag | anyCloseTag).suppress()

print(unescape_xml_entities(stripper.transformString(source)))

gives:

Editors' Pick: Originally published March 22.  Apple (AAPL - Get Report) is waking up the echoes with the reintroduction of a 4-inch iPhone, a model its creators hope will lead the company to victory not just in emerging markets, but at home as well. 
"There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features," Hymandaw Research Chief Analyst Jan Dawson said in e-mailed comments. 
The new model, dubbed the iPhone SE, "should unleash a decent upgrade cycle over the coming months," Dawson said. Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.
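
In Python 3, the standard library's html.unescape converts named and numeric character references in one call, which may be a convenient alternative for the entity step. This is a sketch, not the answer author's method; unescape_entities is a hypothetical name, and mapping the non-breaking space to a plain space is an assumption about the desired output.

```python
import html


def unescape_entities(text):
    """Convert HTML character references to the characters they stand for."""
    # html.unescape turns &nbsp; into U+00A0 (non-breaking space);
    # replace it with an ordinary space to match the output above.
    return html.unescape(text).replace("\u00a0", " ")
```

It can be applied to the result of stripper.transformString(source) in place of unescape_xml_entities.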

(And in future, please do not provide sample text or code as non-copy-pasteable images.)

Answered by Krishnadas V A

If you are restricted from using any library, you can simply use the code below to remove HTML tags.

I just corrected what you tried. Thanks for the idea.

content="<h4 style='font-size: 11pt; color: rgb(67, 67, 67); font-family: arial, sans-serif;'>Sample text for display.</h4> <p>&nbsp;</p>"


' '.join([word for line in [item.strip() for item in content.replace('<',' <').replace('>','> ').split('>') if not (item.strip().startswith('<') or (item.strip().startswith('&') and item.strip().endswith(';')))] for word in line.split() if not (word.strip().startswith('<') or (word.strip().startswith('&') and word.strip().endswith(';')))])

Answered by lavr2004

A simple algorithm that works in any language, without importing modules or additional libraries. The code is self-documenting:

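def removetags_fc(data_str):
    # True while we are outside a tag and should keep characters
    appendingmode_bool = True
    output_str = ''
    for char_str in data_str:
        if char_str == '<':
            # a tag starts: stop keeping characters
            appendingmode_bool = False
        elif char_str == '>':
            # the tag ends: resume keeping, but skip the '>' itself
            appendingmode_bool = True
            continue
        if appendingmode_bool:
            output_str += char_str
    return output_str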

For a better implementation, the literals '>' and '<' should be instantiated in memory once, before the loop starts.
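
The character-by-character scan described in this answer can be sketched as follows in Python 3, with the two delimiters defined once as constants and the kept characters collected in a list that is joined at the end. The names TAG_OPEN, TAG_CLOSE and remove_tags are made up for illustration, and the sketch assumes every '<' starts a tag and every '>' ends one.

```python
TAG_OPEN = '<'
TAG_CLOSE = '>'


def remove_tags(data):
    """Drop every character from '<' through the next '>' inclusive."""
    inside_tag = False
    kept = []  # collect characters in a list and join once at the end
    for char in data:
        if char == TAG_OPEN:
            inside_tag = True
        elif char == TAG_CLOSE:
            inside_tag = False
        elif not inside_tag:
            kept.append(char)
    return ''.join(kept)
```

Like the original function, this leaves entities such as &nbsp; in place, so a separate unescaping step may still be needed.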