Using Python, remove HTML tags/formatting from a string
Original question: http://stackoverflow.com/questions/3398852/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Asked by Blankman
I have a string that contains HTML markup such as links, bold text, etc.
I want to strip all the tags so that I am left with just the raw text.
What's the best way to do this? Regex?
Answered by John Howard
If you are going to use regex:
import re

def striphtml(data):
    # Non-greedy match: everything from a '<' up to the next '>'
    p = re.compile(r'<.*?>')
    return p.sub('', data)

>>> striphtml('<a href="foo.com" class="bar">I Want This <b>text!</b></a>')
'I Want This text!'
Answered by snurre
Depending on whether the text itself can contain '>' or '<', I would either just write a function that removes everything between those characters, or use a parsing library.
def cleanString(inStr):
    # Repeatedly cut out the first '<...>' span until no complete tag is left
    while True:
        a = inStr.find('<')
        b = inStr.find('>', a + 1)
        if a < 0 or b < 0:
            return inStr
        inStr = inStr[:a] + inStr[b + 1:]
Answered by volting
AFAIK, using regex to parse HTML is a bad idea; you would be better off using an HTML/XML parser such as Beautiful Soup.
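To make the concern about regex concrete, here is a small sketch that reuses the `<.*?>` pattern from the regex answer on two inputs. The sample strings are made up for illustration; the second one has a '>' inside an attribute value, which the pattern mistakes for the end of the tag.

```python
import re

# Same non-greedy pattern as in the regex answer
TAG_RE = re.compile(r'<.*?>')

def striphtml(data):
    """Remove anything that looks like '<...>' from data."""
    return TAG_RE.sub('', data)

# Simple markup: the pattern behaves as hoped
plain = striphtml('<b>bold</b> and <i>italic</i>')

# Hypothetical input with '>' inside an attribute value:
# the match stops at the first '>', so part of the tag leaks into the text
tricky = striphtml('<a title="a > b">link</a>')

print(repr(plain))   # 'bold and italic'
print(repr(tricky))  # ' b">link'
```

A real parser tracks quoting and tag boundaries, which is why it tends to cope better with input like the second string.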
Answered by Wai Yip Tung
Use SGMLParser. Regex works in simple cases, but HTML has a lot of intricacies that you would rather not have to deal with.
>>> from sgmllib import SGMLParser
>>>
>>> class TextExtracter(SGMLParser):
...     def __init__(self):
...         self.text = []
...         SGMLParser.__init__(self)
...     def handle_data(self, data):
...         self.text.append(data)
...     def getvalue(self):
...         return ''.join(self.text)
...
>>> ex = TextExtracter()
>>> ex.feed('<html>hello > world</html>')
>>> ex.getvalue()
'hello > world'
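The `sgmllib` module only exists in Python 2. On Python 3, roughly the same idea can be sketched with the standard library's `html.parser.HTMLParser`. The names `TextExtractor` and `strip_tags` below are made up for this example and are not part of the original answer.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        self.parts.append(data)

    def get_text(self):
        return ''.join(self.parts)

def strip_tags(markup):
    """Return the text content of markup with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text that is still buffered
    return parser.get_text()

print(strip_tags('<a href="foo.com" class="bar">I Want This <b>text!</b></a>'))
```

As with the SGMLParser version, a stray '>' in the text is kept as data rather than treated as part of a tag.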
Answered by Tim McNamara
Use lxml.html. It's much faster than BeautifulSoup, and getting the raw text takes a single call.
>>> import lxml.html
>>> page = lxml.html.document_fromstring('<!DOCTYPE html>...</html>')
>>> page.cssselect('body')[0].text_content()
'...'

