Using Python, remove HTML tags/formatting from a string

Question

Asked by Blankman

I have a string that contains HTML markup such as links, bold text, etc.

I want to strip all the tags so that I am left with just the raw text.

What's the best way to do this? A regex?

Answer 1

Answered by John Howard

If you are going to use a regex:

import re
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

>>> striphtml('<a href="foo.com" class="bar">I Want This <b>text!</b></a>')
'I Want This text!'
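
The pattern above has two limits worth knowing: `.` does not match newlines by default, so a tag split across lines is left in place, and entities such as `&amp;` are not decoded. Below is a minimal sketch of one way to handle both with the standard library only. The name `strip_tags_regex` is hypothetical, and this still does not cope with `<` or `>` inside attribute values, comments, or script blocks.

```python
import html
import re

# re.DOTALL lets '.' match newlines, so tags spanning several lines are removed too.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def strip_tags_regex(data):
    """Remove anything that looks like a tag, then decode HTML entities."""
    without_tags = TAG_RE.sub('', data)
    return html.unescape(without_tags)

print(strip_tags_regex('<a\nhref="foo.com">I Want This <b>text!</b> &amp; more</a>'))
# prints: I Want This text! & more
```

Decoding entities after removing tags means an escaped `&lt;b&gt;` in the input ends up as literal `<b>` text in the output rather than being stripped.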

Answer 2

Answered by snurre

Depending on whether the text itself can contain '>' or '<', I would either write a function that removes everything between those characters, or use a parsing library.

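def cleanString(inStr):
    # Find the first '<' and the first '>' that follows it.
    a = inStr.find('<')
    b = inStr.find('>', a + 1)
    # No complete '<...>' pair left: nothing more to remove.
    if a < 0 or b < 0:
        return inStr
    # Drop the tag and keep cleaning the rest of the string.
    return cleanString(inStr[:a] + inStr[b + 1:])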

Answer 3

Answered by volting

AFAIK, using a regex to parse HTML is a bad idea; you would be better off using an HTML/XML parser such as Beautiful Soup.

Answer 4

Answered by Wai Yip Tung

Use SGMLParser. A regex works in simple cases, but HTML has a lot of intricacies that you would rather not deal with yourself.

>>> from sgmllib import SGMLParser
>>>
>>> class TextExtracter(SGMLParser):
...     def __init__(self):
...         self.text = []
...         SGMLParser.__init__(self)
...     def handle_data(self, data):
...         self.text.append(data)
...     def getvalue(self):
...         return ''.join(self.text)
...
>>> ex = TextExtracter()
>>> ex.feed('<html>hello &gt; world</html>')
>>> ex.getvalue()
'hello > world'
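
Note that `sgmllib` was removed in Python 3. A rough equivalent, assuming Python 3 and the standard library's `html.parser.HTMLParser`, might look like the sketch below. The class name `TextExtractor` and the helper `strip_tags` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes seen while parsing."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities such as &gt;
        # before handle_data is called.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def getvalue(self):
        return ''.join(self.parts)

def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.getvalue()

print(strip_tags('<html>hello &gt; world</html>'))
# prints: hello > world
```

As written, this keeps the contents of `<script>` and `<style>` elements, since the parser reports them as data too; filtering those out would need extra state in `handle_starttag` and `handle_endtag`.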

Answer 5

Answered by Tim McNamara

Use lxml.html. It's much faster than BeautifulSoup, and getting the raw text takes a single call.

>>> import lxml.html
>>> page = lxml.html.document_fromstring('<!DOCTYPE html>...</html>')
>>> page.cssselect('body')[0].text_content()
'...'
