HTML Entity Codes to Text (Python)
Note: this page is a Chinese-English side-by-side translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverFlow
Original source: http://stackoverflow.com/questions/663058/
Asked by tghw
Does anyone know an easy way in Python to convert a string with HTML entity codes (e.g. &lt; &amp;) to a normal string (e.g. < &)?
cgi.escape() will escape strings (poorly), but there is no unescape().
Answered by bobince
HTMLParser has the functionality in the standard library. It is, unfortunately, undocumented:
(Python 2 Docs)
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('alpha &lt; &beta;')
u'alpha < \u03b2'
(Python 3 Docs)
>>> import html.parser
>>> h = html.parser.HTMLParser()
>>> # HTMLParser.unescape() was deprecated in Python 3.4 and removed in 3.9
>>> h.unescape('alpha &lt; &beta;')
'alpha < β'
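On newer Python 3 releases, the documented way to do the same thing is the module-level html.unescape() function. A minimal sketch, assuming Python 3.4 or later; the sample string and the variable name text are only for illustration:

```python
import html

# html.unescape handles named entities as well as decimal and
# hexadecimal numeric character references.
text = html.unescape("alpha &lt; &beta; &#946; &#x3b2;")
print(text)  # alpha < β β β
```

Unlike the HTMLParser method, this needs no parser instance.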
htmlentitydefs is documented, but requires you to do a lot of the work yourself.
If you only need the XML predefined entities (lt, gt, amp, quot, apos), you could use minidom to parse them. If you only need the predefined entities and no numeric character references, you could even just use a plain old string replace for speed.
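The plain string replace idea could look roughly like this. It is only a sketch: the function name unescape_predefined is made up for illustration, it covers just the five XML predefined entities, and it ignores numeric character references entirely.

```python
def unescape_predefined(text):
    """Replace the five XML predefined entities with their characters."""
    for entity, char in (
        ("&lt;", "<"),
        ("&gt;", ">"),
        ("&quot;", '"'),
        ("&apos;", "'"),
    ):
        text = text.replace(entity, char)
    # "&amp;" must be handled last, otherwise "&amp;lt;" would be
    # unescaped twice and end up as "<" instead of "&lt;".
    return text.replace("&amp;", "&")


print(unescape_predefined("a &lt; b &amp;&amp; c"))  # a < b && c
```

The order matters: replacing &amp; first would turn already-escaped text such as &amp;lt; into a bare <.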
Answered by tghw
I forgot to tag it at first, but I'm using BeautifulSoup.
Digging around in the documentation, I found:
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
This does exactly what I was hoping for.
Answered by Benjamin Pollack
There is nothing built into the Python stdlib to unescape HTML, but there's a short script you can tailor to your needs at http://www.w3.org/QA/2008/04/unescape-html-entities-python.html.
Answered by vartec
Use the htmlentitydefs module. This is my old code. It worked, but I'm sure there is a cleaner and more Pythonic way to do it:
import htmlentitydefs  # Python 2 module name
e2c = dict(('&%s;' % k, unichr(v)) for k, v in htmlentitydefs.name2codepoint.items())  # unichr() avoids eval()
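For Python 3, where the module is named html.entities, the same lookup table might be built and applied as below. This is a sketch rather than the answer's original code: the helper replace_named_entities and the regular expression are assumptions added for illustration, and only named entities are handled, not numeric references.

```python
import re
from html.entities import name2codepoint

# Map "&name;" to the character it stands for, e.g. "&lt;" -> "<".
e2c = {"&%s;" % name: chr(cp) for name, cp in name2codepoint.items()}

# Candidate named entities: "&", word characters, then ";".
_ENTITY_RE = re.compile(r"&\w+;")


def replace_named_entities(text):
    """Replace known named entities in one pass; leave unknown ones as-is."""
    return _ENTITY_RE.sub(lambda m: e2c.get(m.group(0), m.group(0)), text)


print(replace_named_entities("alpha &lt; &beta;"))  # alpha < β
```

Doing the substitution in a single regex pass, instead of looping over the table with str.replace, keeps text like &amp;lt; from being unescaped twice.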