Python: HTML entity codes to text

Question

Asked by tghw

Does anyone know an easy way in Python to convert a string with HTML entity codes (e.g. &lt; &amp;) to a normal string (e.g. < &)?

cgi.escape() will escape strings (poorly), but there is no unescape().

Answer 1

Answered by bobince

HTMLParser has this functionality in the standard library. It is, unfortunately, undocumented:

(Python 2 Docs)

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('alpha &lt; &beta;')
u'alpha < \u03b2'

(Python 3 Docs)

>>> import html.parser
>>> h = html.parser.HTMLParser()
>>> h.unescape('alpha &lt; &beta;')
'alpha < \u03b2'
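
On Python 3.4 and later, the standard library also offers the documented function html.unescape. HTMLParser.unescape was deprecated in 3.4 and removed in 3.9, so the method call shown for Python 3 fails on current interpreters. A minimal sketch of the documented route (the variable names are only illustrative):

```python
import html

# html.unescape handles both named entities and numeric character references.
text = html.unescape('alpha &lt; &beta;')
numeric = html.unescape('&#60;&#x26;')

print(text)
print(numeric)
```

Both decimal (&#60;) and hexadecimal (&#x26;) references are decoded, so no separate lookup table is needed for this case.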

htmlentitydefs is documented, but requires you to do a lot of the work yourself.

If you only need the XML predefined entities (lt, gt, amp, quot, apos), you could use minidom to parse them. If you only need the predefined entities and no numeric character references, you could even just use a plain old string replace for speed.
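
The plain string replace for the five predefined entities might look like the sketch below. The function name unescape_xml_predefined is hypothetical, and the code assumes the input contains no numeric character references that need decoding.

```python
def unescape_xml_predefined(s):
    """Replace only the five XML predefined entities; everything else is left alone."""
    s = s.replace('&lt;', '<')
    s = s.replace('&gt;', '>')
    s = s.replace('&quot;', '"')
    s = s.replace('&apos;', "'")
    # &amp; must be handled last so that '&amp;lt;' becomes '&lt;' rather than '<'.
    return s.replace('&amp;', '&')


print(unescape_xml_predefined('1 &lt; 2 &amp;&amp; 3 &gt; 2'))
```

The order of the replacements matters: decoding &amp; first would produce new entity-like sequences that the later replacements would then decode a second time.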

Answer 2

Answered by tghw

I forgot to tag it at first, but I'm using BeautifulSoup.

Digging around in the documentation, I found:

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

This does exactly what I was hoping for.

Answer 3

Answered by Benjamin Pollack

There is nothing built into the Python stdlib to unescape HTML, but there's a short script you can tailor to your needs at http://www.w3.org/QA/2008/04/unescape-html-entities-python.html.

Answer 4

Answered by vartec

Use the htmlentitydefs module. This is my old code; it worked, but I'm sure there is a cleaner and more Pythonic way to do it:

import htmlentitydefs
e2c = dict(('&%s;' % k, eval("u'\u%04x'" % v)) for k, v in htmlentitydefs.name2codepoint.items())
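
A Python 3 sketch of the same idea, assuming the html.entities module (the Python 3 name for htmlentitydefs). The helper unescape_named is hypothetical and only covers named entities, not numeric character references:

```python
import re
from html.entities import name2codepoint

# Map each '&name;' form to its character, built with chr().
e2c = {'&%s;' % name: chr(cp) for name, cp in name2codepoint.items()}


def unescape_named(s):
    # Replace known named entities; unknown ones are left untouched.
    return re.sub(r'&\w+;', lambda m: e2c.get(m.group(0), m.group(0)), s)


print(unescape_named('alpha &lt; &beta;'))
```

Doing the substitution in a single regex pass avoids decoding the same text twice, which a chain of per-entity replacements could do.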
