Python 如何替换或删除 HTML 实体,如“ ” 使用 BeautifulSoup 4

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/15138406/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-18 13:26:45  来源:igfitidea点击:

How can I replace or remove HTML entities like " " using BeautifulSoup 4

pythonbeautifulsoup

提问by Richard Neish

I am processing HTML using Python and the BeautifulSoup 4 library and I can't find an obvious way to replace  with a space. Instead it seems to be converted to a Unicode non-breaking space character.

我正在使用 Python 和 BeautifulSoup 4 库处理 HTML,但找不到 用空格替换的明显方法。相反,它似乎被转换为 Unicode 不间断空格字符。

Am I missing something obvious? What is the best way to replace   with a normal space using BeautifulSoup?

我错过了一些明显的东西吗?更换   的最佳方法是什么?使用 BeautifulSoup 的普通空间?

Edit to add that I am using the latest version, BeautifulSoup 4, so the convertEntities=BeautifulSoup.HTML_ENTITIESoption in Beautiful Soup 3 isn't available.

编辑以添加我使用的是最新版本 BeautifulSoup 4,因此convertEntities=BeautifulSoup.HTML_ENTITIESBeautiful Soup 3 中的选项不可用。

采纳答案by Martijn Pieters

See Entitiesin the documentation. BeautifulSoup 4 produces proper Unicode for all entities:

请参阅文档中的实体。BeautifulSoup 4 为所有实体生成正确的 Unicode:

An incoming HTML or XML entity is always converted into the corresponding Unicode character.

传入的 HTML 或 XML 实体始终转换为相应的 Unicode 字符。

Yes,  is turned into a non-breaking space character. If you really want those to be space characters instead, you'll have to do a unicode replace.

是的, 变成了一个不间断的空格字符。如果您真的希望它们成为空格字符,则必须进行 unicode 替换。

回答by Fabian

>>> soup = BeautifulSoup('<div>a&nbsp;b</div>')
>>> soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
u'<html>\n <body>\n  <div>\n   a b\n  </div>\n </body>\n</html>'

回答by LancDec

You can simply replace the non-breaking space unicode with a normal space.

您可以简单地用普通空格替换不间断空格 unicode。

nonBreakSpace = u'\xa0'
soup = soup.replace(nonBreakSpace, ' ')

A benefit is that even though you are using BeautifulSoup, you do not need to.

一个好处是,即使您正在使用 BeautifulSoup,您也不需要这样做。

回答by MortenB

I had issues with json that soup.prettify() did not fix, so it worked with unicodedata.normalize():

我遇到了 json 问题,soup.prettify() 没有解决,所以它与unicodedata.normalize()一起工作:

import unicodedata
soup = BeautifulSoup(r.text, 'html.parser')
dat = soup.find('span', attrs={'class': 'date'})
print(f"date prints fine:'{dat.text}'")
print(f"json:{json.dumps(dat.text)}")
mydate = unicodedata.normalize("NFKD",dat.text)
print(f"json after normalizing:'{json.dumps(mydate)}'")
date prints fine:'03 Nov 19 17:51'
json:"03\u00a0Nov\u00a019\u00a017:51"
json after normalizing:'"03 Nov 19 17:51"'