Python: how to replace or remove HTML entities such as "&nbsp;" using BeautifulSoup 4
Original question: http://stackoverflow.com/questions/15138406/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How can I replace or remove HTML entities like "&nbsp;" using BeautifulSoup 4?
Asked by Richard Neish
I am processing HTML using Python and the BeautifulSoup 4 library, and I can't find an obvious way to replace &nbsp; with a space. Instead, it seems to be converted to a Unicode non-breaking space character.
Am I missing something obvious? What is the best way to replace &nbsp; with a normal space using BeautifulSoup?
Edit to add that I am using the latest version, BeautifulSoup 4, so the convertEntities=BeautifulSoup.HTML_ENTITIES option from Beautiful Soup 3 isn't available.
Accepted answer by Martijn Pieters
See Entities in the documentation. BeautifulSoup 4 produces proper Unicode for all entities:
An incoming HTML or XML entity is always converted into the corresponding Unicode character.
Yes, &nbsp; is turned into a non-breaking space character (U+00A0). If you really want those to be ordinary space characters instead, you'll have to do a Unicode string replace.
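A minimal sketch of that Unicode replace, assuming Python 3, the built-in `html.parser` backend, and a made-up sample fragment. It shows that `&nbsp;` arrives as U+00A0 and that an ordinary `str.replace` turns it into a normal space:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<p>a&nbsp;b</p>"

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()

# BeautifulSoup has already decoded &nbsp; to U+00A0 (non-breaking space).
has_nbsp = "\xa0" in text

# A plain string replacement turns it into an ordinary space.
cleaned = text.replace("\xa0", " ")
print(repr(text), repr(cleaned))
```

The replacement runs on the extracted text, so the parsed tree itself is left unchanged.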
Answer by Fabian
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div>a&nbsp;b</div>')
>>> soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
u'<html>\n <body>\n <div>\n a b\n </div>\n </body>\n</html>'
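The formatter above only affects the serialized output. If the goal is to change the parsed tree itself, one possible approach (a sketch that is not part of the original answer, assuming Python 3, the `html.parser` backend, and a made-up sample fragment) is to rewrite each text node in place:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
soup = BeautifulSoup("<div>a&nbsp;b <b>c&nbsp;d</b></div>", "html.parser")

# find_all returns a list of text nodes, so replacing nodes while looping is safe.
for node in soup.find_all(string=True):
    if "\xa0" in node:
        node.replace_with(node.replace("\xa0", " "))

result = str(soup)
print(result)
```

After the loop, any later call such as `get_text()` or `prettify()` sees ordinary spaces without needing a custom formatter.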
Answer by LancDec
You can simply replace the non-breaking space Unicode character with a normal space.
# soup is assumed here to be a plain string (e.g. the page text), not a BeautifulSoup object
nonBreakSpace = u'\xa0'
soup = soup.replace(nonBreakSpace, ' ')
A benefit is that this works on any plain string, so although you are using BeautifulSoup, you do not need it for this step.
Answer by MortenB
I had issues with JSON output that soup.prettify() did not fix; unicodedata.normalize() solved it:
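import json
import unicodedata
from bs4 import BeautifulSoup

# r is assumed to be an HTTP response object fetched earlier (e.g. with requests)
soup = BeautifulSoup(r.text, 'html.parser')
dat = soup.find('span', attrs={'class': 'date'})
print(f"date prints fine:'{dat.text}'")
print(f"json:{json.dumps(dat.text)}")
# NFKD normalization maps the non-breaking space U+00A0 to a plain space
mydate = unicodedata.normalize("NFKD", dat.text)
print(f"json after normalizing:'{json.dumps(mydate)}'")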
date prints fine:'03 Nov 19 17:51'
json:"03\u00a0Nov\u00a019\u00a017:51"
json after normalizing:'"03 Nov 19 17:51"'
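One thing to keep in mind with this approach: NFKD normalization is not limited to non-breaking spaces. The sketch below, using a made-up sample string and only the standard library, compares it with a targeted replace:

```python
import unicodedata

# Hypothetical sample: a non-breaking space, a superscript two, and an accented e.
sample = "03\u00a0Nov \u00b2 caf\u00e9"

# NFKD maps U+00A0 to a plain space, but it also rewrites other characters:
# superscript two becomes "2" and the accented e is split into "e" plus a combining accent.
normalized = unicodedata.normalize("NFKD", sample)

# A targeted replace touches only the non-breaking space.
replaced = sample.replace("\u00a0", " ")

print(repr(normalized))
print(repr(replaced))
```

If only the non-breaking space needs to change, the targeted replace is the more conservative choice; normalization is convenient when broader compatibility folding is acceptable.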

