Python: How to replace or remove HTML entities like "&nbsp;" using BeautifulSoup 4

Question

Asked by Richard Neish

I am processing HTML using Python and the BeautifulSoup 4 library and I can't find an obvious way to replace &nbsp; with a space. Instead it seems to be converted to a Unicode non-breaking space character.

Am I missing something obvious? What is the best way to replace &nbsp; with a normal space using BeautifulSoup?

Edit to add that I am using the latest version, BeautifulSoup 4, so the convertEntities=BeautifulSoup.HTML_ENTITIES option from Beautiful Soup 3 isn't available.

Answer 1

Accepted answer by Martijn Pieters

See Entities in the documentation. BeautifulSoup 4 produces proper Unicode for all entities:

An incoming HTML or XML entity is always converted into the corresponding Unicode character.

Yes, &nbsp; is turned into a non-breaking space character (U+00A0). If you really want those to be regular space characters instead, you'll have to do a Unicode replace.
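
As a minimal sketch of such a replace, assuming the text is extracted with get_text() and parsed with the built-in "html.parser" (the sample HTML and the variable names are illustrative, not from the original answer):

```python
from bs4 import BeautifulSoup

# Illustrative input; any HTML containing &nbsp; behaves the same way.
html = "<div>a&nbsp;b</div>"
soup = BeautifulSoup(html, "html.parser")

# The parser has already turned &nbsp; into the character U+00A0.
text = soup.get_text()

# Swap every non-breaking space for an ordinary space.
cleaned = text.replace("\xa0", " ")

print(repr(text))
print(repr(cleaned))
```

The replace runs on the extracted string, so the parsed tree itself still holds the non-breaking space.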

Answer 2

Answered by Fabian

>>> soup = BeautifulSoup('<div>a&nbsp;b</div>')
>>> soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
u'<html>\n <body>\n  <div>\n   a b\n  </div>\n </body>\n</html>'
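
The formatter above only affects the rendered output. If the tree itself should be changed, one possible sketch is to rewrite each text node in place. This assumes a Beautiful Soup 4 release that accepts find_all(string=True); the sample HTML is illustrative:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>a&nbsp;b <span>c&nbsp;d</span></div>", "html.parser")

# find_all(string=True) returns a list of every text node in the tree,
# so replacing nodes while looping over it is safe.
for node in soup.find_all(string=True):
    if "\xa0" in node:
        node.replace_with(node.replace("\xa0", " "))

result = soup.get_text()
print(repr(result))
```

After the loop, both get_text() and str(soup) are free of non-breaking spaces.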

Answer 3

Answered by LancDec

You can simply replace the non-breaking space unicode with a normal space.

nonBreakSpace = u'\xa0'
# text must be a plain string (e.g. the raw HTML or soup.get_text()), not a BeautifulSoup object
text = text.replace(nonBreakSpace, ' ')

A benefit is that this is a plain string operation, so it works whether or not you are using BeautifulSoup.

Answer 4

Answered by MortenB

I had issues with JSON output that soup.prettify() did not fix. What worked for me was unicodedata.normalize():

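import json
import unicodedata
from bs4 import BeautifulSoup

# r is assumed to be an HTTP response fetched earlier (e.g. with requests.get)
soup = BeautifulSoup(r.text, 'html.parser')
dat = soup.find('span', attrs={'class': 'date'})
print(f"date prints fine:'{dat.text}'")
print(f"json:{json.dumps(dat.text)}")
# NFKD maps the non-breaking space (U+00A0) to a regular space
mydate = unicodedata.normalize("NFKD", dat.text)
print(f"json after normalizing:'{json.dumps(mydate)}'")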

date prints fine:'03 Nov 19 17:51'
json:"03\u00a0Nov\u00a019\u00a017:51"
json after normalizing:'"03 Nov 19 17:51"'
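
One caveat worth illustrating: NFKD normalization rewrites more than non-breaking spaces, because it also decomposes accented and other compatibility characters. The sketch below compares it with a targeted replace on a made-up sample string; which behaviour is preferable depends on the data.

```python
import unicodedata

# Made-up sample: an accented letter followed by a non-breaking space.
sample = "caf\u00e9\u00a0bar"

# NFKD turns U+00A0 into a space, but also splits "é" into "e" + U+0301.
normalized = unicodedata.normalize("NFKD", sample)

# A targeted replace touches only the non-breaking space.
replaced = sample.replace("\u00a0", " ")

print(ascii(normalized))
print(ascii(replaced))
```

If only the spaces should change, the targeted replace leaves the rest of the text byte-for-byte intact.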
