Python: how to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup

Note: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/20205455/

Date: 2020-08-18 19:52:55  Source: igfitidea

How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

Tags: python, unicode, utf-8, beautifulsoup, urllib2

Asked by Christopher Orr

I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup.

However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.

Sample program:

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

# Parse with BeautifulSoup
soup = BeautifulSoup(response)

# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])

Running this gives the result:

# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'

But I would expect a Python Unicode string to render ö in the word können as \xf6:

# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'
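
For reference, the escapes \u0102 and \u015b in the unexpected output are consistent with the UTF-8 bytes for ö (0xc3 0xb6) having been decoded as ISO-8859-2 rather than UTF-8. The following is only a minimal sketch of that effect on a hard-coded string; it does not fetch the page, and the variable names are illustrative.

```python
# UTF-8 encoding of the word; the umlaut becomes the two bytes 0xc3 0xb6.
raw = u'k\xf6nnen'.encode('utf-8')

# Decoding with the matching codec recovers the expected string.
as_utf8 = raw.decode('utf-8')

# Decoding the same bytes as ISO-8859-2 maps 0xc3 and 0xb6 to
# U+0102 and U+015B, the two code points in the unexpected output.
as_latin2 = raw.decode('iso-8859-2')

print(repr(as_utf8))
print(repr(as_latin2))
```

This suggests the bytes on the wire are fine and that the wrong codec is being chosen somewhere during parsing.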

I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and tried to read() and decode() the response object, but it either makes no difference or throws an error.

With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded (i.e. it contains 0xc3 0xb6 for the ö character):

      20 74 69 74 6c 65 3d 22  48 69 65 72 20 6b c3 b6  | title="Hier k..|
      6e 6e 65 6e 20 53 69 65  20 73 69 63 68 20 6b 6f  |nnen Sie sich ko|
      73 74 65 6e 6c 6f 73 20  72 65 67 69 73 74 72 69  |stenlos registri|
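
The same check can be done from Python by decoding the bytes copied from the dump. This is a small sketch that assumes Python 3 (for bytes.fromhex) and uses only the sixteen bytes starting at "Hier"; the variable names are illustrative.

```python
# Bytes copied from the hexdump above, starting at "Hier".
dump = "48 69 65 72 20 6b c3 b6 6e 6e 65 6e 20 53 69 65"

# bytes.fromhex() skips the spaces between the byte pairs.
raw = bytes.fromhex(dump)

# Strict UTF-8 decoding succeeds, so this part of the page is valid UTF-8.
text = raw.decode("utf-8")
print(text)
```

Sixteen bytes decode to fifteen characters because 0xc3 0xb6 forms the single character ö.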

I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?

Accepted answer by Christopher Orr

As justhalf points out above, my question here is essentially a duplicate of this question.

The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 characters.

This apparently confuses BeautifulSoup about which encoding is in use. When I tried to decode the content as UTF-8 first, before passing it to BeautifulSoup, like this:

soup = BeautifulSoup(response.read().decode('utf-8'))

I would get the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: 
                    invalid continuation byte

Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.

As the currently highest-rated answer on that question suggests, the invalid UTF-8 characters can be removed while parsing, so that only valid data is passed to BeautifulSoup:

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
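
To make the failure and the workaround concrete without fetching the page, here is a minimal sketch on a made-up byte string in which 0xc3 0x9c has been corrupted to 0xe3 0x9c. The sample text and variable names are assumptions for illustration only. Note that 'ignore' silently drops the damaged character; 'replace' is an alternative that leaves a U+FFFD marker in its place.

```python
# Hypothetical sample: valid UTF-8 for u'\xdcber k\xf6nnen' ...
good = u'\xdcber k\xf6nnen'.encode('utf-8')

# ... with the lead byte of the first character corrupted (0xc3 -> 0xe3).
bad = good.replace(b'\xc3\x9c', b'\xe3\x9c')

# Strict decoding fails, as in the traceback above.
error = None
try:
    bad.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc

# 'ignore' drops the undecodable bytes; the rest decodes normally.
ignored = bad.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = bad.decode('utf-8', 'replace')

print(error)
print(repr(ignored))
print(repr(replaced))
```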

Answer by Birei

Encoding the result to utf-8 seems to work for me:

print (soup.find('div', id='navbutton_account')['title']).encode('utf-8')

It yields:

Hier können Sie sich kostenlos registrieren und / oder einloggen!
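
The same idea applies when writing the text to a file instead of printing it: choose the output encoding explicitly rather than relying on a default. A minimal sketch, assuming the title has already been extracted correctly as a Unicode string; the temporary path and the file name title.txt are hypothetical.

```python
import io
import os
import tempfile

# Assumed to be the correctly decoded title from the page.
title = u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'

# Hypothetical output location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'title.txt')

# io.open() with an explicit encoding behaves the same on Python 2 and 3.
with io.open(path, 'w', encoding='utf-8') as fh:
    fh.write(title)

# Read the file back as raw bytes to confirm what was written.
with io.open(path, 'rb') as fh:
    written = fh.read()

print(repr(written))
```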