Disclaimer: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/24566630/

Time: 2020-08-29 02:11:19  Source: igfitidea

Convert io.BytesIO to io.StringIO to parse HTML page

Tags: html, beautifulsoup, pycurl, stringio, type-conversion

Asked by Shipra

I'm trying to parse an HTML page I retrieved through pyCurl, but the pyCurl WRITEFUNCTION returns the page as bytes rather than a string, so I'm unable to parse it using BeautifulSoup.

Is there any way to convert io.BytesIO to io.StringIO?

Or is there any other way to parse the HTML page?

I'm using Python 3.3.2.

Accepted answer by Anthony Sottile

A naive approach:

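# Assume bytes_io is an `io.BytesIO` object
# Note: read() starts at the current stream position; if the buffer was just
# written to, call bytes_io.seek(0) first or use bytes_io.getvalue() instead
byte_str = bytes_io.read()

# Decode the bytes to a str (a "unicode" object in Python 2 terms)
text_obj = byte_str.decode('UTF-8')  # Or use the encoding you expect

# Use text_obj how you see fit!
# io.StringIO(text_obj) will get you a StringIO object if that's what you need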
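
A self-contained sketch of the approach above, runnable without pyCurl. The HTML snippet and the names `html_text` and `string_io` are made up for illustration; in the question's setting the buffer would be filled by the download callback instead. The sketch uses `getvalue()` because the stream position sits at the end of the buffer right after writing.

```python
import io

# Hypothetical stand-in for a buffer filled by a download callback
bytes_io = io.BytesIO()
bytes_io.write('<html><body><p>café</p></body></html>'.encode('utf-8'))

# getvalue() returns the whole buffer regardless of the stream position
html_text = bytes_io.getvalue().decode('utf-8')

# Wrap the decoded text in a StringIO if a text stream is required
string_io = io.StringIO(html_text)
print(string_io.read())
```

Once decoded, `html_text` is an ordinary `str` and can be handed to any parser that expects text.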

Answer by kakarukeys

The code in the accepted answer reads the entire stream into memory before decoding it. Below is a better way: wrap one stream in another, so that the data can be decoded and read chunk by chunk.

import io

# Initialize a read buffer
byte_stream = io.BytesIO(
    b'Initial value for read buffer with unicode characters ' +
    'á?ê'.encode('utf-8')
)
# Wrap the byte stream so that reads return decoded text
wrapper = io.TextIOWrapper(byte_stream, encoding='utf-8')

# Read from the buffer
print(wrapper.read())
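
To illustrate the chunk-by-chunk reading this answer refers to, here is a minimal sketch. The sample string and the chunk size of 4 are arbitrary assumptions. Note that `TextIOWrapper.read(n)` counts characters, not bytes, so a multi-byte UTF-8 character is never split across two chunks.

```python
import io

# Hypothetical sample data; 'é' and 'ö' each take two bytes in UTF-8
sample = 'héllo wörld'
byte_stream = io.BytesIO(sample.encode('utf-8'))
text_stream = io.TextIOWrapper(byte_stream, encoding='utf-8')

chunks = []
while True:
    chunk = text_stream.read(4)  # read up to 4 characters at a time
    if not chunk:  # an empty string signals end of stream
        break
    chunks.append(chunk)

print(chunks)
```

Joining the chunks gives back the original text, while only a small piece is decoded per `read` call.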