Convert io.BytesIO to io.StringIO to parse an HTML page

Question

Asked by Shipra

I'm trying to parse an HTML page I retrieved through pyCurl, but the pyCurl WRITEFUNCTION returns the page as bytes rather than a string, so I'm unable to parse it with BeautifulSoup.

Is there any way to convert io.BytesIO to io.StringIO?

Or is there any other way to parse the HTML page?

I'm using Python 3.3.2.

Answer 1

Accepted answer by Anthony Sottile

A naive approach:

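# Assume bytes_io is a `BytesIO` object that pyCurl has already written to.
# getvalue() returns the whole buffer regardless of the stream position;
# read() would return b'' here unless you call bytes_io.seek(0) first.
byte_str = bytes_io.getvalue()

# Decode the bytes to a str (a "unicode" object in Python 2 terms)
text_obj = byte_str.decode('UTF-8')  # Or use the encoding you expect

# Use text_obj how you see fit!
# io.StringIO(text_obj) will get you to a StringIO object if that's what you need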
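
To make the decode step concrete, here is a minimal, stdlib-only sketch. It does not use pyCurl: the `chunks` list is a hypothetical stand-in for the byte chunks that pyCurl would hand to WRITEFUNCTION, and the HTML content is made up for illustration.

```python
import io

# Hypothetical stand-in for the chunks pyCurl would pass to WRITEFUNCTION.
# The split falls just before a multi-byte UTF-8 character on purpose.
chunks = [b"<html><body><p>caf", b"\xc3\xa9</p></body></html>"]

bytes_io = io.BytesIO()
for chunk in chunks:
    bytes_io.write(chunk)  # what a WRITEFUNCTION bound to bytes_io.write does

# After writing, the stream position is at the end, so read() finds nothing.
leftover = bytes_io.read()

# getvalue() returns the full contents regardless of the stream position.
html_text = bytes_io.getvalue().decode("utf-8")

# Wrap in a StringIO only if a text file-like object is actually required.
string_io = io.StringIO(html_text)
```

The resulting `html_text` is an ordinary `str` and can be passed to the parser. As far as I know, BeautifulSoup also accepts bytes directly and tries to detect the encoding itself, so decoding by hand is mainly useful when you already know the encoding.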

Answer 2

Answer by kakarukeys

The code in the accepted answer reads the entire stream into memory before decoding it. A better way is to wrap one stream in another, so that the data can be decoded and read chunk by chunk.

import io

# Initialize a read buffer with some UTF-8 encoded bytes
buffer = io.BytesIO(
    b'Initial value for read buffer with unicode characters ' +
    'á?ê'.encode('utf-8')
)
# Wrap the byte stream so that reads decode to text on the fly
wrapper = io.TextIOWrapper(buffer, encoding='utf-8')

# Read from the buffer
print(wrapper.read())
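
The chunk-by-chunk reading mentioned above could look roughly like the sketch below. The payload and the chunk size of 5 are arbitrary choices for illustration. Note the `seek(0)`: a buffer that has just been written to must be rewound before wrapping it, otherwise the wrapper starts at the end and reads nothing.

```python
import io

# Hypothetical payload containing multi-byte UTF-8 characters.
raw = io.BytesIO()
raw.write("Héllo wörld, ".encode("utf-8") * 3)
raw.seek(0)  # rewind so the wrapper starts reading from the beginning

wrapper = io.TextIOWrapper(raw, encoding="utf-8")

pieces = []
while True:
    piece = wrapper.read(5)  # at most 5 characters (not bytes) per call
    if not piece:
        break
    pieces.append(piece)

text = "".join(pieces)

# detach() hands the BytesIO back, so discarding the wrapper does not close it.
raw = wrapper.detach()
```

Because `TextIOWrapper` decodes incrementally, a character whose bytes straddle an internal read boundary is still returned whole. Closing the wrapper also closes the underlying `BytesIO`, which is why the sketch calls `detach()` when the byte buffer is still needed afterwards.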
