Python TypeError: can't use a string pattern on a bytes-like object in re.findall()
Original question: http://stackoverflow.com/questions/31019854/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
TypeError: can't use a string pattern on a bytes-like object in re.findall()
Asked by Inspired_Blue
I am trying to learn how to automatically fetch urls from a page. In the following code I am trying to get the title of the webpage:
import urllib.request
import re

url = "http://www.google.com"
regex = r'<title>(.+?)</title>'
pattern = re.compile(regex)

with urllib.request.urlopen(url) as response:
    html = response.read()

title = re.findall(pattern, html)
print(title)
And I get this unexpected error:
Traceback (most recent call last):
  File "path\to\file\Crawler.py", line 11, in <module>
    title = re.findall(pattern, html)
  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
What am I doing wrong?
Accepted answer by rocky
You want to convert html (a bytes-like object) into a string using .decode, e.g. html = response.read().decode('utf-8').
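As a minimal sketch of that fix, the snippet below decodes the bytes before searching them. To keep it runnable without network access, it uses a hard-coded bytes literal (a hypothetical sample, not real page content) in place of response.read(), and it assumes the page is UTF-8 encoded.

```python
import re

# Hypothetical stand-in for the bytes that response.read() would return.
html_bytes = b"<html><head><title>Example Page</title></head></html>"

# A str pattern, as in the question.
pattern = re.compile(r"<title>(.+?)</title>")

# Decode bytes -> str first (assuming UTF-8), then search the str.
html = html_bytes.decode("utf-8")
titles = pattern.findall(html)
print(titles)
```

The pattern and the searched text are now both str, so re.findall() no longer raises a TypeError.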
Answer by Aran-Fey
The problem is that your regex is a string, but html is bytes:
>>> type(html)
<class 'bytes'>
Since Python doesn't know how those bytes are encoded, it raises a TypeError when you try to use a string regex on them.
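The mismatch can be reproduced without any network access. The sketch below uses a small made-up bytes literal and shows that the re module rejects mixed types in both directions: a str pattern on bytes, and a bytes pattern on str. The variable names are illustrative only, and the exact wording of the messages may differ between Python versions.

```python
import re

data = b"<title>Example</title>"  # hypothetical sample input

str_on_bytes = None
bytes_on_str = None

# str pattern applied to bytes: this is the error from the question.
try:
    re.findall(r"<title>(.+?)</title>", data)
except TypeError as exc:
    str_on_bytes = str(exc)

# bytes pattern applied to str: the mirror-image mistake fails too.
try:
    re.findall(rb"<title>(.+?)</title>", data.decode("ascii"))
except TypeError as exc:
    bytes_on_str = str(exc)

print(str_on_bytes)
print(bytes_on_str)
```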
You can either decode the bytes to a string:
html = html.decode('ISO-8859-1') # encoding may vary!
title = re.findall(pattern, html) # no more error
Or use a bytes regex:
regex = rb'<title>(.+?)</title>'
#        ^
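A short sketch of the bytes-regex route follows. It again uses a hypothetical bytes literal instead of a real response. Note that with a bytes pattern the matches themselves come back as bytes, so a decode step is still needed if you want str results; UTF-8 is assumed here.

```python
import re

# Hypothetical stand-in for the bytes that response.read() would return.
html_bytes = b"<html><head><title>Example Page</title></head></html>"

# The rb prefix makes the pattern bytes, so it can search bytes directly.
pattern = re.compile(rb"<title>(.+?)</title>")
raw_titles = pattern.findall(html_bytes)  # a list of bytes objects

# Decode each match separately if str is wanted (assuming UTF-8).
titles = [t.decode("utf-8") for t in raw_titles]
print(raw_titles, titles)
```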
In this particular context, you can get the encoding from the response headers:
with urllib.request.urlopen(url) as response:
    encoding = response.info().get_param('charset', 'utf8')
    html = response.read().decode(encoding)
See the urlopen documentation for more details.
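To see how the charset lookup behaves without making a request, the sketch below builds the headers by hand with email.message.Message, the base class of the header object that response.info() returns. The header values and body are made up for illustration. The second case shows the 'utf8' fallback when the Content-Type header carries no charset parameter.

```python
from email.message import Message

# Hypothetical headers with an explicit charset.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
encoding = headers.get_param("charset", "utf8")

# Hypothetical body, encoded the way the header claims.
body = "<title>Caf\u00e9</title>".encode("iso-8859-1")
html = body.decode(encoding)

# Hypothetical headers without a charset: the default is returned.
plain_headers = Message()
plain_headers["Content-Type"] = "text/html"
fallback = plain_headers.get_param("charset", "utf8")

print(encoding, html, fallback)
```

Falling back to UTF-8 is only a guess. If the server sends no charset and the page uses a different encoding, decode() may raise UnicodeDecodeError or produce garbled text.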