Python UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

Question

Asked by user1063287

I'm learning about urllib2 and Beautiful Soup and on first tests am getting errors like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)

There seem to be lots of posts about this type of error, and I have tried the solutions I can understand, but they all seem to involve catch-22s, e.g.:

I want to print post.text (where text is a Beautiful Soup attribute that just returns the text). Both str(post.text) and post.text produce the Unicode errors (on things like right apostrophes ’ and ellipses …).

So I add post = unicode(post) above str(post.text), and then I get:

AttributeError: 'unicode' object has no attribute 'text'

I also tried (post.text).encode() and (post.text).renderContents(). The latter produced the error:

AttributeError: 'unicode' object has no attribute 'renderContents'

and then I tried str(post.text).renderContents() and got the error:

AttributeError: 'str' object has no attribute 'renderContents'

It would be great if I could just declare somewhere at the top of the document "make this content interpretable" and still have access to the required text function.

Update, after suggestions:

If I add post = post.decode("utf-8") above str(post.text) I get:

TypeError: unsupported operand type(s) for -: 'str' and 'int'

If I add post = post.decode() above str(post.text) I get:

AttributeError: 'unicode' object has no attribute 'text'

If I add post = post.encode("utf-8") above (post.text) I get:

AttributeError: 'str' object has no attribute 'text'
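
The AttributeError results above share one cause: each assignment rebinds post to a plain string, and strings have no text attribute. A minimal sketch of that effect, using a hypothetical stand-in class `FakePost` instead of a real Beautiful Soup object so that it runs without the library (the class and variable names are assumptions for illustration only):

```python
class FakePost(object):
    """Hypothetical stand-in for a parsed tag; only the .text attribute is modelled."""
    def __init__(self, text):
        self.text = text

post = FakePost(u"Read more\u2026")

# Convert only the text and keep `post` bound to the original object.
text_bytes = post.text.encode("utf-8")
post_still_has_text = hasattr(post, "text")

# Rebinding `post` to a string loses the attribute, as in the errors above.
post = post.text
post_lost_text = not hasattr(post, "text")
```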

I tried print post.text.encode('utf-8') and got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)
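
A possible explanation, offered as an assumption because the full print statement is not shown: in Python 2, combining a UTF-8 byte string with a unicode string makes the interpreter decode the bytes as ASCII first, and 0xe2 is the first byte of the UTF-8 encoding of u'\u2026'. The sketch below performs that decode step explicitly and hits the same kind of error; the variable names are illustrative.

```python
# U+2026 encodes to three UTF-8 bytes, the first of which is 0xe2.
utf8_bytes = u"\u2026".encode("utf-8")

# Decoding those bytes as ASCII fails, because 0xe2 is outside the 0-127 range.
try:
    utf8_bytes.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)
```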

And for the sake of trying things that might work, I installed lxml for Windows from here and implemented it with:

parsed_content = BeautifulSoup(original_content, "lxml")

according to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.

These steps didn't seem to make a difference.

I'm using Python 2.7.4 and Beautiful Soup 4.

Solution:

After getting a deeper understanding of unicode, UTF-8 and the Beautiful Soup types, I found that the problem had to do with how I was printing. I removed all my str() calls and concatenations, e.g. str(something) + post.text + str(something_else), so that the print statement became something, post.text, something_else. It now seems to print well, except that I have less control over the formatting at this stage (e.g. a space is inserted at each comma).
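
To regain control over the formatting, one option is to build the whole line as a single unicode string and encode it once at the point of output. A minimal sketch, assuming a hypothetical helper `format_line` and UTF-8 output (neither comes from the original post):

```python
def format_line(prefix, text, suffix):
    """Join the pieces as unicode, then encode once to UTF-8 bytes."""
    line = u"{0}{1}{2}".format(prefix, text, suffix)
    return line.encode("utf-8")

# No spaces are inserted between the pieces, unlike `print a, b, c`.
line_bytes = format_line(u"[", u"Read more\u2026", u"]")
```

In Python 2 the resulting byte string can be passed to print directly; whether it displays correctly still depends on the encoding the console expects.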

Answer 1

Accepted answer by icktoofay

In Python 2, unicode objects can only be printed if they can be converted to ASCII. If a string can't be encoded in ASCII, you'll get that error. You probably want to encode it explicitly and then print the resulting str:

print post.text.encode('utf-8')
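
A minimal sketch of the difference between the two encodings, assuming a sample string `text` that contains the ellipsis character from the error message (the variable names are illustrative, not from the question):

```python
# Sample text containing U+2026 (horizontal ellipsis), the character in the error.
text = u"Read more\u2026"

# Encoding to ASCII fails because U+2026 is outside the ASCII range (0-127).
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# Encoding to UTF-8 succeeds and yields a byte string that can be written out.
utf8_bytes = text.encode("utf-8")
```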

Answer 2

Answer by jeyraof

Did you try .decode() or .decode("utf-8")?

Also, I recommend using lxml with the html5lib parser:

http://lxml.de/html5parser.html

Answer 3

Answer by Patpog

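    import urllib.request  # Python 3 module
    from bs4 import BeautifulSoup
    html = urllib.request.urlopen(THE_URL).read()
    soup = BeautifulSoup(html, "html.parser")
    # encode("ascii") returns bytes; decode them so print shows plain text
    print("'" + soup.encode("ascii").decode("ascii") + "'")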

worked for me ;-)
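
A likely reason this works, stated as an assumption about Beautiful Soup's defaults rather than something the answer itself claims: when encoding to ASCII, characters that cannot be represented are replaced with numeric character references instead of raising an error. The standard library's `xmlcharrefreplace` error handler shows the same effect on a plain string:

```python
text = u"Read more\u2026"

# Unencodable characters become numeric character references (U+2026 is 8230).
escaped = text.encode("ascii", "xmlcharrefreplace")
```

The output is safe to print on an ASCII-only console, at the cost of showing `&#8230;` rather than the original character.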