Python: removing \u2018 and \u2019 characters

Question

Asked by bhavesh

I am using Beautiful Soup to parse webpages and print the names of the visited pages on the terminal. However, a page name often contains the left (\u2018) and right (\u2019) single quotation mark characters, which Python can't print: it raises a charmap encoding error. Is there any way to remove these characters?

Answer 1

Accepted answer by Martin Konecny

These codes are the Unicode escapes for the left and right single quote characters. You can replace them with their ASCII equivalent, which Python shouldn't have any problem printing on your system:

>>> print u"\u2018Hi\u2019"
‘Hi’
>>> print u"\u2018Hi\u2019".replace(u"\u2018", "'").replace(u"\u2019", "'")
'Hi'

Alternatively, with a regex:

>>> import re
>>> s = u"\u2018Hi\u2019"
>>> print re.sub(u"(\u2018|\u2019)", "'", s)
'Hi'
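
The snippets above use Python 2 syntax (the print statement and u"" literals). For readers on Python 3, here is a minimal sketch of the same replacement using str.translate. The names QUOTE_TABLE and straighten_quotes are illustrative and do not come from the original answer.

```python
# Translation table: each curly single quote maps to a plain ASCII apostrophe.
# QUOTE_TABLE and straighten_quotes are hypothetical names used for illustration.
QUOTE_TABLE = str.maketrans({"\u2018": "'", "\u2019": "'"})


def straighten_quotes(text):
    """Return text with U+2018 and U+2019 replaced by an ASCII apostrophe."""
    return text.translate(QUOTE_TABLE)


print(straighten_quotes("\u2018Hi\u2019"))  # prints 'Hi'
```

A single translate call handles both characters in one pass, and further characters (for example the curly double quotes) could be added to the table if needed.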

However, Python shouldn't have any problem printing the Unicode version of these either. It's possible that you are calling str() somewhere, which will try to convert your unicode string to ASCII and raise the exception.
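
If the goal is only to avoid the encoding error rather than to normalise the quotes, one option is to encode the text for the terminal with a lenient error handler. The following is a rough Python 3 sketch under that assumption; the helper name printable and its encoding parameter are hypothetical, and the fallback to "ascii" is a choice made here, not something the original answer prescribes.

```python
import sys


def printable(text, encoding=None):
    """Round-trip text through the target encoding.

    Characters the encoding cannot represent are replaced with '?'
    instead of raising UnicodeEncodeError.
    """
    # Use the terminal's encoding when known; fall back to ASCII (an assumption).
    enc = encoding or sys.stdout.encoding or "ascii"
    return text.encode(enc, errors="replace").decode(enc)


# Forcing ASCII here to show the substitution regardless of the terminal.
print(printable("\u2018Hi\u2019", encoding="ascii"))  # prints ?Hi?
```

This keeps the output printable on any terminal, at the cost of losing the original characters; replacing the quotes explicitly, as in the accepted answer, gives more readable output.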
