Python: Removing \u2018 and \u2019 characters

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/24358361/

Time: 2020-08-19 04:28:38  Source: igfitidea

Removing \u2018 and \u2019 characters

python

Asked by bhavesh

I am using Beautiful Soup to parse webpages and print the name of each visited page on the terminal. However, the page name often contains the left (\u2018) and right (\u2019) single quote characters, which Python can't print: it raises a charmap encoding error. Is there any way to remove these characters?


Accepted answer by Martin Konecny

These are the Unicode escape codes for the left and right single quotation marks. You can replace them with their ASCII equivalent, the apostrophe, which Python should have no problem printing on your system:


>>> print u"\u2018Hi\u2019"
‘Hi’
>>> print u"\u2018Hi\u2019".replace(u"\u2018", "'").replace(u"\u2019", "'")
'Hi'

Alternatively, with a regex:


>>> import re
>>> s = u"\u2018Hi\u2019"
>>> print re.sub(u"[\u2018\u2019]", "'", s)
'Hi'
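
The two replacements above can also be folded into a single lookup table. The following is a minimal sketch, written for Python 3 (the session above uses Python 2 syntax); the names `SMART_QUOTES` and `ascii_quotes` are hypothetical and not part of the original answer.

```python
# Hypothetical mapping table: code point -> ASCII replacement.
SMART_QUOTES = {
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
}


def ascii_quotes(text):
    """Return text with curly single quotes replaced by apostrophes."""
    # str.translate accepts a dict keyed by Unicode code points.
    return text.translate(SMART_QUOTES)


title = "\u2018Hi\u2019"
print(ascii_quotes(title))
```

A table like this may be easier to extend than chained replace() calls if other characters turn out to cause the same error.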

However, Python shouldn't have any problem printing the Unicode versions of these characters either. It's possible that you are calling str() somewhere, which will try to convert your unicode object to ASCII and raise the exception.
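
If the goal is to print the title without raising, one option is to encode it for the terminal with a lenient error handler. The sketch below is written for Python 3 and assumes the console uses the cp437 code page, which has no curly quotes; that code page is a guess, since the question does not say which encoding the terminal uses.

```python
title = "\u2018Hi\u2019"

# Assumption: the terminal encoding is cp437 (not stated in the question).
CONSOLE_ENCODING = "cp437"

try:
    title.encode(CONSOLE_ENCODING)
except UnicodeEncodeError as exc:
    # The code page has no byte for U+2018, so strict encoding fails.
    print("cannot encode:", exc.reason)

# errors="replace" substitutes "?" for every character the code page lacks.
safe = title.encode(CONSOLE_ENCODING, errors="replace").decode(CONSOLE_ENCODING)
print(safe)
```

This keeps the rest of the title intact and only degrades the characters the terminal cannot show.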
