如何在 Python 中使用 unicode

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/752998/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-11-03 20:47:39  来源:igfitidea点击:

How to work with unicode in Python

pythonstringunicodereplaceunicode-string

提问by PyNEwbie

I am trying to clean all of the HTML out of a string so the final output is a text file. I have some some research on the various 'converters' and am starting to lean towards creating my own dictionary for the entities and symbols and running a replace on the string. I am considering this because I want to automate the process and there is a lot of variability in the quality of the underlying html. To begin comparing the speed of my solution and one of the alternatives for example pyparsing I decided to test replace of \xa0 using the string method replace. I get a

我试图从字符串中清除所有 HTML,因此最终输出是一个文本文件。我对各种“转换器”进行了一些研究,并且开始倾向于为实体和符号创建我自己的字典并在字符串上运行替换。我正在考虑这一点,因为我想自动化该过程,并且底层 html 的质量存在很大差异。为了开始比较我的解决方案和替代方案之一的速度,例如 pyparsing,我决定使用字符串方法替换来测试 \xa0 的替换。我得到一个

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

The actual line of code was

实际的代码行是

s=unicodestring.replace('\xa0','')

Anyway-I decided that I needed to preface it with an r so I ran this line of code:

无论如何 - 我决定我需要以 r 开头,所以我运行了这行代码:

s=unicodestring.replace(r'\xa0','')

It runs without error but I when I look at a slice of s I see that the \xaO is still there

它运行时没有错误,但是当我查看 s 的一部分时,我看到 \xaO 仍然存在

回答by z33m

may be you should be doing

可能是你应该做的

s=unicodestring.replace(u'\xa0',u'')

回答by dbr

s=unicodestring.replace('\xa0','')

..is trying to create the unicode character \xa0, which is not valid in an ASCII sctring (the default string type in Python until version 3.x)

.. 正在尝试创建 unicode 字符\xa0,该字符在 ASCII 字符串中无效(Python 中的默认字符串类型,直到 3.x 版)

The reason r'\xa0'did not error is because in a raw string, escape sequences have no effect. Rather than trying to encode \xa0into the unicode character, it saw the string as a "literal backslash", "literal x" and so on..

r'\xa0'没有错误的原因是因为在原始字符串中,转义序列不起作用。它没有尝试编码\xa0为 unicode 字符,而是将字符串视为“文字反斜杠”、“文字 x”等。

The following are the same:

以下是相同的:

>>> r'\xa0'
'\xa0'
>>> '\xa0'
'\xa0'

This is something resolved in Python v3, as the default string type is unicode, so you can just do..

这是在 Python v3 中解决的问题,因为默认的字符串类型是 unicode,所以你可以做..

>>> '\xa0'
'\xa0'

I am trying to clean all of the HTML out of a string so the final output is a text file

我正在尝试从字符串中清除所有 HTML,因此最终输出是一个文本文件

I would strongly recommend BeautifulSoupfor this. Writing an HTML cleaning tool is difficult (given how horrible most HTML is), and BeautifulSoup does a great job at both parsing HTML, and dealing with Unicode..

为此,我强烈推荐BeautifulSoup。编写 HTML 清理工具很困难(考虑到大多数 HTML 是多么可怕),而 BeautifulSoup 在解析 HTML 和处理 Unicode 方面都做得很好。

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<html><body><h1>Hi</h1></body></html>")
>>> print soup.prettify()
<html>
 <body>
  <h1>
   Hi
  </h1>
 </body>
</html>

回答by Wayne Koorts

Look at the codecsstandard library, specifically the encodeand decodemethods provided in the Codec base class.

查看codecs标准库,特别是 Codec 基类中提供的编码解码方法。

There's also a good article herethat puts it all together.

还有一个很好的文章在这里将所有这些在一起。

回答by Tejas Tank

Instead of this, it's better to use standard python features.

相反,最好使用标准的 Python 功能。

For example:

例如:

string = unicode('Hello, \xa0World', 'utf-8', 'replace')

or

或者

string = unicode('Hello, \xa0World', 'utf-8', 'ignore')

where replacewill replace \xa0to \\xa0.

在哪里replace将替换\xa0\\xa0.

But if \xa0is really not meaningful for you and you want to remove it then use ignore.

但是如果\xa0对你来说真的没有意义并且你想删除它,那么使用ignore.

回答by ólafur Waage

Just a note regarding HTML cleaning. It is very very hard, since

只是关于 HTML 清理的说明。这非常非常困难,因为

<
body
>

Is a valid way to write HTML. Just an fyi.

是编写 HTML 的有效方式。仅供参考。

回答by Jason Coon

You can convert it to unicode in this way:

您可以通过以下方式将其转换为 unicode:

print u'Hello, \xa0World'  # print Hello,  World