Python: return the first N characters of a unicode string

Question

Asked by Jon Romero

I have a string in unicode and I need to return the first N characters. I am doing this:

result = unistring[:5]

but of course the length of a unicode string != the number of characters. Any ideas? Is using re the only solution?

Edit: More info

unistring = "Μεταλλικα" #Metallica written in Greek letters
result = unistring[:1]

returns-> ?

I think that unicode strings use two bytes per char, which is why this happens. If I do:

result = unistring[:2]

I get

M

which is correct. So, should I always slice with N*2, or should I convert to something?

Answer 1

Accepted answer by Tendayi Mawushe

Unfortunately, for historical reasons, prior to Python 3.0 there are two string types: byte strings (str) and Unicode strings (unicode).

Prior to the unification in Python 3.0 there are two ways to declare a string literal: unistring = "Μεταλλικα", which is a byte string, and unistring = u"Μεταλλικα", which is a unicode string.

The reason you see ? when you do result = unistring[:1] is that some of the characters in your Unicode text cannot be correctly represented in the non-unicode string. You have probably seen this kind of problem if you ever used a really old email client and received emails from friends in countries like Greece, for example.
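
To make the effect concrete, here is a minimal sketch in Python 3 terms, where the Python 2 byte string corresponds to bytes and the unicode string to str. It assumes the source file and the data are UTF-8; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Assumption: Python 3, UTF-8 data. In Python 2 the same roles are played
# by str (bytes) and unicode (text).
text = "Μεταλλικα"            # text string: one item per code point
raw = text.encode("utf-8")     # byte string: each Greek letter takes 2 bytes

print(len(text))   # 9
print(len(raw))    # 18

# Cutting after one byte splits the first letter in half, so it cannot
# be decoded; errors="replace" substitutes U+FFFD instead of raising.
broken = raw[:1].decode("utf-8", errors="replace")

# Cutting after two bytes happens to land on a letter boundary here.
whole = raw[:2].decode("utf-8")
print(whole)
```

This is why slicing with [:2] appeared to "work" in the question: it only holds while every character happens to be two bytes long.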

So in Python 2.x if you need to handle Unicode you have to do it explicitly. Take a look at this introduction to dealing with Unicode in Python: Unicode HOWTO

Answer 2

Answer by Thomas Wouters

When you say:

unistring = "Μεταλλικα" #Metallica written in Greek letters

You do not have a unicode string. You have a bytestring in (presumably) UTF-8. That is not the same thing. A unicode string is a separate datatype in Python. You get unicode by decoding bytestrings using the right encoding:

unistring = "Μεταλλικα".decode('utf-8')

or by using the unicode literal in a source file with the right encoding declaration:

# coding: UTF-8
unistring = u"Μεταλλικα"

The unicode string will do what you want when you do unistring[:5].
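
A minimal sketch of the decode-then-slice idea, written for Python 3 where bytes must be decoded to str before slicing. The helper name first_n_chars and the default encoding are assumptions for illustration, not something from the answer.

```python
# -*- coding: utf-8 -*-
def first_n_chars(data, n, encoding="utf-8"):
    """Return the first n code points of data.

    Hypothetical helper: accepts either bytes (decoded with the given
    encoding, assumed to be the correct one) or an already decoded str.
    """
    if isinstance(data, bytes):
        data = data.decode(encoding)
    return data[:n]


raw = "Μεταλλικα".encode("utf-8")   # stands in for the Python 2 byte string
print(first_n_chars(raw, 5))
```

The slice counts code points, not bytes, so no multi-byte letter is cut in half.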

Answer 3

Answer by Artyom

There is no correct straight-forward approach with any type of "Unicode string".

Even a Python "Unicode" UTF-16 string has variable-length characters, so you can't just cut with ustring[:5], because some Unicode code points may use more than one "character", i.e. surrogate pairs.
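
As a rough illustration of surrogate pairs, the sketch below encodes one code point outside the Basic Multilingual Plane to UTF-16 and unpacks the 16-bit units. It assumes Python 3.3 or later, where str itself is indexed by code point, so the UTF-16 units only become visible after encoding; the emoji is just an arbitrary example.

```python
import struct

ch = "\U0001F600"                      # one code point above U+FFFF
encoded = ch.encode("utf-16-le")       # little-endian UTF-16, no BOM
units = struct.unpack("<2H", encoded)  # two 16-bit code units

print(len(ch))                         # 1 code point in Python 3.3+
print([hex(u) for u in units])         # high surrogate, then low surrogate
```

A string type that indexes by UTF-16 unit would see two items here, and slicing between them would leave a lone surrogate.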

So if you want to cut 5 code points (note these are not characters), you have to analyze the text; see the definitions at http://en.wikipedia.org/wiki/UTF-8 and http://en.wikipedia.org/wiki/UTF-16. You need to use some bit masks to figure out the boundaries.
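
The bit-mask idea can be sketched for UTF-8 as follows. In UTF-8 every continuation byte has the form 10xxxxxx, so a byte starts a new code point exactly when (byte & 0xC0) != 0x80. The function name is hypothetical, and the sketch assumes the input is valid UTF-8.

```python
# -*- coding: utf-8 -*-
def first_n_code_points_utf8(raw, n):
    """Return the prefix of raw (valid UTF-8 bytes) holding n code points."""
    count = 0
    for i, byte in enumerate(raw):
        # 0xC0 masks the top two bits; 0x80 (10xxxxxx) marks a
        # continuation byte, anything else begins a new code point.
        if (byte & 0xC0) != 0x80:
            if count == n:
                return raw[:i]   # cut just before code point n + 1
            count += 1
    return raw                   # fewer than n + 1 code points in total


raw = "Μεταλλικα".encode("utf-8")
print(first_n_code_points_utf8(raw, 5).decode("utf-8"))
```

In practice decoding first and slicing the text string is simpler; this only shows what the boundary detection looks like on raw bytes.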

Also, you still do not get characters. For example, the word "שָלוֹם" (peace in Hebrew, "Shalom") consists of 4 characters and 6 code points: letter "shin", vowel "a", letter "lamed", letter "vav", vowel "o", and final letter "mem".

So a character is not a code point.

The same holds for most western languages, where a letter with diacritics may be represented as two code points. Search, for example, for "unicode normalization".
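
A small sketch of the normalization point, using the standard unicodedata module in Python 3. The letter "é" is only an example; the same text can be stored as one precomposed code point or as a base letter plus a combining accent.

```python
import unicodedata

composed = "\u00e9"                                   # "é" as a single code point
decomposed = unicodedata.normalize("NFD", composed)   # "e" + combining acute

print(len(composed))      # 1
print(len(decomposed))    # 2

# Both render the same but compare unequal until normalized alike.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
print(same_after_nfc)     # True
```

So decomposed[:1] would return a bare "e" and silently drop the accent.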

So... if you really need the first 5 characters, you have to use tools like the ICU library. For example, there is an ICU library for Python that provides a character boundary iterator.
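
Without ICU, a rough approximation is possible with the standard library alone: attach every combining mark to the base character before it and slice the resulting groups. This is only a sketch under that simplifying assumption; it is not the full segmentation algorithm that ICU implements and will mis-handle other cases (for example emoji sequences or Hangul jamo). The function name is hypothetical.

```python
import unicodedata

def first_n_user_chars(text, n):
    """Approximate 'first n characters' by grouping combining marks.

    Simplification: a code point with a non-zero canonical combining
    class is glued to the previous group; everything else starts a new one.
    """
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return "".join(groups[:n])


# shin + vowel "a", lamed, vav + vowel "o", final mem: 6 code points
shalom = "\u05e9\u05b8\u05dc\u05d5\u05b9\u05dd"
print(len(shalom))                         # 6
print(len(first_n_user_chars(shalom, 2)))  # 3 code points for 2 characters
```

For anything beyond simple combining marks, a real boundary iterator such as the one in ICU is still the safer choice.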
