How to decode unicode in a Chinese text

Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/33294213/

Asked by YAL
I have a file containing a string, and I am trying to read the file, split the words by spaces, and save them into a list. Below is my code:

with open('result.txt', 'r') as f:
    data = f.read()

print 'What type is my data:'
print type(data)

for i in data:
    print "what is i:"
    print i
    print "what type is i"
    print type(i)
    print i.encode('utf-8')
Someone please help!
Update:
I am going to describe what I am trying to do in detail here to give people more context. The goal is:

1. Take a Chinese text and break it down into sentences by detecting basic ending punctuation.
2. Take each sentence and use the tool jieba to tokenize its characters into meaningful words. For instance, the two Chinese characters 學 and 生 will be grouped together to produce the token '學生' (meaning "student").
3. Save all the tokens from each sentence into a list. The final list will contain multiple lists, since a paragraph contains multiple sentences.
# coding: utf-8
import jieba

cutlist = "。!?".decode('utf-8')

test = "【明報專訊】「吉野家」and Peter from US因被誤傳採用日本福島米而要報警澄清,並自爆用內地黑龍江米,日本料理食材來源惹關注。本報以顧客身分向6間日式食店查詢白米產地,其中出售逾200元日式豬扒飯套餐的「勝博殿日式炸豬排」也選用中國大連米,誤以為該店用日本米的食客稱「要諗吓會否再幫襯」,亦有食客稱「好食就得」;壽司店「板長」店員稱採用香港米,公關其後澄清來源地是澳洲,即與平價壽司店「爭鮮」一樣。有飲食界人士稱,雖然日本米較貴、品質較佳,但內地米品質亦有保證。"

# FindToken checks whether the character is an ending punctuation mark
def FindToken(cutlist, char):
    if char in cutlist:
        return True
    else:
        return False

'''
cut checks each item in a string list; if the item is not an ending
punctuation mark, it is saved to a temporary list called line. When an
ending punctuation mark is encountered, the complete sentence collected
in line is saved into the final list.
'''
def cut(cutlist, test):
    l = []
    line = []
    final = []

    for i in test:
        if i == ' ':
            line.append(i)
        elif FindToken(cutlist, i):
            line.append(i)
            l.append(''.join(line))
            line = []
        else:
            line.append(i)

    temp = []
    # This part iterates over each complete sentence and groups
    # characters according to their context.
    for i in l:
        # This is the function that breaks down a sentence of characters
        # and groups them into phrases.
        process = list(jieba.cut(i, cut_all=False))
        # Put all the tokenized phrases of one sentence into a list;
        # each sentence gets its own list.
        for j in process:
            temp.append(j.encode('utf-8'))
            #temp.append(j)
        print temp
        final.append(temp)
        temp = []
    return final

cut(list(cutlist), list(test.decode('utf-8')))
Here is my problem: when I output my final list, it gives me the following result:
[u'\u3010', u'\u660e\u5831', u'\u5c08\u8a0a', u'\u3011', u'\u300c', u'\u5409\u91ce\u5bb6', u'\u300d', u'and', u' ', u'Peter', u' ', u'from', u' ', u'US', u'\u56e0', u'\u88ab', u'\u8aa4\u50b3', u'\u63a1\u7528', u'\u65e5\u672c', u'\u798f\u5cf6', u'\u7c73', u'\u800c', u'\u8981', u'\u5831\u8b66', u'\u6f84\u6e05', u'\uff0c', u'\u4e26', u'\u81ea\u7206', u'\u7528\u5167', u'\u5730', u'\u9ed1\u9f8d', u'\u6c5f\u7c73', u'\uff0c', u'\u65e5\u672c\u6599\u7406', u'\u98df\u6750', u'\u4f86\u6e90', u'\u60f9', u'\u95dc\u6ce8', u'\u3002']
How can I turn a list of unicode into a normal string?
Answered by ShadowRanger
When you call encode on a str with most (all?) codecs (for which encode really makes no sense; str is a byte-oriented type, not a true text type like unicode that would require encoding), Python is implicitly decode-ing it as ASCII first, then encoding with your specified encoding. If you want the str to be interpreted as something other than ASCII, you need to decode from the bytes-like str to true-text unicode yourself.
When you do i.encode('utf-8') when i is a str, you're implicitly saying i is logically text (represented by bytes in the locale default encoding), not binary data. So in order to encode it, it first needs to decode it to determine what the "logical" text is. Your input is probably encoded in some ASCII superset (e.g. latin-1, or even utf-8), and contains non-ASCII bytes; it tries to decode them using the ascii codec (to figure out the true Unicode ordinals it needs to encode as utf-8), and fails.
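For example, a minimal sketch on Python 2 (the byte string is an assumed example: the UTF-8 bytes for 學):

s = '\xe5\xad\xb8'   # a str holding UTF-8 bytes, not true text
try:
    s.encode('utf-8')   # implicitly runs s.decode('ascii') first, which fails
except UnicodeDecodeError as e:
    print e             # 'ascii' codec can't decode byte 0xe5 in position 0 ...
print s.decode('utf-8').encode('utf-8') == s   # True: decode first, then encode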
You need to do one of:
- Explicitly decode the str you read using the correct codec (to get a unicode object), then encode that back to utf-8.
- Let Python do the work from #1 for you implicitly. Instead of using open, import io and use io.open (Python 2.7+ only; on Python 3+, io.open and open are the same function), which gets you an open that works like Python 3's open. You can pass this open an encoding argument (e.g. io.open('/path/to/file', 'r', encoding='latin-1')) and read-ing from the resulting file object will get you already-decode-d unicode objects (that can then be encode-d to whatever you like).
Note: #1 will not work if the real encoding is something like utf-8 and you defer the work until you're iterating character by character. For non-ASCII characters, utf-8 is multibyte, so if you only have one byte, you can't decode (because the following bytes are needed to calculate a single ordinal). This is a reason to prefer using io.open to read as unicode natively, so you're not worrying about stuff like this.
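For instance, a minimal sketch of both options on Python 2 (assuming result.txt actually contains UTF-8 data):

# Option 1: decode the whole str yourself, then re-encode when needed.
with open('result.txt', 'r') as f:
    data = f.read()              # str (bytes)
text = data.decode('utf-8')      # unicode
print text.encode('utf-8')       # back to UTF-8 bytes for output

# Option 2: let io.open do the decoding while reading.
import io
with io.open('result.txt', 'r', encoding='utf-8') as f:
    text = f.read()              # already unicode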
Answered by mpontillo
Let me give you some hints:
- You'll need to decode the bytes you read from UTF-8 into Unicode before you try to iterate over the words.
- When you read a file, you won't get Unicode back. You'll just get plain bytes. (I think you knew that, since you're already using decode().)
- There is a standard function to "split by space" called split().
- When you say for i in data, you're saying you want to iterate over every byte of the file you just read. Each iteration of your loop will be a single character. I'm not sure if that's what you want, because that would mean you'd have to do UTF-8 decoding by hand (rather than using decode(), which must operate on the entire UTF-8 string).
In other words, here's one line of code that would do it:
open('file.txt').read().decode('utf-8').split()
If this is homework, please don't turn that in. Your teacher will be onto you. ;-)
Edit: Here's an example of how to encode and decode Unicode characters in Python:
>>> data = u"わかりません"
>>> data
u'\u308f\u304b\u308a\u307e\u305b\u3093'
>>> data_you_would_see_in_a_file = data.encode('utf-8')
>>> data_you_would_see_in_a_file
'\xe3\x82\x8f\xe3\x81\x8b\xe3\x82\x8a\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93'
>>> for each_unicode_character in data_you_would_see_in_a_file.decode('utf-8'):
... print each_unicode_character
...
わ
か
り
ま
せ
ん
The first thing to note is that Python (well, at least Python 2) uses the u"" notation (note the u prefix) on string constants to show that they are Unicode. In Python 3, strings are Unicode by default, but you can use b"" if you want a byte string.
As you can see, the Unicode string is composed of two-byte characters. When you read the file, you get a string of one-byte characters (which is equivalent to what you get when you call .encode()). So if you have bytes from a file, you must call .decode() to convert them back into Unicode. Then you can iterate over each character.
Splitting "by space" is something unique to every language, since many languages (for example, Chinese and Japanese) do not uses the ' '
character, like most European languages would. I don't know how to do that in Python off the top of my head, but I'm sure there is a way.
“按空格”拆分对每种语言来说都是独一无二的,因为许多语言(例如,中文和日语)' '
不像大多数欧洲语言那样使用该字符。我不知道如何在 Python 中做到这一点,但我确信有一种方法。
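For the Chinese case, the question itself already uses jieba for exactly this. A minimal Python 2 sketch, with an example sentence of my own choosing:

# -*- coding: utf-8 -*-
import jieba
words = list(jieba.cut(u'我们尊重原创。'))   # segment a unicode sentence into words
print '/'.join(words).encode('utf-8')        # e.g. 我们/尊重/原创/。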
Answered by jfs
data is a bytestring (str type on Python 2). Your loop looks at one byte at a time (non-ASCII characters may be represented using more than one byte in UTF-8).
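A quick sketch of what that means in practice (the byte string is an assumed example: the UTF-8 bytes for 學生):

s = '\xe5\xad\xb8\xe7\x94\x9f'
print len(s)                  # 6 -- six bytes
print len(s.decode('utf-8'))  # 2 -- but only two characters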
Don't call .encode() on bytes:
$ python2
>>> '\xe3'.encode('utf-8') #XXX don't do it
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
I am trying to read the file and split the words by space and save them into a list.
To work with Unicode text, use the unicode type in Python 2. You could use io.open() to read Unicode text from a file (here's the code that collects all space-separated words into a list):
#!/usr/bin/env python
import io

with io.open('result.txt', encoding='utf-8') as file:
    words = [word for line in file for word in line.split()]

print "\n".join(words)
Answered by Nianliang
Encoding:
$ python
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import base64
>>> base64.b64encode("我们尊重原创。".encode('utf-8'))
b'5oiR5Lus5bCK6YeN5Y6f5Yib44CC'
Decoding:
>>> import base64
>>> str='5oiR5Lus5bCK6YeN5Y6f5Yib44CC'
>>> base64.b64decode(str)
b'\xe6\x88\x91\xe4\xbb\xac\xe5\xb0\x8a\xe9\x87\x8d\xe5\x8e\x9f\xe5\x88\x9b\xe3\x80\x82'
>>> base64.b64decode(str).decode('utf-8')
'我们尊重原创。'
>>>