Python: how to decode unicode in Chinese text

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/33294213/

Date: 2020-08-19 13:10:05  Source: igfitidea

How to decode unicode in a Chinese text

Tags: python, unicode

Asked by YAL

I have a file with a string, and I am trying to read the file, split the words by space, and save them into a list. Below is my code:

with open('result.txt', 'r') as f:
    data = f.read()

print 'What type is my data:'
print type(data)

for i in data:
    print "what is i:"
    print i
    print "what type is i:"
    print type(i)
    print i.encode('utf-8')  # this is the line that fails for non-ASCII bytes

Below is my error message: [screenshot of a UnicodeDecodeError traceback]

Someone please help!

Update:

I am going to describe what I am trying to do in detail here to give people more context. The goal is:

1. Take a Chinese text and break it down into sentences by detecting basic ending punctuation.
2. Take each sentence and use the tool jieba to tokenize characters into meaningful words. For instance, the two Chinese characters 學 and 生 will be grouped together to produce the token '學生' (meaning "student").
3. Save all the tokens from each sentence into a list. The final list will contain multiple lists, since a paragraph has multiple sentences.

# coding: utf-8

import jieba

cutlist = "。!?".decode('utf-8')
test = "【明報專訊】「吉野家」and Peter from US因被誤傳採用日本福島米而要報警澄清,並自爆用內地黑龍江米,日本料理食材來源惹關注。本報以顧客身分向6間日式食店查詢白米產地,其中出售逾200元日式豬扒飯套餐的「勝博殿日式炸豬排」也選用中國大連米,誤以為該店用日本米的食客稱「要諗吓會否再幫襯」,亦有食客稱「好食就得」;壽司店「板長」店員稱採用香港米,公關其後澄清來源地是澳洲,即與平價壽司店「爭鮮」一樣。有飲食界人士稱,雖然日本米較貴、品質較佳,但內地米品質亦有保證。"

# FindToken checks whether the character is an ending punctuation mark.
def FindToken(cutlist, char):
    return char in cutlist

# cut checks each item in a string list; if the item is not an ending
# punctuation mark, it is appended to a temporary list called line.
# When an ending punctuation mark is encountered, the complete sentence
# collected in line is saved into the list l.
def cut(cutlist, test):
    l = []
    line = []
    final = []

    for i in test:
        if i == ' ':
            line.append(i)
        elif FindToken(cutlist, i):
            line.append(i)
            l.append(''.join(line))
            line = []
        else:
            line.append(i)

    temp = []
    # This part iterates over each complete sentence and groups its
    # characters according to context.
    for i in l:
        # jieba.cut breaks a sentence of characters into meaningful phrases.
        process = list(jieba.cut(i, cut_all=False))

        # Put all the tokenized phrases of one sentence into a list;
        # each sentence gets its own list.
        for j in process:
            temp.append(j.encode('utf-8'))
        print temp

        final.append(temp)
        temp = []
    return final


cut(list(cutlist), list(test.decode('utf-8')))

Here is my problem: when I output my final list, it gives me the following result:

[u'\u3010', u'\u660e\u5831', u'\u5c08\u8a0a', u'\u3011', u'\u300c', u'\u5409\u91ce\u5bb6', u'\u300d', u'and', u' ', u'Peter', u' ', u'from', u' ', u'US', u'\u56e0', u'\u88ab', u'\u8aa4\u50b3', u'\u63a1\u7528', u'\u65e5\u672c', u'\u798f\u5cf6', u'\u7c73', u'\u800c', u'\u8981', u'\u5831\u8b66', u'\u6f84\u6e05', u'\uff0c', u'\u4e26', u'\u81ea\u7206', u'\u7528\u5167', u'\u5730', u'\u9ed1\u9f8d', u'\u6c5f\u7c73', u'\uff0c', u'\u65e5\u672c\u6599\u7406', u'\u98df\u6750', u'\u4f86\u6e90', u'\u60f9', u'\u95dc\u6ce8', u'\u3002']

How can I turn a list of unicode into normal strings?

Answered by ShadowRanger

When you call encode on a str with most (all?) codecs (for which encode really makes no sense; str is a byte-oriented type, not a true text type like unicode that would require encoding), Python is implicitly decoding it as ASCII first, then encoding with your specified encoding. If you want the str to be interpreted as something other than ASCII, you need to decode from the bytes-like str to true text unicode yourself.

When you do i.encode('utf-8') when i is a str, you're implicitly saying i is logically text (represented by bytes in the locale default encoding), not binary data. So in order to encode it, it first needs to decode it to determine what the "logical" text is. Your input is probably encoded in some ASCII superset (e.g. latin-1, or even utf-8), and contains non-ASCII bytes; it tries to decode them using the ascii codec (to figure out the true Unicode ordinals it needs to encode as utf-8), and fails.

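For example, in a Python 2 shell (a minimal sketch; the byte values assume the input is UTF-8 encoded):

>>> s = '\xe5\xad\xb8'        # the UTF-8 bytes for 學, type str
>>> s.encode('utf-8')         # implicitly does s.decode('ascii').encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
>>> s.decode('utf-8').encode('utf-8') == s   # an explicit decode first works
True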

You need to do one of:

  1. Explicitly decode the str you read using the correct codec (to get a unicode object), then encode that back to utf-8.
  2. Let Python do the work from #1 for you implicitly. Instead of using open, import io and use io.open (Python 2.7+ only; on Python 3+, io.open and open are the same function), which gets you an open that works like Python 3's open. You can pass this open an encoding argument (e.g. io.open('/path/to/file', 'r', encoding='latin-1')) and reading from the resulting file object will get you already-decoded unicode objects (that can then be encoded to whatever you like).

Note: #1 will not work if the real encoding is something like utf-8 and you defer the work until you're iterating character by character. For non-ASCII characters, utf-8 is multibyte, so if you only have one byte, you can't decode (because the following bytes are needed to calculate a single ordinal). This is a reason to prefer using io.open to read as unicode natively so you're not worrying about stuff like this, as the sketch below shows.

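A minimal sketch of both points in a Python 2 shell (assuming result.txt, the file from the question, is UTF-8 encoded):

>>> '\xe5\xad\xb8'.decode('utf-8')  # the full three-byte sequence for 學: fine
u'\u5b78'
>>> '\xe5'.decode('utf-8')          # a single byte of that sequence: fails
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 0: unexpected end of data
>>> import io
>>> f = io.open('result.txt', 'r', encoding='utf-8')
>>> data = f.read()   # data is already-decoded unicode
>>> type(data)
<type 'unicode'>
>>> f.close()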

Answered by mpontillo

Let me give you some hints:

  • You'll need to decode the bytes you read from UTF-8 into Unicode before you try to iterate over the words.
  • When you read a file, you won't get Unicode back. You'll just get plain bytes. (I think you knew that, since you're already using decode().)
  • There is a standard function to "split by space" called split().
  • When you say for i in data, you're saying you want to iterate over every byte of the file you just read. Each iteration of your loop will be a single character. I'm not sure if that's what you want, because that would mean you'd have to do UTF-8 decoding by hand (rather than using decode(), which must operate on the entire UTF-8 string).

In other words, here's one line of code that would do it:

open('file.txt').read().decode('utf-8').split()

If this is homework, please don't turn that in. Your teacher will be onto you. ;-)

Edit: Here's an example of how to encode and decode Unicode characters in Python:

>>> data = u"わかりません"
>>> data
u'\u308f\u304b\u308a\u307e\u305b\u3093'
>>> data_you_would_see_in_a_file = data.encode('utf-8')
>>> data_you_would_see_in_a_file
'\xe3\x82\x8f\xe3\x81\x8b\xe3\x82\x8a\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93'
>>> for each_unicode_character in data_you_would_see_in_a_file.decode('utf-8'):
...     print each_unicode_character
... 
わ
か
り
ま
せ
ん

The first thing to note is that Python (well, at least Python 2) uses the u"" notation (note the u prefix) on string constants to show that they are Unicode. In Python 3, strings are Unicode by default, but you can use b"" if you want a byte string.

As you can see, the Unicode string is composed of two-byte characters. When you read the file, you get a string of one-byte characters (which is equivalent to what you get when you call .encode()). So if you have bytes from a file, you must call .decode() to convert them back into Unicode. Then you can iterate over each character.

Splitting "by space" is something unique to every language, since many languages (for example, Chinese and Japanese) do not uses the ' 'character, like most European languages would. I don't know how to do that in Python off the top of my head, but I'm sure there is a way.

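One way, which the question above already uses, is the third-party jieba library for segmenting Chinese text. A minimal Python 2 sketch (the sample sentence is borrowed from another answer on this page):

# -*- coding: utf-8 -*-
import jieba

text = u"我们尊重原创。"  # "We respect originality."
# jieba.cut segments a sentence into meaningful words
for word in jieba.cut(text, cut_all=False):
    print word.encode('utf-8')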

Answered by jfs

data is a bytestring (the str type on Python 2). Your loop looks at one byte at a time (non-ASCII characters may be represented using more than one byte in utf-8).

Don't call .encode() on bytes:

$ python2
>>> '\xe3'.encode('utf-8') #XXX don't do it
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

I am trying to read the file and split the words by space and save them into a list.

To work with Unicode text, use the unicode type in Python 2. You could use io.open() to read Unicode text from a file (here's the code that collects all space-separated words into a list):

#!/usr/bin/env python
import io

with io.open('result.txt', encoding='utf-8') as file:
    words = [word for line in file for word in line.split()]
print "\n".join(words)

Answered by Nianliang

Encoding:

$ python
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import base64
>>> base64.b64encode("我们尊重原创。".encode('utf-8'))
b'5oiR5Lus5bCK6YeN5Y6f5Yib44CC'

Decoding:

>>> import base64
>>> str='5oiR5Lus5bCK6YeN5Y6f5Yib44CC'
>>> base64.b64decode(str)
b'\xe6\x88\x91\xe4\xbb\xac\xe5\xb0\x8a\xe9\x87\x8d\xe5\x8e\x9f\xe5\x88\x9b\xe3\x80\x82'
>>> base64.b64decode(str).decode('utf-8')
'我们尊重原创。'
>>>