Python: UTF-8 problem when reading characters

Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must comply with the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/985486/

Time: 2020-11-03 21:12:42  Source: igfitidea

UTF-8 problem in python when reading chars

python utf-8

Asked by jacob

I'm using Python 2.5. What is going on here? What have I misunderstood? How can I fix it?

in.txt:

Stäckövérfløw

code.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
print """Content-Type: text/plain; charset="UTF-8"\n"""
f = open('in.txt','r')
for line in f:
    print line
    for i in line:
        print i,
f.close()

output:

Stäckövérfløw

S t � � c k � � v � � r f l � � w

Answered by Miles

for i in line:
    print i,

When you read the file, what you get back is a byte string, so the for loop iterates over one byte at a time. That breaks UTF-8 encoded text, in which each non-ASCII character is represented by multiple bytes. If you want to work with Unicode objects, whose basic units are characters, open the file with codecs.open:

import codecs
f = codecs.open('in.txt', 'r', 'utf8')
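
As a rough sketch of what this changes (not part of the original answer): the snippet below is written for Python 3 rather than the asker's Python 2.5, and it uses a temporary file as a hypothetical stand-in for in.txt. The names raw, path and chars are illustrative only. It writes the UTF-8 bytes of the sample word, reads them back through codecs.open, and collects the characters.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical stand-in for in.txt: the UTF-8 bytes of the sample word.
raw = b'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w\n'

fd, path = tempfile.mkstemp(suffix='.txt')
try:
    with os.fdopen(fd, 'wb') as out:
        out.write(raw)

    # codecs.open decodes while reading, so each line is a Unicode string
    # and iterating over it yields whole characters, not single bytes.
    f = codecs.open(path, 'r', 'utf8')
    try:
        chars = [ch for line in f for ch in line.rstrip(u'\n')]
    finally:
        f.close()
finally:
    os.remove(path)

# 17 bytes on disk (ignoring the newline) become 13 characters.
print(len(raw) - 1, len(chars))
```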

If sys.stdout doesn't already have the appropriate encoding set, you may have to wrap it:

import sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
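
To see what the wrapper does, here is a small sketch written for Python 3. It assumes an in-memory io.BytesIO buffer as a hypothetical stand-in for a byte-oriented sys.stdout, so nothing is actually printed. The names buffer, writer and encoded are illustrative.

```python
import codecs
import io

# Hypothetical stand-in for a byte-oriented output stream such as
# Python 2's sys.stdout.
buffer = io.BytesIO()

# codecs.getwriter returns a StreamWriter class; instantiating it wraps
# the stream so that Unicode text is encoded as it is written.
writer = codecs.getwriter('utf8')(buffer)
writer.write(u'St\xe4ck')

encoded = buffer.getvalue()
print(repr(encoded))
```

The single character u'\xe4' reaches the underlying stream as the two bytes \xc3\xa4.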

Answered by mhawke

Use codecs.open instead; it works for me.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
print """Content-Type: text/plain; charset="UTF-8"\n"""
# codecs.open decodes the file as UTF-8, so each line is a unicode object
f = codecs.open('in.txt','r','utf8')
for line in f:
    print line
    for i in line:
        print i,
f.close()

Answered by mikl

Check this out:

# -*- coding: utf-8 -*-
import pprint
f = open('unicode.txt','r')
for line in f:
    print line
    pprint.pprint(line)
    for i in line:
        print i,
f.close()

It returns this:

Stäckövérfløw
'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'
S t � � c k � � v � � r f l � � w

The thing is that the file is just being read as a string of bytes. Iterating over them splits the multibyte characters into nonsensical byte values.
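
The same point can be checked without a file. This is a sketch written for Python 3, using the byte string from the pprint output above. The variable names are illustrative.

```python
# The bytes shown in the pprint output above.
data = b'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'
text = data.decode('utf-8')

byte_count = len(data)   # what iterating over the raw string sees
char_count = len(text)   # what iterating over the decoded text sees

# Slicing keeps this a one-byte byte string: the first half of the
# two-byte sequence for the third character.
lead_byte = data[2:3]
try:
    lead_byte.decode('utf-8')
    lone_byte_decodes = True
except UnicodeDecodeError:
    lone_byte_decodes = False

print(byte_count, char_count, lone_byte_decodes)  # 17 13 False
```

Each of the four non-ASCII characters takes two bytes, which accounts for the difference of four, and a lone byte from such a pair is not valid UTF-8 on its own.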

Answered by Artyom

print c,

Adds a space character after every byte it prints, which breaks valid UTF-8 sequences into invalid ones. So this will not work unless you write each single byte to the output directly:

sys.stdout.write(i)
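
A small sketch of why the separator matters, written for Python 3. It does not call print at all: joining the bytes with b' ' is an assumed model of what Python 2's print i, sends to the terminal, and joining with b'' models sys.stdout.write(i). The names are illustrative.

```python
# UTF-8 encoding of a single non-ASCII character: two bytes.
encoded = u'\xe4'.encode('utf-8')

# Split into one-byte strings, as iterating over a byte string does in Python 2.
single_bytes = [encoded[i:i + 1] for i in range(len(encoded))]

written_directly = b''.join(single_bytes)      # models sys.stdout.write(i)
printed_with_spaces = b' '.join(single_bytes)  # models print i,

roundtrip_ok = written_directly.decode('utf-8') == u'\xe4'
try:
    printed_with_spaces.decode('utf-8')
    spaced_ok = True
except UnicodeDecodeError:
    spaced_ok = False

print(roundtrip_ok, spaced_ok)  # True False
```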

Answered by j1k00

Alternatively, you may just want to decode each line as you iterate over it:

f = open('in.txt','r')
for line in f:
    print line
    for i in line.decode('utf-8'):
        print i,
f.close()
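
For comparison, a sketch of the same decode-per-line idea written for Python 3. Here a file opened with mode 'r' already yields text, so the sketch assumes the data is read as bytes. An in-memory io.BytesIO stands in for open('in.txt', 'rb'), and the names are illustrative.

```python
import io

# Hypothetical stand-in for open('in.txt', 'rb'): the raw UTF-8 bytes.
raw = b'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w\n'
f = io.BytesIO(raw)

chars = []
for line in f:
    # Each line is still a byte string; decoding it first means the inner
    # loop sees whole characters instead of single bytes.
    for ch in line.decode('utf-8'):
        chars.append(ch)
f.close()

print(len(chars))
```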