Original source: http://stackoverflow.com/questions/985486/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
UTF-8 problem in python when reading chars
Asked by jacob
I'm using Python 2.5. What is going on here? What have I misunderstood? How can I fix it?
in.txt:
Stäckövérfløw
code.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print """Content-Type: text/plain; charset="UTF-8"\n"""
f = open('in.txt','r')
for line in f:
    print line
    for i in line:
        print i,
f.close()
output:
Stäckövérfløw
S t ? ? c k ? ? v ? ? r f l ? ? w
Answered by Miles
for i in line:
    print i,
When you read the file, the string you read in is a string of bytes. The for loop iterates over it one byte at a time. This causes problems with a UTF-8 encoded string, where non-ASCII characters are represented by multiple bytes. If you want to work with Unicode objects, where characters are the basic units, you should use
import codecs
f = codecs.open('in.txt', 'r', 'utf8')
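To make the difference concrete, here is a minimal sketch that reads the same line twice, once as raw bytes and once through codecs.open. The scratch file is a hypothetical stand-in for in.txt, and its bytes are copied from the pprint output quoted on this page. It is written so that it also runs on Python 3 (an assumption that goes beyond the Python 2.5 setup in the question), where the bytes type plays the role of the Python 2 str.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# UTF-8 bytes for the sample word, copied from the pprint output on this page.
raw = b'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w\n'

# Hypothetical scratch file standing in for in.txt.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with open(path, 'wb') as out:
        out.write(raw)

    # Binary read: the line is a sequence of bytes.
    with open(path, 'rb') as f:
        byte_line = f.readline()

    # codecs.open decodes while reading: the line is a sequence of characters.
    with codecs.open(path, 'r', 'utf8') as f:
        text_line = f.readline()
finally:
    os.remove(path)

# Strip the trailing newline before counting.
byte_count = len(byte_line.rstrip())
char_count = len(text_line.rstrip())
print(byte_count, char_count)
```

The byte count is larger than the character count because each of the four accented letters takes two bytes in UTF-8.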
If sys.stdout doesn't already have the appropriate encoding set, you may have to wrap it:
import sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
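What the wrapper does can be sketched without touching the real sys.stdout. In the example below an io.BytesIO object is a hypothetical stand-in for a byte-oriented output stream; the only call taken from the answer is codecs.getwriter('utf8'). The u'' literal assumes Python 2.6+ or Python 3.3+.

```python
# -*- coding: utf-8 -*-
import codecs
import io

# Hypothetical stand-in for a byte-oriented stdout.
byte_stream = io.BytesIO()

# codecs.getwriter returns a StreamWriter class; instantiating it wraps the stream.
writer = codecs.getwriter('utf8')(byte_stream)

# Unicode text written to the wrapper is encoded before it reaches the stream.
writer.write(u'St\xe4ck\xf6v\xe9rfl\xf8w')

encoded = byte_stream.getvalue()
print(repr(encoded))
```

The wrapper accepts Unicode text and hands UTF-8 bytes to the underlying stream, which is why printing Unicode objects works once sys.stdout is wrapped this way.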
Answered by mhawke
Use codecs.open instead; it works for me.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
print """Content-Type: text/plain; charset="UTF-8"\n"""
# codecs.open decodes the file, so each line is a unicode object
f = codecs.open('in.txt','r','utf8')
for line in f:
    print line
    for i in line:
        print i,
f.close()
Answered by mikl
Check this out:
# -*- coding: utf-8 -*-
import pprint
f = open('unicode.txt','r')
for line in f:
    print line
    pprint.pprint(line)
    for i in line:
        print i,
f.close()
It returns this:
Stäckövérfløw
'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'
S t ? ? c k ? ? v ? ? r f l ? ? w
The problem is that the file is read as a string of bytes. Iterating over it splits the multibyte characters into individual byte values that are meaningless on their own.
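A small sketch of that splitting, assuming the byte string from the pprint output above. It slices the data one byte at a time and tries to decode each piece separately. The slicing idiom is used so that the example also behaves the same on Python 3, which is an assumption beyond the original Python 2.5 setup.

```python
# -*- coding: utf-8 -*-
data = b'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'

# Slice one byte at a time so each piece is a byte string on Python 2 and 3.
pieces = [data[i:i + 1] for i in range(len(data))]

# Collect the pieces that are not valid UTF-8 on their own.
undecodable = []
for piece in pieces:
    try:
        piece.decode('utf-8')
    except UnicodeDecodeError:
        undecodable.append(piece)

print(len(pieces), len(undecodable))
```

The ASCII letters decode fine by themselves, while both halves of every two-byte character fail, which matches the pairs of unreadable symbols in the output.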
Answered by Artyom
print i,
adds a "blank character" after each byte and breaks correct UTF-8 sequences into incorrect ones. So this will not work unless you write each single byte to the output without a separator:
sys.stdout.write(i)
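The effect of the separator can be checked without a terminal. This is a minimal sketch, assuming the same sample bytes as in the pprint output: joining the single bytes with a space imitates the trailing-comma print, and joining them with nothing imitates sys.stdout.write. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
data = b'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'

# One-byte slices, as produced by iterating over a Python 2 byte string.
single_bytes = [data[i:i + 1] for i in range(len(data))]

# Imitates "print i,": a space lands between the two bytes of each character.
spaced = b' '.join(single_bytes)
# Imitates sys.stdout.write(i): the bytes stay adjacent.
unspaced = b''.join(single_bytes)

try:
    spaced.decode('utf-8')
    spaced_ok = True
except UnicodeDecodeError:
    spaced_ok = False

roundtrip_ok = unspaced.decode('utf-8') == data.decode('utf-8')
print(spaced_ok, roundtrip_ok)
```

Writing the bytes back to back leaves the multibyte sequences intact, so the receiving side can still decode them as UTF-8.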
Answered by j1k00
One may want to just use
f = open('in.txt','r')
for line in f:
    print line
    for i in line.decode('utf-8'):
        print i,
f.close()