UTF-8 problem in Python when reading characters

Question

Asked by jacob

I'm using Python 2.5. What is going on here? What have I misunderstood? How can I fix it?

in.txt:

Stäckövérfløw

code.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
print """Content-Type: text/plain; charset="UTF-8"\n"""
f = open('in.txt','r')
for line in f:
    print line
    for i in line:
        print i,
f.close()

output:

Stäckövérfløw

S t ? ? c k ? ? v ? ? r f l ? ? w

Answer 1

Answered by Miles

for i in line:
    print i,

When you read the file, the string you get back is a string of bytes, and the for loop iterates over it one byte at a time. That causes problems with UTF-8 encoded text, where each non-ASCII character is represented by multiple bytes. If you want to work with Unicode objects, where characters are the basic units, open the file with codecs.open:

import codecs
f = codecs.open('in.txt', 'r', 'utf8')
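
To make the difference concrete, here is a minimal sketch that reads the same file both ways. It assumes Python 2.6+ or Python 3 (for the `b'...'` literal and the `with` statement), and the temporary file and the `SAMPLE` constant are stand-ins for the asker's in.txt, not part of the original code.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical sample data: the UTF-8 bytes of the word from the question.
SAMPLE = b'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w\n'

# Write the sample to a temporary file so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, 'wb') as out:
    out.write(SAMPLE)

# Binary mode: the line is a byte string, so its length counts bytes.
with open(path, 'rb') as f:
    byte_line = f.readline()

# codecs.open decodes while reading, so the line is Unicode text.
f = codecs.open(path, 'r', 'utf8')
try:
    text_line = f.readline()
finally:
    f.close()
os.remove(path)

byte_count = len(byte_line.rstrip())   # 17: each accented letter takes two bytes
char_count = len(text_line.rstrip())   # 13: one item per character
```

Iterating over `text_line` therefore yields whole characters, while iterating over `byte_line` yields the individual bytes that make them up.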

If sys.stdout doesn't already have the appropriate encoding set, you may have to wrap it:

sys.stdout = codecs.getwriter('utf8')(sys.stdout)
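
As a rough illustration of what that wrapper does, the sketch below wraps an in-memory byte stream instead of the real sys.stdout, so the result can be inspected. The `io.BytesIO` stand-in and the variable names are assumptions for the example; on Python 3, sys.stdout is already a text stream, so the wrapping above applies to Python 2.

```python
# -*- coding: utf-8 -*-
import codecs
import io

text = u'St\xe4ck\xf6v\xe9rfl\xf8w'   # Unicode text, e.g. a decoded line

# io.BytesIO stands in for a byte-oriented stream such as Python 2's sys.stdout.
byte_stream = io.BytesIO()
writer = codecs.getwriter('utf8')(byte_stream)

# The writer encodes Unicode text to UTF-8 before passing it to the stream.
writer.write(text)
encoded = byte_stream.getvalue()
```

After the write, `encoded` holds the UTF-8 bytes of the text, which is what a terminal or browser expecting UTF-8 needs to receive.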

Answer 2

Answered by mhawke

Use codecs.open instead; it works for me.

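#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs  # needed for codecs.open

print """Content-Type: text/plain; charset="UTF-8"\n"""
# Decode the file as UTF-8 so each line is a Unicode string
f = codecs.open('in.txt','r','utf8')
for line in f:
    print line
    for i in line:  # iterates over characters, not bytes
        print i,
f.close()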

Answer 3

Answered by mikl

Check this out:

# -*- coding: utf-8 -*-
import pprint
f = open('unicode.txt','r')
for line in f:
    print line
    pprint.pprint(line)
    for i in line:
        print i,
f.close()

It returns this:

Stäckövérfløw
'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'
S t ? ? c k ? ? v ? ? r f l ? ? w

The thing is that the file is just being read as a string of bytes. Iterating over them splits the multibyte characters into nonsensical byte values.
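
The split can be checked directly. The following sketch, assuming Python 2.6+ or Python 3, takes the byte string from the pprint output, counts how many bytes each character occupies, and confirms that a lone lead byte does not decode on its own. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# The byte string reported by pprint for the sample line.
raw = b'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'
text = raw.decode('utf-8')

# Pair each character with the number of UTF-8 bytes that encode it.
widths = [(ch, len(ch.encode('utf-8'))) for ch in text]
multibyte = [ch for ch, n in widths if n > 1]   # the four accented letters

# A lone lead byte such as \xc3 is not valid UTF-8 by itself.
try:
    raw[2:3].decode('utf-8')
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False
```

Each of the four accented letters takes two bytes, which matches the pairs of unprintable values in the per-byte output.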

Answer 4

Answered by Artyom

print i,

The trailing comma makes print add a blank character (a space) after each byte, which breaks valid UTF-8 sequences into invalid ones. So this will not work unless you write each single byte to the output directly:

sys.stdout.write(i)
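
A small sketch of that effect, assuming Python 2.6+ or Python 3: joining the bytes with spaces mimics what `print i,` emits, and joining them with nothing mimics `sys.stdout.write(i)`. The helper `is_valid_utf8` is a hypothetical name introduced only for this example.

```python
# -*- coding: utf-8 -*-
raw = b'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'

# Slice one byte at a time so the pieces are byte strings on Python 2 and 3.
pieces = [raw[i:i + 1] for i in range(len(raw))]

spaced = b' '.join(pieces)   # roughly what "print i," sends to the output
joined = b''.join(pieces)    # roughly what sys.stdout.write(i) sends


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


spaced_ok = is_valid_utf8(spaced)   # False: a space follows each lead byte
joined_ok = is_valid_utf8(joined)   # True: the original sequence is intact
```

Writing the bytes back to back keeps the output valid, but the loop is still working on bytes rather than characters.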

Answer 5

Answered by j1k00

One may want to just decode each line before iterating over it:

f = open('in.txt','r')
for line in f:
    print line
    for i in line.decode('utf-8'):
        print i,
f.close()
