Python: Unicode problem

Question

Asked by Oleg Tarasenko

I am trying to decode a string I took from file:

file = open ("./Downloads/lamp-post.csv", 'r')
data = file.readlines()
data[0]

'\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00\t\x00C\x00o\x00m\x00p\x00e\x00t\x00i\x00t\x00i\x00o\x00n\x00\t\x00G\x00l\x00o\x00b\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\t\x00D\x00e\x00c\x00 \x002\x000\x001\x000\x00\t\x00N\x00o\x00v\x00 \x002\x000\x001\x000\x00\t\x00O\x00c\x00t\x00 \x002\x000\x001\x000\x00\t\x00S\x00e\x00p\x00 \x002\x000\x001\x000\x00\t\x00A\x00u\x00g\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00l\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00n\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00y\x00 \x002\x000\x001\x000\x00\t\x00A\x00p\x00r\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00r\x00 \x002\x000\x001\x000\x00\t\x00F\x00e\x00b\x00 \x002\x000\x001\x000\x00\t\x00J\x00a\x00n\x00 \x002\x000\x001\x000\x00\t\x00A\x00d\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00S\x00e\x00a\x00r\x00c\x00h\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00E\x00s\x00t\x00i\x00m\x00a\x00t\x00e\x00d\x00 \x00A\x00v\x00g\x00.\x00 \x00C\x00P\x00C\x00\t\x00E\x00x\x00t\x00r\x00a\x00c\x00t\x00e\x00d\x00 \x00F\x00r\x00o\x00m\x00 \x00W\x00e\x00b\x00 \x00P\x00a\x00g\x00e\x00\t\x00L\x00o\x00c\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\n'

Adding ignore does not really help:

In [69]: data[2]
Out[69]: u'\u6700\u6100\u7200\u6400\u6500\u6e00\u2000\u6c00\u6100\u6d00\u7000\u2000\u7000\u6f00\u7300\u7400\u0900\u3000\u2e00\u3900\u3400\u0900\u3800\u3800\u3000\u0900\u2d00\u0900\u3300\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3900\u3000\u0900\u3400\u3800\u3000\u0900\u3500\u3900\u3000\u0900\u3500\u3900\u3000\u0900\u3700\u3200\u3000\u0900\u3700\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3200\u3000\u0900\u3200\u3600\u3000\u0900\u2d00\u0900\u2d00\u0900\ua300\u3200\u2e00\u3100\u3800\u0900\u2d00\u0900\u3400\u3800\u3000\u0a00'
In [70]: data[2].decode("utf-8", "replace")
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
/Users/oleg/ in ()
/opt/local/lib/python2.5/encodings/utf_8.py in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-87: ordinal not in range(128)

Answer 1

Accepted answer by Sven Marnach

This looks like UTF-16 data, so try:

data[0].rstrip("\n").decode("utf-16")

Edit (for your update): Try to decode the whole file at once, that is:

data = open(...).read()
data.decode("utf-16")

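The problem is that a line break in this little-endian UTF-16 data is encoded as "\n\x00", but readlines() splits at the "\n" byte, leaving the "\x00" byte at the start of the next line. Every later line is then shifted by one byte and decodes to the wrong characters.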
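
The misalignment can be reproduced without the original file. The sketch below builds a hypothetical two-line sample (the second line, "garden lamp post" followed by 0.94, is inferred from the garbled data[2] in the question) and is written to run on Python 3, whereas the thread itself uses Python 2. The variable names are illustrative only.

```python
import codecs
import io

# Hypothetical two-line sample mirroring the question's file: UTF-16-LE with a BOM.
text = u"Keyword\tCompetition\ngarden lamp post\t0.94\n"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Reading in byte mode splits after every b"\n", as readlines() did in the question.
chunks = io.BytesIO(raw).readlines()

# The first chunk ends with a lone b"\n" (odd byte count), hence the rstrip.
first = chunks[0].rstrip(b"\n").decode("utf-16")

# The second chunk starts with the stray b"\x00", so every byte pair is shifted.
shifted = chunks[1].decode("utf-16-le")

# Decoding everything first and splitting afterwards keeps the byte pairs aligned.
lines = raw.decode("utf-16").splitlines()

print(repr(first))
print(repr(shifted[:6]))
print(lines)
```

Here shifted begins with the same \u6700\u6100... sequence shown in the question, which is "garden" with each byte pair swapped into the wrong position.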

Answer 2

Answer by orlp

EDIT

Since you posted 2.7, this is the 2.7 solution:

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "replace") for line in file]

Ignoring undecodable characters:

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "ignore") for line in file]
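
Note that iterating over a file opened in plain "r" mode under Python 2 yields byte strings split at "\n", so decoding line by line may run into the misalignment described in the accepted answer. One possible alternative, offered as a sketch rather than as orlp's method, is to let io.open (available in Python 2.6+ and Python 3) decode the stream before it is split into lines. The sample file and its contents below are hypothetical stand-ins for lamp-post.csv.

```python
import codecs
import io
import os
import tempfile

# Hypothetical stand-in for lamp-post.csv: UTF-16-LE with a BOM, two lines.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with io.open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE)
    f.write(u"Keyword\tCompetition\ngarden lamp post\t0.94\n".encode("utf-16-le"))

# Text mode decodes first and splits into lines afterwards.
# errors="replace" substitutes U+FFFD for undecodable input; "ignore" drops it.
with io.open(path, "r", encoding="utf-16", errors="replace") as f:
    data = [line.rstrip(u"\n") for line in f]

print(data)
```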

Answer 3

Answer by tzot

This file is a UTF-16-LE encoded file, with an initial BOM.

import codecs

fp = codecs.open("a", "r", "utf-16")
lines = fp.readlines()
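
The byte order can be read off the first two bytes of data[0] in the question: '\xff\xfe' is the little-endian UTF-16 byte order mark. A minimal sketch of that check follows; the helper name guess_utf16_byte_order is hypothetical, and the code is written for Python 3 bytes.

```python
import codecs

# First bytes of data[0] as shown in the question.
first_bytes = b"\xff\xfeK\x00e\x00y\x00"


def guess_utf16_byte_order(prefix):
    """Return a codec name based on a leading UTF-16 BOM, or None if absent.

    Caveat: the UTF-32-LE BOM also starts with FF FE (followed by 00 00),
    so this check alone cannot tell UTF-32-LE from UTF-16-LE.
    """
    if prefix.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if prefix.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


encoding = guess_utf16_byte_order(first_bytes)
# Skip the two BOM bytes before decoding with the explicit byte order.
decoded = first_bytes[2:].decode(encoding)

print(encoding, decoded)
```

When the plain "utf-16" codec is used, as in the snippet above, the decoder consumes the BOM itself, so this manual check is only needed for inspection.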
