Original source: http://stackoverflow.com/questions/19328874/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Python - read text file with weird utf-16 format
Asked by DanHickstein
I'm trying to read a text file into python, but it seems to use some very strange encoding. I try the usual:
file = open('data.txt','r')
lines = file.readlines()
for line in lines[0:1]:
    print line,
    print line.split()
Output:
0.0200197 1.97691e-005
['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']
Printing the line works fine, but after I try to split the line so that I can convert it into a float, it looks crazy. Of course, when I try to convert those strings to floats, this produces an error. Any idea about how I can convert these back into numbers?
I put the sample datafile here if you would like to try to load it: https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt
I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they also do not want to deal with this crazy file.
Accepted answer by abarnert
I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.
In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
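To see where the null bytes come from, here is a minimal sketch in Python 3 (the post itself uses Python 2). The sample string is taken from the question's output and the variable names are only illustrative. It encodes an ASCII-only string as UTF-16-LE and checks that every second byte is '\x00'.

```python
# Encode an ASCII-only string as UTF-16-LE and inspect the raw bytes.
text = "0.0200197"
raw = text.encode("utf-16-le")

print(raw)                  # each ASCII byte is followed by a null byte
print(len(text), len(raw))  # number of characters, number of bytes

# Every second byte is '\x00'; the others are the plain ASCII bytes.
assert raw[1::2] == b"\x00" * len(text)
assert raw[0::2] == text.encode("ascii")

# Decoding with the matching codec restores the original text.
assert raw.decode("utf-16-le") == text
```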
To fix this, just decode the data:
print line.decode('utf-16-le').split()
Or do the same thing at the file level with the io or codecs module:
import io
file = io.open('data.txt','r', encoding='utf-16-le')
* This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably didn't care about these details.
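If you do care about the footnote, a small Python 3 sketch can make it concrete. The two characters below are arbitrary picks for illustration: one inside the BMP and one outside it. The second one is encoded as a surrogate pair, so it takes four bytes instead of two.

```python
# Compare the UTF-16-LE size of a BMP character and a non-BMP character.
bmp_char = "\u00e9"          # U+00E9, inside the Basic Multilingual Plane
non_bmp_char = "\U0001F600"  # U+1F600, outside the BMP

bmp_bytes = bmp_char.encode("utf-16-le")
non_bmp_bytes = non_bmp_char.encode("utf-16-le")

print(len(bmp_bytes))      # a single 16-bit code unit
print(len(non_bmp_bytes))  # a surrogate pair: two 16-bit code units

# Split the four bytes into the two surrogate code units.
high = int.from_bytes(non_bmp_bytes[0:2], "little")
low = int.from_bytes(non_bmp_bytes[2:4], "little")
print(hex(high), hex(low))
```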
Answer by Peter DeGlopper
Looks like UTF-16 to me.
>>> test_utf16 = '0\x00.\x000\x002\x000\x000\x001\x009\x007\x00'
>>> test_utf16.decode('utf-16')
u'0.0200197'
You can work directly off the Unicode strings:
>>> float(test_utf16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: null byte in argument for float()
>>> float(test_utf16.decode('utf-16'))
0.020019700000000001
Or encode them to something different, if you prefer:
>>> float(test_utf16.decode('utf-16').encode('ascii'))
0.020019700000000001
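The session above is Python 2. A rough Python 3 equivalent, as a sketch, might look like the following. It uses the explicit 'utf-16-le' codec rather than 'utf-16', on the assumption that the data has no byte order mark, so the result does not depend on the machine's native byte order.

```python
# The raw UTF-16-LE bytes from the example above.
test_utf16 = b"0\x00.\x000\x002\x000\x000\x001\x009\x007\x00"

# float() rejects the undecoded bytes because of the embedded nulls.
try:
    float(test_utf16)
    raw_ok = True
except ValueError:
    raw_ok = False

# Decoding first gives an ordinary str that float() can parse.
value = float(test_utf16.decode("utf-16-le"))
print(raw_ok, value)
```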
Note that you need to do this as early as possible in your processing. As your comment noted, split will behave incorrectly on the utf-16 encoded form. The utf-16 representation of the space character ' ' is ' \x00', so split removes the whitespace but leaves the null byte.
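The split problem can be reproduced with a short Python 3 sketch. The line content is the one from the question; the variable names are made up for the example. Splitting the undecoded bytes leaves a stray null byte at the start of the second field, while decoding first gives clean fields.

```python
# A line of two numbers, encoded as UTF-16-LE (as in the question).
raw_line = "0.0200197 1.97691e-005".encode("utf-16-le")

# Splitting the undecoded bytes: the space byte is removed, but the
# null byte that belonged to the space stays glued to the next field.
raw_fields = raw_line.split()
print(raw_fields)

# Decoding first, then splitting, gives clean fields.
fields = raw_line.decode("utf-16-le").split()
print(fields)
numbers = [float(f) for f in fields]
```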
The io library, available in Python 2.6 and later, can handle this for you, as can the older codecs library. io handles linefeeds better, so it's preferable if available.
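One way to see the linefeed difference is the sketch below, written for Python 3. The temporary file and its contents are invented for the example. With default settings, io.open translates '\r\n' to '\n' while reading, whereas codecs.open works on the file in binary mode and leaves the line endings as they are. Newer Python versions steer you toward the built-in open(), so treat the codecs.open call as illustration only.

```python
import codecs
import io
import os
import tempfile

# Write a small UTF-16-LE file with Windows-style line endings.
path = os.path.join(tempfile.mkdtemp(), "data.txt")  # hypothetical file
with open(path, "wb") as f:
    f.write("1.0 2.0\r\n3.0 4.0\r\n".encode("utf-16-le"))

# io.open decodes and, by default, translates '\r\n' to '\n'.
with io.open(path, "r", encoding="utf-16-le") as f:
    io_text = f.read()

# codecs.open decodes but leaves the line endings untouched.
with codecs.open(path, "r", encoding="utf-16-le") as f:
    codecs_text = f.read()

print(repr(io_text))
print(repr(codecs_text))
```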
Answer by DanHickstein
This is really just @abarnert's suggestion, but I wanted to post it as an answer since this is the simplest solution and the one that I ended up using:
import io
import numpy as np
file = io.open(filename,'r',encoding='utf-16-le')
data = np.loadtxt(file,skiprows=8)
This demonstrates how you can create a file object with io.open using whatever crazy encoding your file happens to have, and then pass that file object to np.loadtxt (or np.genfromtxt) for quick-and-easy loading.
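For anyone who wants to try the idea without the original data file, here is a self-contained Python 3 sketch. It is only an approximation: the file contents, the two header lines, and the hand-written parsing loop are all assumptions made for the example, and the loop stands in for np.loadtxt so that nothing beyond the standard library is needed.

```python
import io
import os
import tempfile

# Build a small UTF-16-LE file: two header lines, then two data rows.
# The file name and contents are made up for this example.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
content = "header line 1\nheader line 2\n0.02 1.5\n0.04 2.5\n"
with open(path, "wb") as f:
    f.write(content.encode("utf-16-le"))

skiprows = 2  # header lines to skip, playing the role of skiprows above
with io.open(path, "r", encoding="utf-16-le") as f:
    lines = f.readlines()[skiprows:]

# Parse each remaining line into a list of floats.
data = [[float(x) for x in line.split()] for line in lines]
print(data)
```

The key point is the same as in the answer: once the file object is opened with the right encoding, everything downstream sees ordinary text.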
Answer by oliver smith
This piece of code will do what is needed:
file_handle=open(file_name,'rb')
file_first_line=file_handle.readline()
file_handle.close()
print file_first_line
if '\x00' in file_first_line:
    file_first_line=file_first_line.replace('\x00','')
    print file_first_line
If you call file_first_line.split() before replacing, the output contains '\x00'. I just tried replacing '\x00' with an empty string and it worked.
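A caveat worth checking before relying on this: stripping '\x00' only gives the right result when every character is ASCII. The Python 3 sketch below uses U+0100 as an arbitrary non-ASCII example (not something from the original data) to show how the replace approach can damage such a character, while proper decoding handles both cases.

```python
# ASCII-only data: stripping nulls happens to give the right answer.
ascii_raw = "0.0200197".encode("utf-16-le")
stripped = ascii_raw.replace(b"\x00", b"")
print(stripped.decode("ascii"))

# Non-ASCII data: U+0100 is encoded as the bytes 00 01 in UTF-16-LE,
# so stripping nulls destroys part of the character.
other_raw = "\u0100".encode("utf-16-le")
damaged = other_raw.replace(b"\x00", b"")
print(other_raw, damaged)

# Proper decoding recovers the text in both cases.
assert ascii_raw.decode("utf-16-le") == "0.0200197"
assert other_raw.decode("utf-16-le") == "\u0100"
```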