Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors. Original question: http://stackoverflow.com/questions/19328874/

Python - read text file with weird utf-16 format

Tags: python, numpy, encoding, utf-16le

Asked by DanHickstein

I'm trying to read a text file into python, but it seems to use some very strange encoding. I try the usual:

file = open('data.txt','r')
lines = file.readlines()
for line in lines[0:1]:
    print line,
    print line.split()

Output:

0.0200197   1.97691e-005

['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']

Printing the line works fine, but after I try to split the line so that I can convert it into a float, it looks crazy. Of course, when I try to convert those strings to floats, this produces an error. Any idea about how I can convert these back into numbers?

I put the sample datafile here if you would like to try to load it: https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt

I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they also do not want to deal with this crazy file.

Accepted answer by abarnert

I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.

In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
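To make the byte layout concrete, here is a minimal Python 3 sketch. The sample string is an illustrative value, not taken from the data file.

```python
# Python 3 sketch: how ASCII-only text looks once encoded as UTF-16-LE.
sample = "0.02"  # illustrative value, not from the original data file

ascii_bytes = sample.encode("ascii")
utf16_bytes = sample.encode("utf-16-le")

print(ascii_bytes)   # b'0.02'
print(utf16_bytes)   # b'0\x00.\x000\x002\x00'

# Every second byte of the UTF-16-LE form is the original ASCII byte.
print(utf16_bytes[::2] == ascii_bytes)  # True
```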

To fix this, just decode the data:

print line.decode('utf-16-le').split()

Or do the same thing at the file level with the io or codecs module:

import io
file = io.open('data.txt','r', encoding='utf-16-le')
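For readers on Python 3, where the built-in open is the same function as io.open, the file-level approach might look like the sketch below. The temporary file and its single line of content are assumptions made up for illustration; they stand in for the real data.txt.

```python
import os
import tempfile

# Create a small UTF-16-LE file to read back (made-up sample content).
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write("0.0200197\t1.97691e-005\n".encode("utf-16-le"))
    path = tmp.name

try:
    # Decoding happens inside the file object, so each line is already text.
    with open(path, "r", encoding="utf-16-le") as fh:
        fields = fh.readline().split()
finally:
    os.remove(path)

values = [float(field) for field in fields]
print(fields)  # ['0.0200197', '1.97691e-005']
```

Because the decoding is done by the file object, split() and float() only ever see ordinary text.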


* This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably didn't care about these details.
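The footnote can be checked with a short Python 3 sketch. The two characters are arbitrary examples chosen for illustration: one inside the BMP and one outside it.

```python
# Python 3 sketch: byte counts for BMP and non-BMP characters in UTF-16-LE.
bmp_char = "A"               # U+0041, inside the BMP
non_bmp_char = "\U0001F600"  # U+1F600, outside the BMP

bmp_len = len(bmp_char.encode("utf-16-le"))
non_bmp_len = len(non_bmp_char.encode("utf-16-le"))

print(bmp_len)      # 2: one 16-bit code unit
print(non_bmp_len)  # 4: a surrogate pair, two 16-bit code units
```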

Answer by Peter DeGlopper

Looks like UTF-16 to me.

>>> test_utf16 = '0\x00.\x000\x002\x000\x000\x001\x009\x007\x00'
>>> test_utf16.decode('utf-16')
u'0.0200197'

You can work directly off the Unicode strings:

>>> float(test_utf16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: null byte in argument for float()
>>> float(test_utf16.decode('utf-16'))
0.020019700000000001
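The session above is Python 2. A rough Python 3 equivalent, where the raw data is a bytes object, might look like this sketch. It uses 'utf-16-le' explicitly rather than 'utf-16', on the assumption that the data carries no byte order mark; without one, the plain 'utf-16' codec falls back to the machine's native byte order.

```python
# Python 3 sketch: the raw bytes cannot be parsed until they are decoded.
raw = b"0\x00.\x000\x002\x000\x000\x001\x009\x007\x00"

try:
    float(raw)
    raw_parsed = True
except ValueError:
    # The embedded null bytes make the raw form unparseable.
    raw_parsed = False

value = float(raw.decode("utf-16-le"))
print(raw_parsed)  # False
print(value)       # 0.0200197
```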

Or encode them to something different, if you prefer:

>>> float(test_utf16.decode('utf-16').encode('ascii'))
0.020019700000000001

Note that you need to do this as early as possible in your processing. As your comment noted, split will behave incorrectly on the utf-16 encoded form. The utf-16 representation of the space character ' ' is ' \x00', so split removes the whitespace but leaves the null byte.
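The split problem can be reproduced in Python 3 with a bytes object, as in this sketch. The sample line is made up to resemble the first row of the question's output.

```python
# Python 3 sketch: splitting before decoding versus after decoding.
raw_line = "0.0200197 1.97691e-005".encode("utf-16-le")

# Splitting the raw bytes removes the space byte but keeps its '\x00' partner,
# which ends up at the front of the second field.
broken = raw_line.split()

# Decoding first gives clean text fields.
fixed = raw_line.decode("utf-16-le").split()

print(broken[1][:3])  # b'\x001\x00'
print(fixed)          # ['0.0200197', '1.97691e-005']
```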

The io library, available in Python 2.6 and later, can handle this for you, as can the older codecs library. io handles linefeeds better, so it's preferable if available.

Answer by DanHickstein

This is really just @abarnert's suggestion, but I wanted to post it as an answer since this is the simplest solution and the one that I ended up using:

    import io
    import numpy as np
    file = io.open(filename,'r',encoding='utf-16-le')
    data = np.loadtxt(file,skiprows=8)

This demonstrates how you can create a file object using io.open using whatever crazy encoding your file happens to have, and then pass that file object to np.loadtxt (or np.genfromtxt) for quick-and-easy loading.
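A self-contained Python 3 version of this idea is sketched below. It assumes a reasonably recent NumPy that accepts text-mode file objects in np.loadtxt. The file content, including its single header line, is made up for illustration, so skiprows is 1 here rather than 8.

```python
import os
import tempfile

import numpy as np

# Made-up sample: one header line followed by two rows of two columns.
text = "header line\n0.0200197\t1.97691e-05\n0.0400394\t2.5e-05\n"

with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(text.encode("utf-16-le"))
    path = tmp.name

try:
    # The file object decodes UTF-16-LE; loadtxt only sees ordinary text lines.
    with open(path, "r", encoding="utf-16-le") as fh:
        data = np.loadtxt(fh, skiprows=1)
finally:
    os.remove(path)

print(data.shape)  # (2, 2)
```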

Answer by oliver smith

This piece of code will do what is needed:

file_handle=open(file_name,'rb')
file_first_line=file_handle.readline()
file_handle.close()
print file_first_line
if '\x00' in file_first_line:
    file_first_line=file_first_line.replace('\x00','')
    print file_first_line

If you call file_first_line.split() before replacing, the output contains '\x00'. I tried replacing '\x00' with an empty string and it worked.
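One caveat worth keeping in mind: stripping null bytes only recovers the text when every character is ASCII. The Python 3 sketch below illustrates this with made-up sample values; the non-ASCII character is an arbitrary example.

```python
# Python 3 sketch: stripping '\x00' works for ASCII-only content...
ascii_line = "0.02 1.9".encode("utf-16-le")
stripped = ascii_line.replace(b"\x00", b"")
print(stripped)  # b'0.02 1.9'

# ...but damages characters whose UTF-16 code unit has a non-zero high byte
# and a zero low byte, such as U+0100.
non_ascii = "\u0100".encode("utf-16-le")
damaged = non_ascii.replace(b"\x00", b"")
print(non_ascii)  # b'\x00\x01'
print(damaged)    # b'\x01'

# Decoding instead of stripping keeps the character intact.
print(non_ascii.decode("utf-16-le") == "\u0100")  # True
```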
