Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/2342284/
Open a file in the proper encoding automatically
Asked by Khelben
I'm dealing with some encoding problems in a few files. We receive files in CSV format from another company and have to read them.
Strangely, the files appear to be encoded in UTF-16. I manage to read them, but I have to open them with the codecs module and specify the encoding explicitly, like this:
import codecs
import csv

ENCODING = 'utf-16'

with codecs.open(test_file, encoding=ENCODING) as csv_file:
    # Autodetect dialect
    dialect = csv.Sniffer().sniff(csv_file.read(1024))
    csv_file.seek(0)
    input_file = csv.reader(csv_file, dialect=dialect)
    for line in input_file:
        do_funny_things()
But, just as I can detect the dialect in an agnostic way, I think it would be great to have a way of automatically opening files with their proper encoding, at least for text files. Other programs, like vim, manage to do this.
Does anyone know a way of doing that in Python 2.6?
PS: I hope this will be solved in Python 3, as all strings there are Unicode...
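Since UTF-16 files written by most tools start with a byte-order mark (BOM), one stdlib-only workaround (a sketch, not part of the original question) is to sniff the first few bytes and pick the codec from the BOM:

```python
import codecs

# BOMs ordered longest-first, since the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
]

def sniff_encoding(path, default='utf-8'):
    """Guess a file's encoding from its BOM, falling back to `default`."""
    with open(path, 'rb') as f:
        prefix = f.read(4)
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name  # these BOM-aware codecs strip the BOM on read
    return default
```

Files without a BOM still need a heuristic guess, which is what detector libraries like chardet are for.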
Answered by Desintegr
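This answer recommended the chardet library (as jcdyer's answer below notes). A minimal usage sketch, assuming chardet is installed (`pip install chardet`):

```python
# chardet is a third-party encoding detector: pip install chardet
import chardet

def detect_encoding(path, sample_size=64 * 1024):
    """Return chardet's best guess for a file's encoding, or None."""
    with open(path, 'rb') as f:
        sample = f.read(sample_size)
    guess = chardet.detect(sample)
    # guess is a dict like {'encoding': ..., 'confidence': ...}
    return guess['encoding']
```

The confidence value in the result dict can be checked before trusting the guess.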
Answered by jcdyer
It won't be "fixed" in Python 3, as it's not a fixable problem. Many documents are valid in several encodings, so the only way to determine the proper encoding is to know something about the document. Fortunately, in most cases we do know something about the document: for instance, most characters will come clustered into distinct Unicode blocks. A document in English will mostly contain characters within the first 128 codepoints. A document in Russian will contain mostly Cyrillic codepoints. Most documents will contain spaces and newlines. These clues can be used to help you make educated guesses about which encoding is being used. Better yet, use a library written by someone who has already done the work (like chardet, mentioned in another answer by Desintegr).
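The "educated guess" idea can also be sketched with the stdlib alone, by trial-decoding against an ordered list of candidates (a crude illustration, not a replacement for chardet):

```python
def guess_encoding(data, candidates=('utf-8', 'utf-16', 'latin-1')):
    """Return the first candidate encoding that decodes `data` cleanly.

    Order matters: utf-8 is strict and rarely matches by accident,
    while latin-1 never fails, so it acts as a last resort.
    """
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None
```

Note that "decodes cleanly" is not the same as "decodes correctly": an even-length Latin-1 file also decodes as UTF-16, just into garbage, which is why statistical detectors exist.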
Answered by Mark Tolonen
csv.reader cannot handle Unicode strings in 2.x. See the bottom of the csv documentation and this question for ways to handle it.
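For reference, in Python 3 the csv module works on text streams directly, so decoding happens at the file layer and csv.reader only ever sees Unicode; a minimal sketch, using an in-memory stream instead of a real file:

```python
import csv
import io

# In Python 3 the file object decodes bytes to str, so you just pass
# encoding= (and newline='') to open(); csv.reader sees only text.
data = 'name,value\r\nspam,1\r\neggs,2\r\n'
with io.StringIO(data, newline='') as f:
    dialect = csv.Sniffer().sniff(f.read(1024))
    f.seek(0)
    rows = list(csv.reader(f, dialect=dialect))
```

With a real file the equivalent would be `open(path, newline='', encoding='utf-16')`.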
Answered by RdV
If it will be fixed in Python 3, it should also be fixed by using

from __future__ import unicode_literals
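A quick illustration of what that import changes (under Python 2; under Python 3 it is a no-op, since str is already Unicode):

```python
from __future__ import unicode_literals

# With the future import, a bare literal is a Unicode string even on
# Python 2, so it behaves like Python 3's str.
s = 'h\xe9llo'
print(isinstance(s, type(u'')))  # True on both 2.x (unicode) and 3.x (str)
```

Note this only changes how string literals are typed; it does not make csv.reader accept Unicode input on 2.x.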