Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/2342284/
Open a file in the proper encoding automatically
Asked by Khelben
I'm dealing with some encoding problems in a few files. We receive files in CSV format from another company and have to read them.
Strangely, the files appear to be encoded in UTF-16. I manage to read them, but I have to open them with the codecs module and specify the encoding explicitly, like this:
import codecs
import csv

ENCODING = 'utf-16'

with codecs.open(test_file, encoding=ENCODING) as csv_file:
    # Autodetect dialect
    dialect = csv.Sniffer().sniff(csv_file.read(1024))
    csv_file.seek(0)
    input_file = csv.reader(csv_file, dialect=dialect)
    for line in input_file:
        do_funny_things()
But, just as I can detect the dialect in an agnostic way, I think it would be great to have a way of automatically opening files with their proper encoding, at least for text files. Other programs, like vim, manage to do this.
Does anyone know a way of doing that in Python 2.6?
PS: I hope this will be solved in Python 3, as all strings there are Unicode...
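Since UTF-16 files written by most tools start with a byte-order mark (BOM), one stdlib-only workaround (a sketch, not part of the original question) is to sniff the first few bytes and pick the codec from the BOM:

```python
import codecs

# BOMs ordered longest-first, since the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
]

def sniff_encoding(path, default='utf-8'):
    """Guess a file's encoding from its BOM, falling back to `default`."""
    with open(path, 'rb') as f:
        prefix = f.read(4)
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name  # these BOM-aware codecs strip the BOM on read
    return default
```

Files without a BOM still need a heuristic guess, which is what detector libraries like chardet are for.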
Answered by Desintegr
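This answer recommended the chardet library (as jcdyer's answer below notes). A minimal usage sketch, assuming chardet is installed (`pip install chardet`):

```python
# chardet is a third-party encoding detector: pip install chardet
import chardet

def detect_encoding(path, sample_size=64 * 1024):
    """Return chardet's best guess for a file's encoding, or None."""
    with open(path, 'rb') as f:
        sample = f.read(sample_size)
    guess = chardet.detect(sample)
    # guess is a dict like {'encoding': ..., 'confidence': ...}
    return guess['encoding']
```

The confidence value in the result dict can be checked before trusting the guess.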
Answered by jcdyer
It won't be "fixed" in Python 3, as it's not a fixable problem. Many documents are valid in several encodings, so the only way to determine the proper encoding is to know something about the document. Fortunately, in most cases we do know something about the document: for instance, most characters will come clustered into distinct Unicode blocks. A document in English will mostly contain characters within the first 128 codepoints. A document in Russian will contain mostly Cyrillic codepoints. Most documents will contain spaces and newlines. These clues can be used to help you make educated guesses about which encoding is being used. Better yet, use a library written by someone who has already done the work (like chardet, mentioned in another answer by Desintegr).
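The "educated guess" idea can also be sketched with the stdlib alone, by trial-decoding against an ordered list of candidates (a crude illustration, not a replacement for chardet):

```python
def guess_encoding(data, candidates=('utf-8', 'utf-16', 'latin-1')):
    """Return the first candidate encoding that decodes `data` cleanly.

    Order matters: utf-8 is strict and rarely matches by accident,
    while latin-1 never fails, so it acts as a last resort.
    """
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None
```

Note that "decodes cleanly" is not the same as "decodes correctly": an even-length Latin-1 file also decodes as UTF-16, just into garbage, which is why statistical detectors exist.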
Answered by Mark Tolonen
csv.reader cannot handle Unicode strings in 2.x. See the bottom of the csv documentation and this question for ways to handle it.
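For reference, in Python 3 the csv module works on text streams directly, so decoding happens at the file layer and csv.reader only ever sees Unicode; a minimal sketch, using an in-memory stream instead of a real file:

```python
import csv
import io

# In Python 3 the file object decodes bytes to str, so you just pass
# encoding= (and newline='') to open(); csv.reader sees only text.
data = 'name,value\r\nspam,1\r\neggs,2\r\n'
with io.StringIO(data, newline='') as f:
    dialect = csv.Sniffer().sniff(f.read(1024))
    f.seek(0)
    rows = list(csv.reader(f, dialect=dialect))
```

With a real file the equivalent would be `open(path, newline='', encoding='utf-16')`.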
Answered by RdV
If it will be fixed in Python 3, it should also be fixed by using

from __future__ import unicode_literals
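A quick illustration of what that import changes (under Python 2; under Python 3 it is a no-op, since str is already Unicode):

```python
from __future__ import unicode_literals

# With the future import, a bare literal is a Unicode string even on
# Python 2, so it behaves like Python 3's str.
s = 'h\xe9llo'
print(isinstance(s, type(u'')))  # True on both 2.x (unicode) and 3.x (str)
```

Note this only changes how string literals are typed; it does not make csv.reader accept Unicode input on 2.x.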