Disclaimer: this page is a translated copy of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/8058819/
Wanted Python to create a UTF-8 File, got an ANSI one. Why?
Asked by Metalcoder
I have the following function:
def storeTaggedCorpus(corpus, filename):
    corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
    for token in corpus:
        tagged_token = '/'.join(str for str in token)
        tagged_token = tagged_token.decode('ISO-8859-1')
        tagged_token = tagged_token.encode('utf-8')
        corpusFile.write(tagged_token)
        corpusFile.write(u"\n")
    corpusFile.close()
And when I execute it, I get the following error:
(...) in storeTaggedCorpus
    corpusFile.write(tagged_token)
  File "c:\Python26\lib\codecs.py", line 691, in write
    return self.writer.write(data)
  File "c:\Python26\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
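In Python 2, the codecs stream writer expects unicode objects; handed a byte str, it first decodes it with the default ASCII codec, and 0xc3 is the lead byte of most UTF-8-encoded accented Latin characters. That implicit failing step can be reproduced directly (shown here in Python 3 syntax, where the str/bytes split is explicit; the sample character is mine, not from the question):

```python
data = "ã".encode("utf-8")  # b'\xc3\xa3': UTF-8 bytes of a pt-br character
try:
    # this is the decode Python 2 performed implicitly before writing
    data.decode("ascii")
except UnicodeDecodeError as err:
    print(err)
```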
So I went to debug it, and discovered that the created file was encoded as ANSI, not UTF-8 as declared in corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8'). If the corpusFile.write(tagged_token) line is removed, this function will (obviously) work, and the file will be encoded as ANSI. If instead I remove tagged_token = tagged_token.encode('utf-8'), it will also work, BUT the resulting file will have encoding "ANSI as UTF-8" (???) and the Latin characters will be mangled. Since I'm analyzing pt-br text, this is unacceptable.
I believe that everything would work fine if corpusFile were opened as UTF-8, but I can't get it to work. I've searched the Web, but everything I found about Python/Unicode dealt with something else. So why does this file always end up as ANSI? I am using Python 2.6 on Windows 7 x64, and the file encodings were reported by Notepad++.
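The "mangled" Latin characters are the classic double-encoding symptom: if the tokens are already UTF-8 byte strings, decoding them as ISO-8859-1 and re-encoding as UTF-8 turns each accented character into two. A Python 3 sketch of that round trip, with a sample word of my own:

```python
word = "não"  # sample pt-br word (my own, not from the question)
utf8_bytes = word.encode("utf-8")  # b'n\xc3\xa3o'

# wrong codec: each UTF-8 byte becomes its own Latin-1 character
mangled = utf8_bytes.decode("iso-8859-1")
print(mangled)  # nÃ£o
```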
Edit: About the corpus parameter
I don't know the encoding of the corpus string. It was generated by the PlaintextCorpusReader.tag() method from NLTK. The original corpus file was encoded in UTF-8, according to Notepad++. The tagged_token.decode('ISO-8859-1') is just a guess. I've tried decoding it as cp1252, and got the same mangled characters as with ISO-8859-1.
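Getting identical results from both guesses is expected: cp1252 and ISO-8859-1 assign the same characters to every accented Latin letter, differing only in the 0x80-0x9F block (curly quotes, dashes, and so on), so for pt-br text either decode produces the same output. A minimal check:

```python
# 'não' as single-byte Latin text (the byte 0xE3 is 'ã')
data = b"n\xe3o"

# cp1252 and ISO-8859-1 agree on all accented Latin letters;
# they differ only in the 0x80-0x9F range
assert data.decode("cp1252") == data.decode("iso-8859-1") == "não"
print("identical under both codecs")
```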
Accepted answer by phihag
When you open the file with codecs.open(filename, 'w', encoding='utf8'), there is no point in writing byte strings (str objects) into the file. Instead, write unicode objects, like this:
corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
# ...
tagged_token = '\xdcml\xe4ut'
tagged_token = tagged_token.decode('ISO-8859-1')
corpusFile.write(tagged_token)
corpusFile.write(u'\n')
(Note: codecs.open always opens the underlying file in binary mode, so the u'\n' above is written as a single LF byte, without platform newline translation.)
Alternatively, open a binary file and write byte arrays of already-encoded strings:
corpusFile = open(filename, mode = 'wb')
# ...
tagged_token = '\xdcml\xe4ut'
tagged_token = tagged_token.decode('ISO-8859-1')
corpusFile.write(tagged_token.encode('utf-8'))
corpusFile.write('\n')
This will write platform-independent EOLs. If you want a platform-dependent EOL, write os.linesep instead of '\n'.
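For reference, the platform's native EOL string is exposed as os.linesep (os.sep, by contrast, is the path separator); the output depends on the OS:

```python
import os

# os.linesep: '\r\n' on Windows, '\n' on Unix-like systems
print(repr(os.linesep))
# os.sep is something else entirely: the path separator ('\\' or '/')
print(repr(os.sep))
```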
Note that the encoding naming in Notepad++ is misleading: "ANSI as UTF-8" is what you want.
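Notepad++'s "ANSI as UTF-8" label simply means UTF-8 without a byte-order mark. The difference between the two UTF-8 flavors is just three leading bytes, which Python's 'utf-8-sig' codec adds for you; a minimal check:

```python
import codecs

text = "Ümläut\n"
# 'utf-8-sig' is plain UTF-8 plus a 3-byte signature (BOM) at the start
assert text.encode("utf-8-sig") == codecs.BOM_UTF8 + text.encode("utf-8")
print(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'
```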
Answer by ekhumoro
Try writing the file with a UTF-8 signature (aka BOM):
def storeTaggedCorpus(corpus, filename):
    corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8-sig')
    for token in corpus:
        tagged_token = '/'.join(str for str in token)
        # print(type(tagged_token)); break
        # tagged_token = tagged_token.decode('cp1252')
        corpusFile.write(tagged_token)
        corpusFile.write(u"\n")
    corpusFile.close()
Note that this will only work properly if tagged_token is a unicode string. To check that, uncomment the first comment in the above code - it should print <type 'unicode'>.
If tagged_token is not a unicode string, then you will need to decode it first using the second commented line. (NB: I've assumed a "cp1252" encoding, but if you're certain it's "iso-8859-1", then of course you will need to change it.)
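As a side note, in Python 3 this whole class of error disappears: the built-in open() accepts an encoding argument and str is always unicode, so there is nothing to decode or encode by hand. A rough Python 3 sketch of the same function (assuming, as in the question, that corpus yields tuples of strings):

```python
def store_tagged_corpus(corpus, filename):
    # encoding='utf-8-sig' writes a BOM so Windows editors detect UTF-8
    with open(filename, "w", encoding="utf-8-sig") as corpus_file:
        for token in corpus:
            corpus_file.write("/".join(token))
            corpus_file.write("\n")
```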
Answer by John Machin
If you are seeing "mangled" characters from a file, you need to ensure that whatever you are using to view the file understands that the file is UTF-8-encoded.
The files created by this code:
import codecs
for enc in "utf-8 utf-8-sig".split():
    with codecs.open(enc + ".txt", mode = 'w', encoding = enc) as corpusFile:
        tagged_token = '\xdcml\xe4ut'
        tagged_token = tagged_token.decode('cp1252') # not 'ISO-8859-1'
        corpusFile.write(tagged_token) # write unicode objects
        corpusFile.write(u'\n')
are identified thusly:
Notepad++ (version 5.7 (UNICODE)) : UTF-8 without BOM, UTF-8
Firefox (7.0.1): Western(ISO-8859-1), Unicode (UTF-8)
Notepad (Windows 7): UTF-8, UTF-8
Putting a BOM in your UTF-8 file, while deprecated on Unix systems, gives you a much better chance on Windows that other software will be able to recognise your file as UTF-8-encoded.
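When reading such a file back, Python's 'utf-8-sig' codec is a safe choice either way: it strips the signature if present and behaves like plain UTF-8 if not. A minimal sketch:

```python
text = "Ümläut\n"

# encode with and without the UTF-8 signature (BOM)
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

# 'utf-8-sig' decodes both variants to the same string
assert with_bom.decode("utf-8-sig") == text
assert without_bom.decode("utf-8-sig") == text
print("both decode cleanly")
```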