Windows: wanted Python to create a UTF-8 file, got an ANSI one. Why?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/8058819/


Wanted Python to create a UTF-8 File, got an ANSI one. Why?

Tags: python, windows, unicode, utf-8, python-2.6

Asked by Metalcoder

I have the following function:

def storeTaggedCorpus(corpus, filename):
    corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
    for token in corpus:
        tagged_token = '/'.join(str for str in token)
        tagged_token = tagged_token.decode('ISO-8859-1')
        tagged_token = tagged_token.encode('utf-8')
        corpusFile.write(tagged_token)
        corpusFile.write(u"\n")
    corpusFile.close()

And when I execute it, I've got the following error:

(...) in storeTaggedCorpus
    corpusFile.write(tagged_token)
  File "c:\Python26\lib\codecs.py", line 691, in write
    return self.writer.write(data)
  File "c:\Python26\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
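The traceback looks odd because a write raised a *decode* error. In Python 2, handing a byte string to a codecs writer makes it implicitly decode those bytes with the default ascii codec before re-encoding them, and 0xc3 (the lead byte of a UTF-8 multi-byte sequence) is not ASCII. A minimal sketch of that failure mode, written so it also runs on Python 3:

```python
# Sketch: why the ascii codec chokes on UTF-8 bytes.
# tagged_token.encode('utf-8') produced bytes like these; Python 2's
# codecs writer then implicitly tried to decode them as ASCII.
data = u'Ümläut'.encode('utf-8')   # begins with 0xC3, as in the traceback
try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    print('failed as expected:', exc)
```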

So I went to debug it, and discovered that the created file was encoded as ANSI, not UTF-8 as declared in corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8'). If corpusFile.write(tagged_token) is removed, this function will (obviously) work, and the file will be encoded as ANSI. If instead I remove tagged_token = tagged_token.encode('utf-8'), it will also work, BUT the resulting file will have encoding "ANSI as UTF-8" (???) and the Latin characters will be mangled. Since I'm analyzing pt-br text, this is unacceptable.

I believe that everything would work fine if corpusFile were opened as UTF-8, but I can't get it to work. I've searched the Web, but everything I found about Python/Unicode dealt with something else... So why does this file always end up as ANSI? I am using Python 2.6 on Windows 7 x64, and those file encodings were reported by Notepad++.

Edit - About the corpus parameter

I don't know the encoding of the corpus string. It was generated by the PlaintextCorpusReader.tag() method, from NLTK. The original corpus file was encoded in UTF-8, according to Notepad++. The tagged_token.decode('ISO-8859-1') is just a guess. I've tried to decode it as cp1252, and got the same mangled characters as with ISO-8859-1.
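Incidentally, getting identical results from both guesses is expected: cp1252 and ISO-8859-1 map bytes 0xA0-0xFF to the same characters (they differ only in the 0x80-0x9F range), so accented Portuguese letters decode the same either way. A quick sketch:

```python
# Sketch: cp1252 and iso-8859-1 agree on the accented Latin range,
# so both guesses decode typical pt-br bytes to the same characters.
raw = b'\xe7\xe3o'  # the bytes for 'ção' in both encodings
assert raw.decode('cp1252') == raw.decode('iso-8859-1') == u'ção'
print(raw.decode('iso-8859-1'))
```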

Accepted answer by phihag

When you open the file with codecs.open(filename, 'w', encoding='utf-8'), there is no point in writing byte arrays (str objects) into the file. Instead, write unicode objects, like this:

corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
# ...
tagged_token = '\xdcml\xe4ut'
tagged_token = tagged_token.decode('ISO-8859-1')
corpusFile.write(tagged_token)
corpusFile.write(u'\n')

Note that codecs.open always opens the underlying file in binary mode, so the '\n' above is written as-is: no platform-dependent end-of-line translation takes place.

Alternatively, open a binary file and write byte arrays of already-encoded strings:

corpusFile = open(filename, mode = 'wb')
# ...
tagged_token = '\xdcml\xe4ut'
tagged_token = tagged_token.decode('ISO-8859-1')
corpusFile.write(tagged_token.encode('utf-8'))
corpusFile.write('\n')

This will write platform-independent EOLs. If you want a platform-dependent EOL, write os.linesep instead of '\n'.
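Either way, the bytes that reach the disk are the same UTF-8 sequence; a small check using the sample token from the snippets above:

```python
# Sketch: decode-then-codecs-write and encode-then-binary-write put
# identical UTF-8 bytes on disk for the answer's sample token.
token = b'\xdcml\xe4ut'                 # 'Ümläut' as ISO-8859-1 bytes
as_unicode = token.decode('iso-8859-1')
print(as_unicode.encode('utf-8'))       # the bytes either variant writes
assert as_unicode.encode('utf-8') == b'\xc3\x9cml\xc3\xa4ut'
```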

Note that the encoding naming in Notepad++ is misleading: "ANSI as UTF-8" (its label for UTF-8 without a BOM) is what you want.

Answered by ekhumoro

Try writing the file with a UTF-8 signature (aka BOM):

尝试使用 UTF-8 签名(又名 BOM)编写文件:

def storeTaggedCorpus(corpus, filename):
    corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8-sig')
    for token in corpus:
        tagged_token = '/'.join(str for str in token)
        # print(type(tagged_token)); break
        # tagged_token = tagged_token.decode('cp1252')
        corpusFile.write(tagged_token)
        corpusFile.write(u"\n")
    corpusFile.close()

Note that this will only work properly if tagged_token is a unicode string. To check that, uncomment the first commented line in the above code - it should print <type 'unicode'>.

If tagged_token is not a unicode string, then you will need to decode it first using the second commented line. (NB: I've assumed a "cp1252" encoding, but if you're certain it's "iso-8859-1", then of course you will need to change it.)
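As a sketch of what 'utf-8-sig' actually adds (the temporary file here is only for illustration): the codec prepends the 3-byte UTF-8 BOM, which is the signature Windows editors look for.

```python
import codecs
import os
import tempfile

# Sketch: 'utf-8-sig' prepends the UTF-8 BOM (EF BB BF) to the file,
# which is what lets Notepad and friends auto-detect the encoding.
fd, path = tempfile.mkstemp()
os.close(fd)
with codecs.open(path, mode='w', encoding='utf-8-sig') as corpusFile:
    corpusFile.write(u'ação\n')         # some pt-br text
with open(path, 'rb') as f:
    raw = f.read()
print(raw.startswith(codecs.BOM_UTF8))  # True
os.remove(path)
```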

Answered by John Machin

If you are seeing "mangled" characters from a file, you need to ensure that whatever you are using to view the file understands that the file is UTF-8-encoded.

The files created by this code:

import codecs
for enc in "utf-8 utf-8-sig".split():
    with codecs.open(enc + ".txt", mode = 'w', encoding = enc) as corpusFile:
        tagged_token = '\xdcml\xe4ut'
        tagged_token = tagged_token.decode('cp1252') # not 'ISO-8859-1'
        corpusFile.write(tagged_token) # write unicode objects
        corpusFile.write(u'\n')

are identified thusly:

Notepad++ (version 5.7 (UNICODE)) : UTF-8 without BOM, UTF-8
Firefox (7.0.1): Western(ISO-8859-1), Unicode (UTF-8)
Notepad (Windows 7): UTF-8, UTF-8

Putting a BOM in your UTF-8 file, while deprecated on Unix systems, gives you a much better chance on Windows that other software will be able to recognise your file as UTF-8-encoded.
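Conversely, when reading such a file back you have to cope with the optional BOM yourself. A hypothetical helper (the name read_utf8_text is my own, not from the answers) that tolerates either variant, relying on the fact that the 'utf-8-sig' codec strips a leading BOM and is otherwise identical to plain 'utf-8':

```python
def read_utf8_text(path):
    """Hypothetical helper: read a UTF-8 file, with or without a BOM."""
    with open(path, 'rb') as f:
        raw = f.read()
    # The 'utf-8-sig' codec drops a leading BOM (EF BB BF) if present
    # and behaves exactly like plain 'utf-8' if it is absent.
    return raw.decode('utf-8-sig')
```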
