Windows: wanted Python to create a UTF-8 file, got an ANSI one. Why?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/8058819/


Wanted Python to create a UTF-8 File, got an ANSI one. Why?

Tags: python, windows, unicode, utf-8, python-2.6

Asked by Metalcoder

I have the following function:

def storeTaggedCorpus(corpus, filename):
    corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
    for token in corpus:
        tagged_token = '/'.join(str for str in token)
        tagged_token = tagged_token.decode('ISO-8859-1')
        tagged_token = tagged_token.encode('utf-8')
        corpusFile.write(tagged_token)
        corpusFile.write(u"\n")
    corpusFile.close()

And when I execute it, I've got the following error:

(...) in storeTaggedCorpus
    corpusFile.write(tagged_token)
  File "c:\Python26\lib\codecs.py", line 691, in write
    return self.writer.write(data)
  File "c:\Python26\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
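The traceback looks odd because a write raised a *decode* error. In Python 2, handing a byte string to a codecs writer makes it implicitly decode those bytes with the default ascii codec before re-encoding them, and 0xc3 (the lead byte of a UTF-8 multi-byte sequence) is not ASCII. A minimal sketch of that failure mode, written so it also runs on Python 3:

```python
# Sketch: why the ascii codec chokes on UTF-8 bytes.
# tagged_token.encode('utf-8') produced bytes like these; Python 2's
# codecs writer then implicitly tried to decode them as ASCII.
data = u'Ümläut'.encode('utf-8')   # begins with 0xC3, as in the traceback
try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    print('failed as expected:', exc)
```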

So I went to debug it, and discovered that the created file was encoded as ANSI, not UTF-8 as declared in corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8'). If corpusFile.write(tagged_token) is removed, this function will (obviously) work, and the file will be encoded as ANSI. If instead I remove tagged_token = tagged_token.encode('utf-8'), it will also work, BUT the resulting file will have encoding "ANSI as UTF-8" (???) and the Latin characters will be mangled. Since I'm analyzing pt-br text, this is unacceptable.

I believe that everything would work fine if corpusFile were opened as UTF-8, but I can't get it to work. I've searched the Web, but everything I found about Python/Unicode dealt with something else... So why does this file always end up as ANSI? I am using Python 2.6 on Windows 7 x64, and those file encodings were reported by Notepad++.

Edit - About the corpus parameter

I don't know the encoding of the corpus string. It was generated by the PlaintextCorpusReader.tag() method, from NLTK. The original corpus file was encoded in UTF-8, according to Notepad++. The tagged_token.decode('ISO-8859-1') is just a guess. I've tried to decode it as cp1252, and got the same mangled characters as with ISO-8859-1.
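Incidentally, getting identical results from both guesses is expected: cp1252 and ISO-8859-1 map bytes 0xA0-0xFF to the same characters (they differ only in the 0x80-0x9F range), so accented Portuguese letters decode the same either way. A quick sketch:

```python
# Sketch: cp1252 and iso-8859-1 agree on the accented Latin range,
# so both guesses decode typical pt-br bytes to the same characters.
raw = b'\xe7\xe3o'  # the bytes for 'ção' in both encodings
assert raw.decode('cp1252') == raw.decode('iso-8859-1') == u'ção'
print(raw.decode('iso-8859-1'))
```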

Accepted answer by phihag

When you open the file with codecs.open(filename, 'w', encoding='utf-8'), there is no point in writing byte arrays (str objects) into the file. Instead, write unicode objects, like this:

corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
# ...
tagged_token = '\xdcml\xe4ut'
tagged_token = tagged_token.decode('ISO-8859-1')
corpusFile.write(tagged_token)
corpusFile.write(u'\n')

Note that codecs.open always opens the underlying file in binary mode, so the '\n' above is written as-is: no platform-dependent end-of-line translation takes place.

Alternatively, open a binary file and write byte arrays of already-encoded strings:

corpusFile = open(filename, mode = 'wb')
# ...
tagged_token = '\xdcml\xe4ut'
tagged_token = tagged_token.decode('ISO-8859-1')
corpusFile.write(tagged_token.encode('utf-8'))
corpusFile.write('\n')

This will write platform-independent EOLs. If you want a platform-dependent EOL, write os.linesep instead of '\n'.
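Either way, the bytes that reach the disk are the same UTF-8 sequence; a small check using the sample token from the snippets above:

```python
# Sketch: decode-then-codecs-write and encode-then-binary-write put
# identical UTF-8 bytes on disk for the answer's sample token.
token = b'\xdcml\xe4ut'                 # 'Ümläut' as ISO-8859-1 bytes
as_unicode = token.decode('iso-8859-1')
print(as_unicode.encode('utf-8'))       # the bytes either variant writes
assert as_unicode.encode('utf-8') == b'\xc3\x9cml\xc3\xa4ut'
```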

Note that the encoding naming in Notepad++ is misleading: "ANSI as UTF-8" (its label for UTF-8 without a BOM) is what you want.

Answered by ekhumoro

Try writing the file with a UTF-8 signature (aka BOM):

尝试使用 UTF-8 签名(又名 BOM)编写文件:

def storeTaggedCorpus(corpus, filename):
    corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8-sig')
    for token in corpus:
        tagged_token = '/'.join(str for str in token)
        # print(type(tagged_token)); break
        # tagged_token = tagged_token.decode('cp1252')
        corpusFile.write(tagged_token)
        corpusFile.write(u"\n")
    corpusFile.close()

Note that this will only work properly if tagged_token is a unicode string. To check that, uncomment the first commented line in the above code - it should print <type 'unicode'>.

If tagged_token is not a unicode string, then you will need to decode it first using the second commented line. (NB: I've assumed a "cp1252" encoding, but if you're certain it's "iso-8859-1", then of course you will need to change it.)
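As a sketch of what 'utf-8-sig' actually adds (the temporary file here is only for illustration): the codec prepends the 3-byte UTF-8 BOM, which is the signature Windows editors look for.

```python
import codecs
import os
import tempfile

# Sketch: 'utf-8-sig' prepends the UTF-8 BOM (EF BB BF) to the file,
# which is what lets Notepad and friends auto-detect the encoding.
fd, path = tempfile.mkstemp()
os.close(fd)
with codecs.open(path, mode='w', encoding='utf-8-sig') as corpusFile:
    corpusFile.write(u'ação\n')         # some pt-br text
with open(path, 'rb') as f:
    raw = f.read()
print(raw.startswith(codecs.BOM_UTF8))  # True
os.remove(path)
```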

Answered by John Machin

If you are seeing "mangled" characters from a file, you need to ensure that whatever you are using to view the file understands that the file is UTF-8-encoded.

The files created by this code:

import codecs
for enc in "utf-8 utf-8-sig".split():
    with codecs.open(enc + ".txt", mode = 'w', encoding = enc) as corpusFile:
        tagged_token = '\xdcml\xe4ut'
        tagged_token = tagged_token.decode('cp1252') # not 'ISO-8859-1'
        corpusFile.write(tagged_token) # write unicode objects
        corpusFile.write(u'\n')

are identified thusly:

Notepad++ (version 5.7 (UNICODE)) : UTF-8 without BOM, UTF-8
Firefox (7.0.1): Western(ISO-8859-1), Unicode (UTF-8)
Notepad (Windows 7): UTF-8, UTF-8

Putting a BOM in your UTF-8 file, while deprecated on Unix systems, gives you a much better chance on Windows that other software will be able to recognise your file as UTF-8-encoded.
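Conversely, when reading such a file back you have to cope with the optional BOM yourself. A hypothetical helper (the name read_utf8_text is my own, not from the answers) that tolerates either variant, relying on the fact that the 'utf-8-sig' codec strips a leading BOM and is otherwise identical to plain 'utf-8':

```python
def read_utf8_text(path):
    """Hypothetical helper: read a UTF-8 file, with or without a BOM."""
    with open(path, 'rb') as f:
        raw = f.read()
    # The 'utf-8-sig' codec drops a leading BOM (EF BB BF) if present
    # and behaves exactly like plain 'utf-8' if it is absent.
    return raw.decode('utf-8-sig')
```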
