Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/2191730/

Date: 2020-11-04 00:00:07  Source: igfitidea

Python: How do I force iso-8859-1 file output?

Tags: python, character-encoding

Asked by AP257

How do I force Latin-1 (which I guess means iso-8859-1?) file output in Python?

Here's my code at the moment. It works, but trying to import the resulting output file into a Latin-1 MySQL table produces weird encoding errors.

outputFile = file( "textbase.tab", "w" )
for k, v in textData.iteritems():
    complete_line = k + '~~~~~' + v + '~~~~~' + " ENDOFTHELINE"
    outputFile.write(complete_line)
    outputFile.write( "\n" )
outputFile.close()

The resulting output file seems to be saved in "Western (Mac OS Roman)", but if I then save it in Latin-1, I still get strange encoding problems. How can I make sure that the strings used, and the file itself, are all encoded in Latin-1 as soon as they are generated?

The original strings (in the textData dictionary) have been parsed in from an RTF file - I don't know if that makes a difference.

I'm a bit new to Python and to encoding generally, so apologies if this is a dumb question. I have tried looking at the docs but haven't got very far.

I'm using Python 2.6.1.

Answered by Torsten Marek

Simply use the codecs module for writing the file:

import codecs
outputFile = codecs.open("textbase.tab", "w", "ISO-8859-1")

Of course, the strings you write have to be Unicode strings (type unicode); they won't be converted if they are plain str objects (which are basically just arrays of bytes). I guess you are reading the RTF file with the normal Python file object as well, so you might have to convert that to use codecs.open too.
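
As a minimal sketch of the point above (not the answerer's own code), the snippet below writes unicode strings through codecs.open and reads the raw bytes back to check that each accented letter became a single ISO-8859-1 byte. The sample dictionary, the temporary path, and the snake_case names are assumptions made for illustration; the u"" literals keep it runnable on both Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

# Hypothetical sample data standing in for the parsed RTF content.
text_data = {u"caf\xe9": u"na\xefve"}

# Write into a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "textbase.tab")

# codecs.open returns a writer that encodes unicode text to ISO-8859-1.
output_file = codecs.open(path, "w", "ISO-8859-1")
for k, v in text_data.items():
    output_file.write(k + u"~~~~~" + v + u"~~~~~" + u" ENDOFTHELINE\n")
output_file.close()

# Read the raw bytes back: each accented letter should be one byte.
with open(path, "rb") as f:
    raw = f.read()

print(repr(raw))
```

If the values were plain byte strings containing non-ASCII bytes, the write would not transcode them, which is why the input side needs decoding as well.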

Answered by beardc

For me, io.open works a bit faster on Python 2.7 for writes, and an order of magnitude faster for reads:

import io
with io.open("textbase.tab", "w", encoding="ISO-8859-1") as outputFile:
    ...

In Python 3, you can just pass the encoding keyword argument to open.
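
A short sketch of the Python 3 form might look like the following. The temporary path and the sample line are assumptions for illustration, and newline="\n" is added here only so the bytes written are the same on every platform.

```python
import os
import tempfile

# Temporary location so the sketch does not overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), "textbase.tab")

# In Python 3 the built-in open() takes the encoding keyword directly.
with open(path, "w", encoding="iso-8859-1", newline="\n") as output_file:
    output_file.write("caf\xe9~~~~~na\xefve~~~~~ ENDOFTHELINE\n")

# Inspect the raw bytes that ended up on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```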

Answered by Matthew Flaschen

I think it's just:

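outputFile = file( "textbase.tab", "wb" )
for k, v in textData.iteritems():
    complete_line = k + '~~~~~' + v + '~~~~~' + " ENDOFTHELINE"
    # Encode the whole line to ISO-8859-1 bytes before writing.
    outputFile.write((complete_line + "\n").encode("iso-8859-1"))
# Close once, after the loop; closing inside it would break the second write.
outputFile.close()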

As you alluded to, you need to make sure you are decoding the RTF file correctly too. For this to work, k and v should be unicode objects.
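
To illustrate why k and v should be unicode objects, here is a small sketch of what encoding to ISO-8859-1 does with text. The helper name to_latin1 and the sample strings are hypothetical. Characters that exist in Latin-1 become single bytes, while characters outside it (such as the euro sign) raise UnicodeEncodeError unless an error handler like "replace" is passed. In Python 2, calling .encode on a plain byte str would first decode it implicitly as ASCII, which fails on non-ASCII bytes.

```python
def to_latin1(text, errors="strict"):
    """Encode a unicode string as ISO-8859-1 bytes (hypothetical helper)."""
    return text.encode("iso-8859-1", errors)

# e-acute exists in Latin-1, so it becomes the single byte 0xE9.
ok = to_latin1(u"caf\xe9")

# The euro sign is not part of ISO-8859-1, so strict encoding fails.
try:
    to_latin1(u"\u20ac")
    failed = False
except UnicodeEncodeError:
    failed = True

# With errors="replace" the unencodable character becomes "?".
replaced = to_latin1(u"5 \u20ac", errors="replace")

print(repr(ok), failed, repr(replaced))
```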

Answered by Lennart Regebro

The main problem here is that you don't know what encoding your data is in. If we assume you are correct in that your file ends up being in Mac OS Roman, then you need to decode the data to unicode first, and then encode it as iso-8859-1.

inputFile = open("input.rtf", "rb") # The b flag is just a marker in Python 2.
data = inputFile.read().decode('mac_roman')
textData = yourparsefunctionhere(data)

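outputFile = open( "textbase.tab", "wb" ) # don't use file()
for k, v in textData.iteritems():
    complete_line = k + '~~~~~' + v + '~~~~~' + " ENDOFTHELINE"
    # Encode the decoded text to ISO-8859-1 bytes before writing.
    outputFile.write((complete_line + "\n").encode("iso-8859-1"))
# Close once, after the loop; closing inside it would break the second write.
outputFile.close()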

But since it's RTF, I wouldn't be surprised if it's Windows encoded, so you might want to try that too. I don't know how RTF specifies the encoding.
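
One rough way to try both guesses is to decode the same bytes with each candidate codec and see which result reads correctly. The sketch below assumes that "Windows encoded" means cp1252, which the answer does not state, and uses a made-up byte string in place of real file contents.

```python
# Hypothetical bytes as they might appear in the input file.
raw = b"caf\x8e"

# Decode the same bytes under each candidate source encoding.
as_mac = raw.decode("mac_roman")   # byte 0x8E is e-acute in Mac OS Roman
as_win = raw.decode("cp1252")      # byte 0x8E is Z-caron in cp1252 (assumed)

# Whichever decoding reads correctly is the one to re-encode as Latin-1.
latin1_bytes = as_mac.encode("iso-8859-1")

print(repr(as_mac), repr(as_win), repr(latin1_bytes))
```

Here the Mac OS Roman decoding yields a sensible word and the cp1252 one does not, which is the kind of visual check that can help pick the right source encoding.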
