Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/664372/

Date: 2020-11-03 20:34:48  Source: igfitidea

Using Norwegian letters æøå in Python

Tags: python, utf-8

Asked by ThoKra

I'm learning Python and PyGTK now, and have created a simple Music Organizer: http://pastebin.com/m2b596852 But when it edits songs with the Norwegian letters æ, ø, and å, it just changes them to a weird character.

So is there any good way of opening the files, or of encoding the names as UTF-8 characters?

Two relevant places from the above code:

Read info from a file:

def __parse(self, filename):
    "parse ID3v1.0 tags from MP3 file"
    self.clear()
    self['artist'] = 'Unknown'
    self['title'] = 'Unknown'
    try:
        fsock = open(filename, "rb", 0)
        try:
            fsock.seek(-128, 2)
            tagdata = fsock.read(128)
        finally:
            fsock.close()
        if tagdata[:3] == 'TAG':
            for tag, (start, end, parseFunc) in self.tagDataMap.items():
                self[tag] = parseFunc(tagdata[start:end])
    except IOError:
        pass

Print info to sys.stdout:

for info in files:
    try:
        os.rename(info['name'], 
            os.path.join(self.dir, info['artist'])+' - '+info['title']+'.mp3')

        print 'From: '+ info['name'].replace(os.path.join(self.dir, ''), '')
        print 'To:   '+ info['artist'] +' - '+info['title']+'.mp3'
        print
        self.progressbar.set_fraction(i/num)
        self.progressbar.set_text('File %d of %d' % (i, num))
        i += 1
    except IOError:
        print 'Rename fail'

Answered by Jarret Hardie

You want to start by decoding the input FROM the charset it is in TO Unicode (in Python, decode means "take it from some charset to Unicode", and encode means "take it from Unicode to some other charset").

Some googling suggests the Norwegian charset is plain-ole 'iso-8859-1'... I hope someone can correct me if I'm wrong on this detail. Regardless, substitute whatever the name of the charset is in the following example:

tagdata[start:end].decode('iso-8859-1')
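
To make the decode step concrete, here is a minimal sketch written for Python 3, where the split between bytes and text is explicit. The helper name `decode_field` and the sample title are made up for illustration; the only assumption carried over from the answer is that the tag bytes are in iso-8859-1.

```python
# Hypothetical 30-byte ID3v1 title field: Latin-1 bytes padded with NULs.
raw_title = "Bl\u00e5b\u00e6rsyltet\u00f8y".encode("iso-8859-1").ljust(30, b"\x00")


def decode_field(raw, charset="iso-8859-1"):
    """Strip NUL padding, then decode the raw bytes into a text string."""
    return raw.replace(b"\x00", b"").decode(charset).strip()


# The result is a Unicode string that still contains the letters å, æ and ø.
title = decode_field(raw_title)
```

Once the field is text rather than raw bytes, it can be joined into the new filename without the Norwegian letters being mangled.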

In a real-world app, I realize you can't guarantee that the input is Norwegian, or in any particular charset. In this case, you will probably want to proceed through a series of likely charsets to see which you can convert successfully. Both SO and Google have some suggestions on algorithms for doing this effectively in Python. It sounds scarier than it really is.
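
One way to sketch the "series of likely charsets" idea in Python 3 is shown below. The function name `decode_with_fallback` and the candidate list are assumptions, not something from the answer. Note that iso-8859-1 maps every possible byte to a character, so it never fails and has to come last; stricter encodings such as UTF-8 go first.

```python
def decode_with_fallback(raw, charsets=("utf-8", "iso-8859-1")):
    """Return (text, charset) for the first charset that decodes raw cleanly."""
    for charset in charsets:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            # These bytes are not valid in this charset; try the next one.
            continue
    raise ValueError("none of the candidate charsets could decode the data")
```

A successful decode only means the bytes are valid in that charset, not that the guess is right, so this is a heuristic rather than real detection.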

Answered by David Z

You'd need to convert the bytestrings you read from the file into Unicode character strings. Looking at your code, I would do this in the parsing function, i.e. replace stripnulls with something like this:

import codecs

def stripnulls_and_decode(data):
    # utf_8_decode returns a (text, bytes_consumed) tuple; keep only the text
    return codecs.utf_8_decode(data.replace("\00", ""))[0].strip()

Note that this will only work if the strings in the file are in fact encoded in UTF-8. If they're in a different encoding, you'd have to use the corresponding decoding function from the codecs module.
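
If the encoding is not known in advance, one option is to look the decoding function up by name instead of hard-coding it. The sketch below is for Python 3 and uses `codecs.getdecoder`; the function name `stripnulls_and_decode_as` and its `charset` parameter are hypothetical additions, not part of the original answer.

```python
import codecs


def stripnulls_and_decode_as(data, charset="utf-8"):
    """Remove NUL padding from data and decode it with the named charset."""
    decoder = codecs.getdecoder(charset)
    # Decoder functions return a (text, bytes_consumed) tuple.
    text, _consumed = decoder(data.replace(b"\x00", b""))
    return text.strip()
```

Calling it with `"iso-8859-1"` instead of the default would cover the Latin-1 case suggested in the other answer.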

Answered by jfs

I don't know what encodings are used for MP3 tags, but if you are sure that it is UTF-8, then:

tagdata[start:end].decode("utf-8")

The line # -*- coding: utf-8 -*- defines your source code encoding; it doesn't define the encoding used to read from or write to files.
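
The difference can be shown with a small Python 3 sketch in which the file encoding is passed explicitly to `io.open`, independently of the coding line. The helper `write_and_read` is a made-up name used only for this illustration.

```python
# -*- coding: utf-8 -*-
# The line above only tells Python how this source file itself is encoded.
import io
import os
import tempfile


def write_and_read(text, encoding):
    """Write text to a temporary file in the given encoding, then return
    the raw bytes on disk together with the text read back from it."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with io.open(path, "w", encoding=encoding) as f:
            f.write(text)
        with io.open(path, "rb") as f:
            raw = f.read()
        with io.open(path, "r", encoding=encoding) as f:
            return raw, f.read()
    finally:
        os.remove(path)
```

The same three letters end up as three bytes on disk in iso-8859-1 and six bytes in UTF-8, and which one you get depends only on the `encoding` argument.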