Note: this page reproduces a popular Stack Overflow question and its answers, licensed under CC BY-SA 4.0. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/14079343/

Python, Windows, Ansi - encoding, again

Tags: python, windows, character-encoding, ansi

Asked by xph

Hello there,

Even though I really tried... I'm stuck and somewhat desperate when it comes to Python, Windows, Ansi and character encoding. I need help, seriously... searching the web for the last few hours wasn't any help, it just drives me crazy.

I'm new to Python, so I have almost no clue what's going on. I'm about to learn the language, so my first program, which is almost done, should automatically generate music playlists from a given folder containing MP3s. That works just fine, besides one single problem...

...I can't write umlauts (äöü) to the playlist file.

After I found a solution for "wrong-encoded" data in sys.argv, I was able to deal with that. When reading metadata from the MP3s, I'm using some sort of simple character substitution to get rid of all those international special chars, like French accents or this crazy Scandinavian "o" with a slash in it (I don't even know how to type it...). All fine.

But I'd like to write at least the mentioned umlauts to the playlist file; those characters are really common here in Germany. And unlike the metadata, where I don't care about some missing characters or misspelled words, this is relevant, because now I'm writing the paths to the files.

I've tried so many different encoding and decoding methods that I can't list them all here... heck, I'm not even able to tell which settings I tried half an hour ago. I found code online, here and elsewhere, that seemed to work for some purposes. Not for mine.

I think the tricky part is this: it seems like the problem is the format called Ansi of the files I need to write. Correct, I actually need this Ansi stuff. About two hours ago I actually managed to write whatever I'd like to a UTF-8 file. Works like a charm... until I realized that my player (Winamp, old version) somehow doesn't work with those UTF-8 playlist files. It couldn't resolve the path, even if it looks right in my editor.

If I change the file format back to Ansi, paths containing special chars get corrupted. I'm just guessing, but if Winamp reads these UTF-8 files as Ansi, that would cause the problem I'm experiencing right now.

So...

  1. I DO have to write äöü in a path, or it will not work
  2. It DOES have to be an ANSI-"encoded" file, or it will not work
  3. Things like line.write(str.decode('utf-8')) break the function of the file
  4. A magical comment at the beginning of the script like # -*- coding: iso-8859-1 -*- does nothing here (though it is helpful when it comes to the mentioned metadata and allowed characters in it...)
  5. Oh, and I'm using Python 2.7.3. Third-party module dependencies, you know...

Is there ANYONE who could guide me towards a way out of this encoding hell? Any help is welcome. If I need 500 lines of code for more functions or classes, I'll type them. If there's a module for handling such stuff, let me know! I'd buy it! Anything helpful will be tested.

Thank you for reading, thanks for any comment,

greets!

Accepted answer by Thomas Orozco

As mentioned in the comments, your question isn't very specific, so I'll try to give you some hints about character encodings, see if you can apply those to your specific case!

Unicode and Encoding

Here's a small primer about encoding. Basically, there are two ways to represent text in Python:

  • unicode. You can consider that unicode is the ultimate encoding; you should strive to use it everywhere. In Python 2.x source files, unicode strings look like u'some unicode'.
  • str. This is encoded text: to be able to read it, you need to know the encoding (or guess it). In Python 2.x, those strings look like 'some str'.

This changed in Python 3 (unicode is now str and str is now bytes).

How does that play out?

Usually, it's pretty straightforward to ensure that your code uses unicode for its execution, and uses str for I/O:

  • Everything you receive is encoded, so you do input_string.decode('encoding') to convert it to unicode.
  • Everything you need to output is unicode but needs to be encoded, so you do output_string.encode('encoding').
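
The two bullets above can be sketched as a small round trip. This is only an illustration: the file name and the choice of utf-8 as the input encoding are assumptions, not something taken from the question. The snippet sticks to u'' and b'' literals so that it should run under both Python 2.7 and Python 3.

```python
# Hypothetical input: bytes as they might arrive from a UTF-8 source.
raw_name = b'M\xc3\xa4dchen.mp3'

# Decode on input: from here on the program works with unicode text.
text = raw_name.decode('utf-8')

# Encode on output: pick the encoding the consumer expects (cp1252 here).
ansi_bytes = text.encode('cp1252')

print(repr(text))
print(repr(ansi_bytes))
```

The same character takes two bytes in utf-8 (\xc3\xa4) and one byte in cp1252 (\xe4), which is why the bytes must not be passed through unchanged.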


The most common encodings are cp1252 on Windows (on US or EU systems), and utf-8 on Linux.

Applying this to your case

I DO have to write äöü in a path, or it will not work

Windows natively uses unicode for file paths and names, so you should actually always use unicode for those.

It DOES have to be an ANSI-"encoded" file, or it will not work

When you write to the file, be sure to always run your output through output.encode('cp1252') (or whatever encoding ANSI would be on your system).
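
As a minimal sketch of that advice, io.open (available in Python 2.7 and Python 3) can do the encoding while writing. The function name write_playlist, the sample paths, and the Windows line endings are assumptions made for illustration; whether your player wants exactly this layout is something you would have to check.

```python
import io
import os
import tempfile


def write_playlist(playlist_path, track_paths, encoding='cp1252'):
    """Write one unicode path per line, encoded with `encoding`."""
    # newline='\r\n' turns every '\n' written into a Windows line ending.
    with io.open(playlist_path, 'w', encoding=encoding, newline='\r\n') as handle:
        for track in track_paths:
            # A character that cp1252 cannot represent raises
            # UnicodeEncodeError here instead of being silently corrupted.
            handle.write(track + u'\n')


# Hypothetical track paths containing umlauts.
tracks = [
    u'C:\\Musik\\Gr\xf6nemeyer\\M\xe4nner.mp3',
    u'C:\\Musik\\\xdcbung.mp3',
]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.m3u')
write_playlist(demo_path, tracks)

# Read the raw bytes back to see what actually landed on disk.
with io.open(demo_path, 'rb') as handle:
    raw = handle.read()
print(repr(raw))
```

Each umlaut ends up as a single byte (for example \xf6 for ö), which is what a program reading the file as ANSI on a Western European system would expect.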

Things like line.write(str.decode('utf-8')) break the function of the file

By now you probably realized that:

  • If str is indeed a str instance, Python will try to convert it to unicode using the utf-8 encoding, but then try to encode it again (likely in ascii) to write it to the file.
  • If str is actually a unicode instance, Python will first encode it (likely in ascii, and that will probably crash) to then be able to decode it.

Bottom line: you need to know what you have. If str is unicode, you should encode it. If it's already encoded, don't touch it (or decode it and then encode it if the encoding is not the one you want!).
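
That rule could look roughly like the helper below. The name to_ansi_bytes and both default encodings are assumptions for illustration; in particular, the helper has no way to detect the real encoding of byte input, so source_encoding has to be supplied by whoever knows where the bytes came from. The bytes type is an alias for str in Python 2.7, so the check should work on both major versions.

```python
def to_ansi_bytes(value, source_encoding='utf-8', target_encoding='cp1252'):
    """Return `value` as bytes in `target_encoding`.

    Byte input is assumed to be in `source_encoding` and is decoded first;
    unicode input is encoded directly.
    """
    if isinstance(value, bytes):
        value = value.decode(source_encoding)
    return value.encode(target_encoding)


from_text = to_ansi_bytes(u'\xfc')         # unicode u-umlaut
from_utf8 = to_ansi_bytes(b'\xc3\xbc')     # the same character as UTF-8 bytes
print(repr(from_text), repr(from_utf8))
```

Both calls should produce the same single cp1252 byte, regardless of which form the input arrived in.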

A magical comment at the beginning of the script like # -*- coding: iso-8859-1 -*- does nothing here (though it is helpful when it comes to the mentioned Metadata and allowed characters in it...)

Not a surprise, this only tells Python what encoding should be used to read your source file so that non-ascii characters are properly recognized.

Oh, and i'm using Python 2.7.3. Third-Party modules dependencies, you know...

Python 3 probably is a big update in terms of unicode and encoding, but that doesn't mean Python 2.x can't make it work!

Will that solve your issue?

You can't be sure; it's possible that the problem lies in the player you're using, not in your code.

Once you have written the file, you should make sure that your script's output is readable using reference tools (such as Windows Explorer). If it is, but the player still can't open it, you should consider updating to a newer version.
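
One way to test the guess from the question (that the player reads UTF-8 bytes as ANSI) is to reproduce the misreading yourself. The path below is hypothetical, and cp1252 is assumed to be the ANSI code page; the sketch only shows what such a misreading would look like, not what Winamp actually does.

```python
# A hypothetical path containing one umlaut (u-umlaut is \xfc).
path = u'C:\\Musik\\Gl\xfcck.mp3'

# What a UTF-8 playlist would contain on disk.
utf8_bytes = path.encode('utf-8')

# What a reader assuming cp1252 would make of those bytes.
misread = utf8_bytes.decode('cp1252')

print(repr(utf8_bytes))
print(repr(misread))
```

The single character ü becomes two characters (Ã followed by ¼), so the misread path no longer names an existing file, which would match the "path looks right in my editor but cannot be resolved" symptom.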

Answer by wberry

# -*- coding comments declare the character encoding of the source code (and therefore of byte-string literals like 'abc').

Assuming that by "playlist" you mean m3u files, then based on this specification you may be at the mercy of the mp3 player software you are using. This spec says only that the files contain text, with no mention of which character encoding.

I have personally observed that various mp3 encoding software will use different encodings for mp3 metadata. Some use UTF-8, others ISO-8859-1. So you may have to allow encoding to be specified in configuration and leave it at that.

Answer by Glushiator

On Windows there is a special encoding available called mbcs. It converts between the current default ANSI code page and Unicode. For example, on a Spanish-language PC:

u'ñ'.encode('mbcs') -> '\xf1'
'\xf1'.decode('mbcs') -> u'ñ'

On Windows, ANSI means the current default multi-byte code page. For Western European languages this is typically Windows-1252 (closely related to ISO-8859-1), for Central and Eastern European languages typically Windows-1250 (which plays the role of ISO-8859-2), and other code pages for other languages as appropriate.
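
Since the mbcs codec only exists on Windows, a script that may also run elsewhere needs a fallback. The sketch below is one possible arrangement; the helper name ansi_codec_name and the cp1252 fallback are assumptions for illustration, and on a Windows machine whose ANSI code page has no ñ the encode call may fail or substitute another character.

```python
import sys


def ansi_codec_name():
    """Return 'mbcs' on Windows, else an assumed cp1252 fallback."""
    # 'mbcs' maps to whatever the current ANSI code page is, so the
    # result of encoding depends on the machine's regional settings.
    return 'mbcs' if sys.platform == 'win32' else 'cp1252'


codec = ansi_codec_name()
encoded = u'\xf1'.encode(codec)   # n with tilde, as in the example above
print(codec, repr(encoded))
```

Using the codec name through one helper keeps the rest of the script unaware of the platform, so the playlist-writing code can simply pass the returned name as its encoding.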

More info available at:

https://docs.python.org/2.4/lib/standard-encodings.html

See also:

https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding
