Python: how to detect string byte encoding?

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original address and author information. StackOverflow original: http://stackoverflow.com/questions/15918314/

Time: 2020-08-18 21:22:13  Source: igfitidea

How to detect string byte encoding?

Tags: python, string, unicode, encoding, byte

Asked by Phil

I've got about 1000 filenames read by os.listdir(); some of them are encoded in UTF-8 and some in CP1252.

I want to decode all of them to Unicode for further processing in my script. Is there a way to get the source encoding to correctly decode into Unicode?

Example:

import os

for item in os.listdir(rootPath):
    # Convert to Unicode (Python 2: str is a byte string)
    if isinstance(item, str):
        item = item.decode('cp1252')  # or item = item.decode('utf-8')
    print item

Accepted answer by lucemia

If your filenames are encoded in either cp1252 or utf-8, then there is an easy way.

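import logging
import os

# Python 2: str is a byte string, so it has .decode()
def force_decode(string, codecs=('utf8', 'cp1252')):
    # Try each codec in order and return the first successful decode
    for i in codecs:
        try:
            return string.decode(i)
        except UnicodeDecodeError:
            pass

    # No codec worked; log it and implicitly return None
    logging.warning("cannot decode url %s" % ([string]))

for item in os.listdir(rootPath):
    # Convert to Unicode
    if isinstance(item, str):
        item = force_decode(item)
    print item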
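
The snippet above targets Python 2, where str is a byte string. Here is a minimal Python 3 sketch of the same idea, assuming the filenames are requested as bytes by passing a bytes path to os.listdir(). UTF-8 is tried first because it is strict: cp1252 bytes for non-ASCII text are rarely valid UTF-8, whereas cp1252 accepts almost any byte sequence and would mask UTF-8 input. The helper name decode_filenames is hypothetical and not part of the original answer.

```python
import logging
import os


def force_decode(raw, codecs=("utf-8", "cp1252")):
    """Return raw decoded with the first codec that succeeds, or None."""
    for codec in codecs:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    logging.warning("cannot decode %r", raw)
    return None


def decode_filenames(root_path):
    """Hypothetical helper: list root_path and decode each bytes filename."""
    # Passing a bytes path makes os.listdir() return bytes filenames.
    return [force_decode(name) for name in os.listdir(os.fsencode(root_path))]


print(force_decode("é".encode("utf-8")))   # valid UTF-8, so the first codec is used
print(force_decode("é".encode("cp1252")))  # invalid UTF-8, so cp1252 is used
```

This approach can still guess wrong: a cp1252 name whose bytes happen to form valid UTF-8 will be decoded as UTF-8.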

Otherwise, there is a charset detection library.

Python - detect charset and convert to utf-8

https://pypi.python.org/pypi/chardet

Answer by george

Use the chardet library. It is super easy:

import chardet

# detect() expects a byte string (bytes in Python 3)
the_encoding = chardet.detect(b'your string')['encoding']

and that's it!
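
One caveat: chardet.detect() returns a statistical guess, and short inputs such as filenames give it little to work with. Below is a hedged Python 3 sketch that checks the reported confidence before trusting the guess. The function name decode_detected, the 0.5 threshold, and the cp1252 fallback are assumptions for illustration, not part of the original answer. The detect parameter exists only so the function can be exercised without chardet installed.

```python
def decode_detected(raw, min_confidence=0.5, fallback="cp1252", detect=None):
    """Decode bytes using a detector's guess, falling back when the guess is weak."""
    if detect is None:
        import chardet  # third-party package: pip install chardet
        detect = chardet.detect
    guess = detect(raw)  # dict with 'encoding' and 'confidence' keys
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    # Assumed policy: distrust missing or low-confidence guesses.
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong or names an unknown codec; never raise here.
        return raw.decode(fallback, errors="replace")
```

With chardet installed, decode_detected(name) can be called on each bytes filename. The threshold is a judgment call and may need tuning for a given set of files.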
