Python: How to detect string byte encoding?
Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/15918314/
How to detect string byte encoding?
Asked by Phil
I've got about 1000 filenames read by os.listdir(); some of them are encoded in UTF-8 and some in CP1252.
I want to decode all of them to Unicode for further processing in my script. Is there a way to get the source encoding to correctly decode into Unicode?
Example:
for item in os.listdir(rootPath):
    # Convert to Unicode
    if isinstance(item, str):
        item = item.decode('cp1252')  # or item = item.decode('utf-8')
    print item
Accepted answer by lucemia
If your filenames are encoded in either cp1252 or utf-8, there is an easy way: try each codec in turn and keep the first one that decodes without error.
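import logging
import os

def force_decode(string, codecs=('utf8', 'cp1252')):
    # Try each codec in order and return the first successful decode
    for i in codecs:
        try:
            return string.decode(i)
        except UnicodeDecodeError:
            pass
    # No codec worked: log a warning and implicitly return None
    logging.warning("cannot decode url %s" % ([string]))

for item in os.listdir(rootPath):
    # Convert to Unicode (on Python 2, str is a byte string)
    if isinstance(item, str):
        item = force_decode(item)
    print item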
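The answer's code targets Python 2, where str is a byte string. As a rough sketch of the same try-each-codec idea on Python 3 (an assumption, not part of the original answer), the version below works on bytes objects. The helper name list_decoded is hypothetical. The codec order matters: UTF-8 is strict and rejects most CP1252 byte sequences, while CP1252 accepts almost any bytes, so UTF-8 should be tried first.

```python
import logging
import os


def force_decode(raw, codecs=("utf-8", "cp1252")):
    """Return raw decoded with the first codec that succeeds, else None."""
    for codec in codecs:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Every codec failed: log the raw bytes and signal failure with None
    logging.warning("cannot decode %r", raw)
    return None


def list_decoded(root):
    """Hypothetical helper: list a directory and decode each name."""
    # Passing a bytes path makes os.listdir() return bytes names on Python 3
    return [force_decode(name) for name in os.listdir(os.fsencode(root))]
```

A name that fails every codec comes back as None, so callers should check for that before further processing.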
Otherwise, if other encodings are possible, there is a charset detection library. See the related question:
Python - detect charset and convert to utf-8
Answer by george
Use the chardet library. It is very easy:
import chardet
# detect() expects a byte string and returns a dict with an 'encoding' key
the_encoding = chardet.detect(b'your string')['encoding']
And that's it!
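One caveat: chardet guesses statistically, so on short inputs such as filenames the guess can be wrong or None. The sketch below is one possible way to combine the guess with a fixed fallback list; it is written for Python 3, and the names detect_and_decode and fallbacks are hypothetical, not part of chardet. It also assumes chardet may not be installed and falls back to the fixed list in that case.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # fall back to the fixed codec list only


def detect_and_decode(raw, fallbacks=("utf-8", "cp1252")):
    """Decode raw bytes using chardet's guess first, then the fallbacks."""
    candidates = []
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]  # may be None
        if guess:
            candidates.append(guess)
    candidates.extend(fallbacks)
    for codec in candidates:
        try:
            return raw.decode(codec)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a codec name Python does not know
            continue
    return None
```

A wrong guess that still decodes without error will produce garbled text, so when the possible encodings are known in advance, the fixed-list approach is likely the safer choice.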

