Note: this page reproduces a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/3323770/
Character detection in a text file in Python using the Universal Encoding Detector (chardet)
Asked by u365975
I am trying to use the Universal Encoding Detector (chardet) in Python to detect the most probable character encoding in a text file ('infile') and use that in further processing.
While chardet is designed primarily for detecting the character encoding of webpages, I have found an example of it being used on individual text files.
However, I cannot work out how to tell the script to assign the most likely character encoding to the variable 'charenc' (which is used several times throughout the script).
My code, based on a combination of the aforementioned example and chardet's own documentation, is as follows:
import chardet
rawdata=open(infile,"r").read()
chardet.detect(rawdata)
Character detection is necessary as the script goes on to run the following (as well as several similar uses):
inF=open(infile,"rb")
s=unicode(inF.read(),charenc)
inF.close()
Any help would be greatly appreciated.
Accepted answer by David Z
chardet.detect() returns a dictionary which provides the encoding as the value associated with the key 'encoding'. So you can do this:
import chardet
rawdata = open(infile, 'rb').read()
result = chardet.detect(rawdata)
charenc = result['encoding']
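One caveat worth handling: the dictionary also carries a 'confidence' score, and the 'encoding' value can be None when chardet cannot make a guess (an empty file, for example). Below is a minimal sketch of a wrapper, written for Python 3. The names pick_encoding and detect_file_encoding, the 'utf-8' fallback and the 0.5 threshold are illustrative assumptions, not part of chardet:

```python
def pick_encoding(result, fallback="utf-8", min_confidence=0.5):
    """Choose an encoding name from a chardet.detect()-style result dict.

    fallback and min_confidence are illustrative defaults, not chardet settings.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # No guess at all, or a guess too weak to trust: use the fallback.
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


def detect_file_encoding(path, fallback="utf-8"):
    """Read a file as bytes and return chardet's best guess (or the fallback)."""
    import chardet  # third-party; imported here so pick_encoding works without it

    with open(path, "rb") as f:  # binary mode, so chardet receives bytes
        rawdata = f.read()
    return pick_encoding(chardet.detect(rawdata), fallback=fallback)
```

With this in place, the assignment from the question would read charenc = detect_file_encoding(infile).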
The chardet documentation is not explicit about whether the module is meant to work with text strings, byte strings, or both. But it stands to reason that if you already have a text string you don't need to run character detection on it, so you should probably be passing byte strings; hence the binary mode flag (b) in the call to open(). chardet.detect() might also work with a text string, depending on which versions of Python and of the library you're using: if you omit the b, you might find that it works anyway, even though you're technically doing something wrong.
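For readers on Python 3, where unicode() no longer exists, the decoding step from the question could be sketched roughly as follows. This assumes charenc already holds an encoding name; read_text is a hypothetical helper name, and errors="replace" is just one possible policy for bytes that do not fit the guessed encoding:

```python
def read_text(path, charenc, errors="replace"):
    """Return the file's contents decoded with the given encoding."""
    with open(path, "rb") as in_f:  # read raw bytes, as in the question
        raw = in_f.read()
    # bytes.decode() is the Python 3 counterpart of unicode(raw, charenc)
    return raw.decode(charenc, errors=errors)
```

The question's three lines would then collapse to s = read_text(infile, charenc). Because detection is only a best guess, a wrong guess shows up here as replacement characters rather than as an exception; passing errors="strict" instead makes a wrong guess fail loudly.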

