Original source: http://stackoverflow.com/questions/4672659/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
What's the best way to identify unicode encoded text files in Windows?
Asked by HOCA
I am working on a codebase which has some unicode encoded files scattered throughout as a result of multiple team members developing with different editors (and default settings). I would like to clean up our code base by finding all the unicode encoded files and converting them back to ANSI encoding.
Any thoughts on how to accomplish the "finding" part of this task would be truly appreciated.
Accepted answer by dan04
See “How to detect the character encoding of a text-file?” or “How to reliably guess the encoding [...]?”
- UTF-8 can be detected with validation. You can also look for the BOM EF BB BF, but don't rely on it.
- UTF-16 can be detected by looking for the BOM.
- UTF-32 can be detected by validation, or by the BOM.
- Otherwise assume the ANSI code page.
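The rules in the list above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `guess_encoding` and the labels it returns are made up for this example, and it does not attempt to validate UTF-16 or UTF-32 files that lack a BOM.

```python
import codecs

# Check the UTF-32 BOMs before the UTF-16 ones: the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes) -> str:
    """Return a best-guess label for raw file bytes (hypothetical helper)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")  # strict decoding doubles as validation
    except UnicodeDecodeError:
        return "ansi"  # not valid UTF-8: assume the ANSI code page
    # Pure ASCII is also valid UTF-8, so report it separately.
    return "ascii" if data.isascii() else "utf-8"
```

Reading a file with `open(path, "rb").read()` and passing the bytes to `guess_encoding` gives one label per file. Anything reported as "ansi" still needs a human to confirm which code page is actually in use.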
Our codebase doesn't include any non-ASCII chars. I will try to grep for the BOM in files in our codebase. Thanks for the clarification.
Well, that makes things a lot simpler. UTF-8 without non-ASCII chars is ASCII.
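For the "grep for the BOM" idea, a small Python script can walk the tree and report every file that starts with a BOM. This is a minimal sketch, not a definitive tool: `find_bom_files` is a hypothetical name, and it only looks at the first four bytes of each file, so BOM-less UTF-16 files would slip through.

```python
import codecs
import os

# The BOMs to look for; the longest (UTF-32) is 4 bytes.
BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def find_bom_files(root):
    """Yield paths under root whose first bytes are a Unicode BOM."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    head = f.read(4)
            except OSError:
                continue  # unreadable file: skip it
            if head.startswith(BOMS):
                yield path
```

Something like `for path in find_bom_files("src"): print(path)` would then list the candidates, where `src` stands in for wherever the code actually lives.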
Answer by Dour High Arch
Unicode is a standard, not an encoding. There are many encodings that implement Unicode, including UTF-8, UTF-16, UCS-2, and others. The translation of any of these encodings to ASCII depends entirely on what encoding your "different editors" use.
Some editors insert byte-order marks, or BOMs, at the start of Unicode files. If your editors do that, you can use them to detect the encoding.
ANSI is a standards body that has published several encodings for digital character data. The "ANSI" encoding supported in Windows is actually a Windows code page (CP-1252 on Western European systems), not an ANSI standard.
Does your codebase include non-ASCII characters? You may have better compatibility using a Unicode encoding rather than an ANSI one or CP-1252.
Answer by John
Actually, if you want to find out in Windows whether a file is Unicode, simply run findstr on the file for a string you know is in there.
findstr /I /C:"SomeKnownString" file.txt
It will come back empty. Then to be sure, run findstr on a letter or digit you know is in the file:
FindStr /I /C:"P" file.txt
You will probably get many occurrences, and the key is that the characters will appear spaced apart. This is a sign that the file is Unicode (most likely UTF-16) rather than ASCII.
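The spacing comes from how UTF-16 stores ASCII text: every character is followed by a zero byte. The short Python sketch below shows why a multi-character search comes back empty while a single letter still matches, assuming the search compares raw bytes. `nul_ratio` is a hypothetical helper added only for illustration.

```python
text = "SomeKnownString"
ascii_bytes = text.encode("ascii")   # how an ANSI/ASCII file stores the text
utf16_bytes = text.encode("utf-16-le")  # how a UTF-16 LE file stores it


def nul_ratio(data: bytes) -> float:
    """Fraction of zero bytes in data; about 0.5 for ASCII text in UTF-16."""
    return data.count(0) / len(data) if data else 0.0


# Each character is followed by a zero byte in UTF-16 LE.
print(utf16_bytes[:8])             # b'S\x00o\x00m\x00e\x00'
# The contiguous ASCII bytes never occur, so a byte-wise search fails...
print(ascii_bytes in utf16_bytes)  # False
# ...but a single letter is still found.
print(b"S" in utf16_bytes)         # True
print(nul_ratio(utf16_bytes))      # 0.5
```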
Hope this helps.
Answer by Luke
If you're looking for a programmatic solution, IsTextUnicode() might be an option.
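As a rough illustration, IsTextUnicode() can be reached from Python through ctypes. The wrapper name `is_text_unicode` is hypothetical, the sketch assumes the function is loaded from advapi32.dll, and it passes NULL for the optional result flags so that Windows applies all of its tests. The function is a statistical heuristic and can misjudge short inputs, so treat the answer as a hint rather than proof.

```python
import ctypes
import sys


def is_text_unicode(data: bytes):
    """Ask Windows whether data looks like UTF-16 text.

    Returns True or False on Windows, and None on other platforms,
    where the API is not available.
    """
    if sys.platform != "win32":
        return None
    if not data:
        return False
    advapi32 = ctypes.WinDLL("advapi32")
    func = advapi32.IsTextUnicode
    # BOOL IsTextUnicode(const VOID *lpv, int iSize, LPINT lpiResult)
    func.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
    func.restype = ctypes.c_int
    buf = ctypes.create_string_buffer(data, len(data))
    # NULL for lpiResult: let the function run every available test.
    return bool(func(buf, len(data), None))
```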
Answer by David Heffernan
It's kind of hard to say, but I'd start by looking for a BOM. Most Windows programs that write Unicode files emit BOMs.
If these files exist in your codebase, presumably they compile. You might ask yourself whether you really need to do this "tidying up". If you do need to do it, then I would ask how the tool chain that processes these files discovers their encoding. If you know that, then you'll be able to use the same diagnostic.