Windows: detecting the encoding of a string in C/C++

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute the content to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/7523217/

Time: 2020-09-15 18:08:22  Source: igfitidea

Detect encoding of a string in C/C++

Tags: windows, visual-c++, character-encoding

Asked by HymanOdE

Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I am using Visual Studio 2008)? I did a search, but most of the samples are in C#.

Thanks

Answered by MSN

Assuming you know the length of the input array, you can make the following guesses:

  1. First, check whether the first few bytes match any well-known byte order mark (BOM) for Unicode. If they do, you're done!
  2. Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
  3. If any byte is in the range 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
  4. At this point it is either ASCII, UTF-7, Base64, or ranges of UTF-16 or UTF-32 that just happen not to use the top bit and do not contain any null characters.
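
The four steps above can be sketched roughly as follows. This is only an illustration of the guessing order, not a reliable detector: the `EncodingGuess` labels and the `guessEncoding` and `startsWith` names are made up for this example, and the BOM byte sequences are the standard ones for UTF-8, UTF-16 and UTF-32.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical result labels for this sketch; not part of any library.
enum EncodingGuess {
    GUESS_UTF8_BOM,
    GUESS_UTF16_LE,
    GUESS_UTF16_BE,
    GUESS_UTF32_LE,
    GUESS_UTF32_BE,
    GUESS_UTF16_OR_UTF32_NO_BOM,  // step 2: a NUL byte before the last byte
    GUESS_UTF32_NO_BOM,           // step 2: consecutive NUL bytes
    GUESS_UTF8_OR_MULTIBYTE,      // step 3: some byte in 0x80..0xff
    GUESS_ASCII_COMPATIBLE        // step 4: ASCII, UTF-7, Base64, ...
};

// True if the buffer begins with the given prefix bytes.
static bool startsWith(const unsigned char* data, std::size_t len,
                       const unsigned char* prefix, std::size_t prefixLen) {
    return len >= prefixLen && std::memcmp(data, prefix, prefixLen) == 0;
}

EncodingGuess guessEncoding(const unsigned char* data, std::size_t len) {
    static const unsigned char bomUtf32Le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char bomUtf32Be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char bomUtf8[]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char bomUtf16Le[] = {0xFF, 0xFE};
    static const unsigned char bomUtf16Be[] = {0xFE, 0xFF};

    // Step 1: byte order marks. The UTF-32 LE mark starts with the UTF-16 LE
    // mark, so it is tested first. UTF-16 LE text whose first character is
    // U+0000 would be misread as UTF-32 LE here.
    if (startsWith(data, len, bomUtf32Le, 4)) return GUESS_UTF32_LE;
    if (startsWith(data, len, bomUtf32Be, 4)) return GUESS_UTF32_BE;
    if (startsWith(data, len, bomUtf8, 3))    return GUESS_UTF8_BOM;
    if (startsWith(data, len, bomUtf16Le, 2)) return GUESS_UTF16_LE;
    if (startsWith(data, len, bomUtf16Be, 2)) return GUESS_UTF16_BE;

    bool sawNul = false;
    bool sawConsecutiveNuls = false;
    bool sawHighBit = false;
    for (std::size_t i = 0; i < len; ++i) {
        // Step 2: only NUL bytes before the last byte count.
        if (data[i] == 0 && i + 1 < len) {
            if (i > 0 && data[i - 1] == 0) sawConsecutiveNuls = true;
            sawNul = true;
        }
        // Step 3: any byte with the top bit set rules out ASCII and UTF-7.
        if (data[i] >= 0x80) sawHighBit = true;
    }

    if (sawConsecutiveNuls) return GUESS_UTF32_NO_BOM;
    if (sawNul)             return GUESS_UTF16_OR_UTF32_NO_BOM;
    if (sawHighBit)         return GUESS_UTF8_OR_MULTIBYTE;
    return GUESS_ASCII_COMPATIBLE;  // step 4
}
```

A caller would pass the byte pointer and the known length, for example `guessEncoding(reinterpret_cast<const unsigned char*>(p), n)`, and treat the result as a hint to be confirmed by a stricter check such as full UTF-8 validation.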

Answered by russw_uk

It's not an easy problem to solve, and generally relies on heuristics to take a best guess at what the input encoding is, which can be tripped up by relatively innocuous inputs - for example, take a look at this Wikipedia article and The Notepad file encoding Redux for more details.

If you're looking for a Windows-only solution with minimal dependencies, you can look at using a combination of IsTextUnicode and MLang's DetectInputCodePage to attempt character set detection.

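As a rough sketch of the first half of that combination, the wrapper below calls the Win32 function IsTextUnicode on a raw buffer. The name `looksLikeUtf16` is made up for this example. The sketch assumes a Windows build linked against Advapi32.lib; on other platforms it compiles to a stub that always returns false, purely so the snippet stays buildable. The MLang part is not shown.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper: asks Windows whether the buffer is likely to be
// UTF-16 ("Unicode" in Win32 terminology). The answer is statistical and
// can be wrong for short or unusual inputs.
bool looksLikeUtf16(const void* data, int sizeInBytes) {
#ifdef _WIN32
    // Passing NULL as the third argument asks for all available tests.
    // A pointer to an int holding IS_TEXT_UNICODE_* flags could be passed
    // instead to select tests and read back which ones passed.
    return IsTextUnicode(data, sizeInBytes, NULL) != FALSE;
#else
    // Assumption for this sketch only: no detection outside Windows.
    (void)data;
    (void)sizeInBytes;
    return false;
#endif
}
```

If this returns false, the buffer could then be handed to MLang, or to the manual checks described in the other answer.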

If you are looking for portability, but don't mind taking on a fairly large dependency in the form of ICU, then you can make use of its character set detection routines to achieve the same thing in a portable manner.
