Original source: http://stackoverflow.com/questions/4063146/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Getting the actual length of a UTF-8 encoded std::string?
Asked by jmasterx
My std::string is UTF-8 encoded, so obviously str.length() returns the wrong result.
I found this information but I'm not sure how I can use it to do this:
The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:
0x00000000 - 0x0000007F: 0xxxxxxx
0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
How can I find the actual length of a UTF-8 encoded std::string? Thanks
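To make the table above concrete, here is a minimal sketch of an encoder that follows those four bit patterns. The function name encode_utf8 is hypothetical and not part of the question. The sketch does not reject surrogates or values above 0x10FFFF, which a strict encoder would refuse.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from the question): encode one code point as UTF-8
// using the bit patterns from the table. Returns an empty string for values
// outside the table's range.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {                // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {       // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x1FFFFF) {     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Every byte after the first in a sequence starts with the bits 10, which is why one character can occupy between one and four char values in a std::string.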
Answer by Marcelo Cantos
Count all first-bytes (the ones that don't match 10xxxxxx).
// s is a const char* pointing at a NUL-terminated UTF-8 string
int len = 0;
while (*s) len += (*s++ & 0xc0) != 0x80;
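As a rough illustration, the loop above can be wrapped in a function that takes a std::string. The name count_code_points is hypothetical. Like the loop, this sketch assumes the input is already valid UTF-8 and counts code points, not user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper around the lead-byte counting loop; the name is illustrative.
std::size_t count_code_points(const std::string& s)
{
    std::size_t len = 0;
    for (unsigned char byte : s) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a character.
        if ((byte & 0xC0) != 0x80) ++len;
    }
    return len;
}
```

Because it iterates over the string's stored size rather than stopping at the first zero byte, this variant also handles strings with embedded NUL characters.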
Answer by user2781185
"C++ knows nothing about encodings, so you can't expect to use a standard function to do this."
The standard library does indeed acknowledge the existence of character encodings, in the form of locales. If your system supports a suitable locale, it is very easy to use the standard library to compute the length of a string. In the example code below I assume your system supports the locale en_US.utf8. If I compile the code and execute it as "./a.out ソニーSony", the output is that there were 13 char-values and 7 characters. And all without any reference to the internal representation of UTF-8 character codes or having to use third-party libraries.
#include <clocale>
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;
int main(int argc, char *argv[])
{
    if (argc < 2) return 1; // expects the string to measure as argv[1]
    string str(argv[1]);
    size_t strLen = str.length();
    cout << "Length (char-values): " << strLen << '\n';
    // mblen decodes according to the current locale, so select a UTF-8 one
    if (!setlocale(LC_ALL, "en_US.utf8")) return 1; // locale not available
    size_t u = 0;
    const char *c_str = str.c_str();
    size_t charCount = 0;
    while (u < strLen)
    {
        int n = mblen(&c_str[u], strLen - u);
        if (n <= 0) return 1; // invalid or incomplete multibyte sequence
        u += n;
        charCount += 1;
    }
    cout << "Length (characters): " << charCount << endl;
}
Answer by Charles Salvia
You should probably take the advice of Omry and look into a specialized library for this. That said, if you just want to understand the algorithm to do this, I'll post it below.
Basically, you can convert your string into a wider-element format, such as wchar_t. Note that wchar_t has a few portability issues, because its size varies depending on your platform. On Windows, wchar_t is 2 bytes, and therefore ideal for representing UTF-16. But on UNIX/Linux, it is 4 bytes and is therefore used to represent UTF-32. Therefore, on Windows this will only work if you don't include any Unicode code points above 0xFFFF. On Linux you can include the entire range of code points in a wchar_t. (Fortunately, this issue will be mitigated with the C++0x Unicode character types.)
With that caveat noted, you can create a conversion function using the following algorithm:
#include <iostream>
#include <iterator>
#include <string>
template <class OutputIterator>
inline OutputIterator convert(const unsigned char* it, const unsigned char* end, OutputIterator out)
{
    while (it != end)
    {
        if (*it < 128) *out++ = *it++; // single byte (ASCII) character
        else if (*it < 192) ++it; // stray continuation byte, not a valid first byte
        else if (*it < 224 && it + 1 < end && *(it+1) > 127) {
            // double byte character
            *out++ = ((*it & 0x1F) << 6) | (*(it+1) & 0x3F);
            it += 2;
        }
        else if (*it < 240 && it + 2 < end && *(it+1) > 127 && *(it+2) > 127) {
            // triple byte character
            *out++ = ((*it & 0x0F) << 12) | ((*(it+1) & 0x3F) << 6) | (*(it+2) & 0x3F);
            it += 3;
        }
        else if (*it < 248 && it + 3 < end && *(it+1) > 127 && *(it+2) > 127 && *(it+3) > 127) {
            // 4-byte character
            *out++ = ((*it & 0x07) << 18) | ((*(it+1) & 0x3F) << 12) |
                ((*(it+2) & 0x3F) << 6) | (*(it+3) & 0x3F);
            it += 4;
        }
        else ++it; // Invalid byte sequence (throw an exception here if you want)
    }
    return out;
}
int main()
{
    std::string s = "\u00EAtre";
    std::cout << s.length() << std::endl; // Length in bytes
    std::wstring output;
    convert(reinterpret_cast<const unsigned char*>(s.c_str()),
        reinterpret_cast<const unsigned char*>(s.c_str()) + s.length(), std::back_inserter(output));
    std::cout << output.length() << std::endl; // Actual length
}
The algorithm isn't fully generic, because the input has to be a sequence of unsigned char, so that each byte can be interpreted as a value between 0 and 0xFF. The OutputIterator is generic (just so you can use a std::back_inserter and not worry about memory allocation), but its use as a generic parameter is limited: basically, it has to output to a sequence of elements large enough to hold a UTF-16 or UTF-32 code unit, such as wchar_t, uint32_t or the C++0x char32_t types. Also, I didn't include code to convert character byte sequences longer than 4 bytes, but you should get the point of how the algorithm works from what's posted.
Also, if you just want to count the number of characters, rather than output to a new wide-character buffer, you can modify the algorithm to include a counter rather than an OutputIterator. Or better yet, just use Marcelo Cantos' answer to count the first-bytes.
Answer by Lucas
I recommend you use UTF8-CPP. It's a header-only library for working with UTF-8 in C++. With this lib, it would look something like this:
int LengthOfUtf8String( const std::string &utf8_string )
{
    return utf8::distance( utf8_string.begin(), utf8_string.end() );
}
(Code is off the top of my head.)
Answer by Lucas
This is a naive implementation, but it should be helpful for you to see how this is done:
#include <cstddef>
#include <stdexcept>
#include <string>
std::size_t utf8_length(std::string const &s) {
    std::size_t len = 0;
    std::string::const_iterator begin = s.begin(), end = s.end();
    while (begin != end) {
        unsigned char c = *begin;
        int n; // number of bytes in the sequence that starts at c
        if ((c & 0x80) == 0) n = 1;
        else if ((c & 0xE0) == 0xC0) n = 2;
        else if ((c & 0xF0) == 0xE0) n = 3;
        else if ((c & 0xF8) == 0xF0) n = 4;
        else throw std::runtime_error("utf8_length: invalid UTF-8");
        if (end - begin < n) {
            throw std::runtime_error("utf8_length: string too short");
        }
        for (int i = 1; i < n; ++i) {
            if ((begin[i] & 0xC0) != 0x80) {
                throw std::runtime_error("utf8_length: expected continuation byte");
            }
        }
        ++len; // one character, however many bytes it occupies
        begin += n;
    }
    return len;
}
Answer by Omry Yadan
Answer by Gem Taylor
A slightly lazy approach would be to only count lead bytes, but visit every byte. This saves the complexity of decoding the various lead byte sizes, but obviously you pay to visit all the bytes, though there usually aren't that many (2x-3x):
// requires <algorithm> and <string>
size_t utf8Len(const std::string& s)
{
    return std::count_if(s.begin(), s.end(),
        [](char c) { return (static_cast<unsigned char>(c) & 0xC0) != 0x80; } );
}
Note that certain byte values are illegal as lead bytes, for example those that would represent values larger than the 21 bits needed for the full Unicode range. But then the other approach would not know how to deal with such a byte anyway.
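As a small sketch of that point, the following check flags byte values that cannot appear anywhere in well-formed UTF-8. The function name is hypothetical. The byte ranges are the ones excluded by the UTF-8 definition in RFC 3629.

```cpp
// Hypothetical helper: true for byte values that never occur in well-formed UTF-8.
bool is_never_valid_utf8_byte(unsigned char b)
{
    // 0xC0 and 0xC1 could only start overlong 2-byte encodings of ASCII.
    // 0xF5..0xFF would start sequences above U+10FFFF or are unused patterns.
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}
```

A lead-byte counter could run this check per byte to reject obviously bad input, although that is still far from full validation.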
Answer by PhotonFalcon
Most of my personal C library code has only really been tested with English text, but here is how I've implemented my UTF-8 string length function. I originally based it on the bit pattern described in the table on this wiki page. Now this isn't the most readable code, but I do like how it benchmarks with my compiler. Also, sorry for this being C code; it should translate over to std::string in C++ pretty easily with some slight modifications :).
size_t utf8len(const char* const str)
{
    size_t len = 0;
    unsigned char c = str[0];
    for (size_t i = 1; c != 0; ++len, ++i)
    {
        if ((c & 0x80))
        {
            if (c < 0xC0) // Invalid increment
                return 0;
            c >>= 4;
            if (c == 12)
                c++;
            i += c - 12;
        }
        c = str[i];
    }
    return len;
}
Note that this does not validate any of the bytes (much like all the other suggested answers here). Personally I would separate string validation out of my string length function, as that is not its responsibility. If we were to move string validation to another function, the validation could be done something like the following.
bool utf8valid(const char* const str)
{
    if (str == NULL)
        return false;
    unsigned char c = str[0];
    size_t inc = 0; // 1 + number of continuation bytes still expected
    for (size_t i = 1; c != 0; ++i)
    {
        if (inc > 1)
        {
            if ((c & 0xC0) != 0x80)
                return false;
            inc--;
        }
        else
        {
            inc = 1;
            if ((c & 0x80))
            {
                if (c < 0xC0 || c >= 0xF8)
                    return false;
                c >>= 4;
                if (c == 12)
                    c++;
                inc += c - 12;
            }
        }
        c = str[i];
    }
    return inc <= 1; // false if the string ends in the middle of a sequence
}
If you are going for readability, I'll admit that the other suggestions are quite a bit more readable, haha!
Answer by Nemanja Trifunovic
The UTF-8 CPP library has a function that does just that. You can either include the library in your project (it is small) or just look at the function. http://utfcpp.sourceforge.net/
const char* twochars = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::distance(twochars, twochars + 5);
assert (dist == 2);
Answer by twotrees
I ported this code from php-iconv to C++. You need to have iconv available first. Hope it's useful:
// Ported from PHP:
// http://lxr.php.net/xref/PHP_5_4/ext/iconv/iconv.c#_php_iconv_strlen
#include <iconv.h>
#define GENERIC_SUPERSET_NBYTES 4
#define GENERIC_SUPERSET_NAME "UCS-4LE"
// UInt32 is assumed to be a 32-bit unsigned typedef provided by the platform headers.
UInt32 iconvStrlen(const char *str, size_t nbytes, const char* encode)
{
    UInt32 retVal = (unsigned int)-1;
    unsigned int cnt = 0;
    iconv_t cd = iconv_open(GENERIC_SUPERSET_NAME, encode);
    if (cd == (iconv_t)(-1))
        return retVal;
    const char* in;
    size_t inLeft;
    char *out;
    size_t outLeft = 0; // stays 0 if the loop never runs (empty input)
    char buf[GENERIC_SUPERSET_NBYTES * 2] = {0};
    // Each pass converts at most two characters into buf and counts them.
    for (in = str, inLeft = nbytes, cnt = 0; inLeft > 0; cnt += 2)
    {
        size_t prev_in_left;
        out = buf;
        outLeft = sizeof(buf);
        prev_in_left = inLeft;
        // Some iconv implementations declare the input argument as char **; cast &in there.
        if (iconv(cd, &in, &inLeft, (char **) &out, &outLeft) == (size_t)-1) {
            if (prev_in_left == inLeft) {
                break; // no progress was made, so stop
            }
        }
    }
    iconv_close(cd);
    // The last pass may have filled only part of buf; subtract the unused slots.
    if (outLeft > 0)
        cnt -= outLeft / GENERIC_SUPERSET_NBYTES;
    retVal = cnt;
    return retVal;
}
UInt32 utf8StrLen(const std::string& src)
{
    return iconvStrlen(src.c_str(), src.length(), "UTF-8");
}