Original question: http://stackoverflow.com/questions/5117393/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Number of character cells used by string
Asked by codemuppet
I have a program that outputs a textual table using UTF-8 strings, and I need to measure the number of monospaced character cells used by a string so I can align it properly. If possible, I'd like to do this with standard functions.
Accepted answer by Maxim Egorushkin
From UTF-8 and Unicode FAQ for Unix/Linux:
The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications.
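As a minimal sketch of the portable approach quoted above (the function names count_chars_locale and use_environment_locale are made up for illustration and are not part of the FAQ), the call can be wrapped like this. Note that it counts characters, not screen columns, and that the result depends on the locale in effect when it is called.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: select the locale named by the environment
 * (LANG, LC_ALL, ...). Call once at startup. Returns 1 on success. */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: number of characters in the multibyte string s,
 * decoded according to the current LC_CTYPE locale.
 * Returns -1 if s is not a valid string in that locale. */
long count_chars_locale(const char *s)
{
    assert(s != NULL);
    /* With a NULL destination, mbstowcs only counts; nothing is stored. */
    size_t n = mbstowcs(NULL, s, 0);
    return n == (size_t)-1 ? -1L : (long)n;
}
```

If use_environment_locale() is never called, the program stays in the "C" locale, where non-ASCII UTF-8 input may be rejected or miscounted.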
Answer by Nick
If you are able to use third-party libraries, have a look at the ICU library from IBM.
Answer by mpez0
You may or may not have a UTF-8 compatible strlen(3) function available. However, there are some simple C functions readily available that do the job quickly.
The efficient C solutions examine the start of the character to skip continuation bytes. The simple code (referenced from the link above) is
int my_strlen_utf8_c(const char *s) {
    int i = 0, j = 0;
    while (s[i]) {
        /* count every byte that is not a continuation byte (10xxxxxx) */
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}
The faster version uses the same technique, but prefetches data and does multi-byte compares, resulting in a substantial speedup. The code is longer and more complex, however.
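To give a rough idea of what "multi-byte compares" can look like, here is a sketch that examines eight bytes per iteration. It is not the linked code: it leaves out the prefetching, takes an explicit length instead of scanning for the terminator, and the name utf8_count_wordwise is invented for this example. Like the simple version, it assumes the input is well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Counts UTF-8 characters (that is, non-continuation bytes) in s[0..len). */
size_t utf8_count_wordwise(const char *s, size_t len)
{
    assert(s != NULL || len == 0);
    const uint64_t ones = 0x0101010101010101ULL;
    size_t continuation = 0;
    size_t i = 0;

    /* Main loop: eight bytes at a time. */
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);   /* alignment-safe load */
        /* A continuation byte is 10xxxxxx: bit 7 set and bit 6 clear.
         * Move both bits down to bit 0 of each byte and combine them. */
        uint64_t mask = (w >> 7) & (~w >> 6) & ones;
        /* Every byte of mask is 0 or 1; the multiplication adds all
         * eight of them into the most significant byte. */
        continuation += (size_t)((mask * ones) >> 56);
    }

    /* Tail: the remaining 0..7 bytes, one at a time. */
    for (; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) == 0x80)
            continuation++;
    }
    return len - continuation;
}
```

Typical use would be utf8_count_wordwise(s, strlen(s)). Whether this actually beats the byte loop depends on the compiler and input size, so measure before relying on it.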
Answer by lmedinas
You can also use GLib, which makes your life much easier when dealing with UTF-8. See the GLib reference docs.
Answer by masakielastic
The following code takes ill-formed byte sequences into consideration. The example string data comes from "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in the Unicode Standard 6.3.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

/* True if c is a UTF-8 continuation (trail) byte, 0x80..0xBF. */
#define is_trail(c) ((c) > 0x7F && (c) < 0xC0)
#define SUCCESS 1
#define FAILURE (-1)

int utf8_get_next_char(const unsigned char*, size_t, size_t*, int*, unsigned int*);
int utf8_length(const unsigned char*, size_t);
void utf8_print_each_char(const unsigned char*, size_t);

int main(void)
{
    /* Ill-formed sample from Table 3-8: 4 valid characters and
       6 maximal ill-formed subsequences, 10 units in total. */
    const unsigned char *str =
        (const unsigned char *) "\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64";
    size_t str_size = strlen((const char*) str);

    puts(10 == utf8_length(str, str_size) ? "true" : "false");
    utf8_print_each_char(str, str_size);

    return EXIT_SUCCESS;
}

/* Counts characters; each ill-formed subsequence counts as one. */
int utf8_length(const unsigned char *str, size_t str_size)
{
    int length = 0;
    size_t next_pos = 0;
    int is_valid = 0;
    unsigned int code_point = 0;

    while (
        utf8_get_next_char(str, str_size, &next_pos, &is_valid, &code_point) == SUCCESS
    ) {
        ++length;
    }

    return length;
}

/* Prints one character per line; ill-formed subsequences print as U+FFFD. */
void utf8_print_each_char(const unsigned char *str, size_t str_size)
{
    size_t pos = 0;
    size_t next_pos = 0;
    int is_valid = 0;
    unsigned int code_point = 0;

    while (
        utf8_get_next_char(str, str_size, &next_pos, &is_valid, &code_point) == SUCCESS
    ) {
        if (is_valid) {
            printf("%.*s\n", (int) (next_pos - pos), (const char *) (str + pos));
        } else {
            puts("\xEF\xBF\xBD");
        }
        pos = next_pos;
    }
}

/* Decodes the sequence starting at *cursor and advances *cursor past it.
   Returns FAILURE at the end of the input, SUCCESS otherwise; *is_valid
   tells whether the consumed bytes formed a well-formed sequence. */
int utf8_get_next_char(const unsigned char *str, size_t str_size, size_t *cursor, int *is_valid, unsigned int *code_point)
{
    size_t pos = *cursor;
    size_t rest_size;
    unsigned char c;
    unsigned char min;
    unsigned char max;

    *code_point = 0;
    *is_valid = false;

    if (pos >= str_size) {
        return FAILURE;
    }

    rest_size = str_size - pos;   /* at least 1 from here on */
    c = str[pos];

    if (c < 0x80) {
        /* 1 byte: U+0000..U+007F */
        *code_point = str[pos];
        *is_valid = true;
        pos += 1;
    } else if (c < 0xC2) {
        /* stray trail byte, or overlong lead byte C0/C1 */
        *is_valid = false;
        pos += 1;
    } else if (c < 0xE0) {
        /* 2 bytes: U+0080..U+07FF */
        if (rest_size < 2 || !is_trail(str[pos + 1])) {
            *is_valid = false;
            pos += 1;
        } else {
            *code_point = ((str[pos] & 0x1F) << 6) | (str[pos + 1] & 0x3F);
            *is_valid = true;
            pos += 2;
        }
    } else if (c < 0xF0) {
        /* 3 bytes: E0 excludes overlongs, ED excludes surrogates */
        min = (c == 0xE0) ? 0xA0 : 0x80;
        max = (c == 0xED) ? 0x9F : 0xBF;

        if (rest_size < 2 || str[pos + 1] < min || max < str[pos + 1]) {
            *is_valid = false;
            pos += 1;
        } else if (rest_size < 3 || !is_trail(str[pos + 2])) {
            *is_valid = false;
            pos += 2;
        } else {
            *code_point = ((str[pos] & 0x0F) << 12)
                        | ((str[pos + 1] & 0x3F) << 6)
                        | (str[pos + 2] & 0x3F);
            *is_valid = true;
            pos += 3;
        }
    } else if (c < 0xF5) {
        /* 4 bytes: F0 excludes overlongs, F4 caps at U+10FFFF */
        min = (c == 0xF0) ? 0x90 : 0x80;
        max = (c == 0xF4) ? 0x8F : 0xBF;

        if (rest_size < 2 || str[pos + 1] < min || max < str[pos + 1]) {
            *is_valid = false;
            pos += 1;
        } else if (rest_size < 3 || !is_trail(str[pos + 2])) {
            *is_valid = false;
            pos += 2;
        } else if (rest_size < 4 || !is_trail(str[pos + 3])) {
            *is_valid = false;
            pos += 3;
        } else {
            *code_point = ((str[pos] & 0x7) << 18)
                        | ((str[pos + 1] & 0x3F) << 12)
                        | ((str[pos + 2] & 0x3F) << 6)
                        | (str[pos + 3] & 0x3F);
            *is_valid = true;
            pos += 4;
        }
    } else {
        /* F5..FF never appear in well-formed UTF-8 */
        *is_valid = false;
        pos += 1;
    }

    *cursor = pos;

    return SUCCESS;
}
When I write code for UTF-8, I refer to "Table 3-7. Well-Formed UTF-8 Byte Sequences" in the Unicode Standard 6.3.
Code Points           First Byte  Second Byte  Third Byte  Fourth Byte
U+0000 - U+007F       00 - 7F
U+0080 - U+07FF       C2 - DF     80 - BF
U+0800 - U+0FFF       E0          A0 - BF      80 - BF
U+1000 - U+CFFF       E1 - EC     80 - BF      80 - BF
U+D000 - U+D7FF       ED          80 - 9F      80 - BF
U+E000 - U+FFFF       EE - EF     80 - BF      80 - BF
U+10000 - U+3FFFF     F0          90 - BF      80 - BF     80 - BF
U+40000 - U+FFFFF     F1 - F3     80 - BF      80 - BF     80 - BF
U+100000 - U+10FFFF   F4          80 - 8F      80 - BF     80 - BF
Answer by Functino
I'm shocked that no one mentioned this, so here it goes for the record:
If you want to align text in a terminal, you need to use the POSIX functions wcwidth and wcswidth. Here's a correct program to find the on-screen length of a string.
#define _XOPEN_SOURCE
#include <wchar.h>
#include <stdio.h>
#include <locale.h>
#include <stdlib.h>

int measure(const char *string) {
    // number of wide characters, not counting the terminator
    size_t length = mbstowcs(NULL, string, 0);
    if (length == (size_t)-1) return -2;   // invalid multibyte sequence

    // allocate enough memory to hold the wide string
    size_t needed = length + 1;
    wchar_t *wcstring = malloc(needed * sizeof *wcstring);
    if (!wcstring) return -1;

    // change encodings
    if (mbstowcs(wcstring, string, needed) == (size_t)-1) {
        free(wcstring);
        return -2;
    }

    // measure width (wcswidth itself returns -1 for non-printable characters)
    int width = wcswidth(wcstring, needed);
    free(wcstring);
    return width;
}

int main(int argc, char **argv) {
    setlocale(LC_ALL, "");
    for (int i = 1; i < argc; i++) {
        printf("%s: %d\n", argv[i], measure(argv[i]));
    }
}
Here's an example of it running:
$ ./measure hello 莊子 cＡb
hello: 5
莊子: 4
cＡb: 4
Note how the two characters "莊子" and the three characters "cＡb" (note the double-width Ａ) are both 4 columns wide.
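Since the original question was about aligning a table, here is a rough sketch of how a measured width could be turned into padding. It walks the string with mbrtowc and sums wcwidth per character, so no temporary wide string is allocated. The names display_width, pad_columns and print_cell are invented for this example, and it assumes setlocale(LC_ALL, "") has already been called with a UTF-8 locale in effect.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Columns occupied by s in the current locale, or -1 if s contains an
 * invalid multibyte sequence or a non-printable character. */
int display_width(const char *s)
{
    assert(s != NULL);
    mbstate_t state;
    memset(&state, 0, sizeof state);
    size_t left = strlen(s);
    int total = 0;

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* invalid or truncated sequence */
        if (n == 0)
            break;                  /* NUL character: stop */
        int w = wcwidth(wc);
        if (w < 0)
            return -1;              /* non-printable character */
        total += w;
        s += n;
        left -= n;
    }
    return total;
}

/* Number of spaces needed to fill s out to cell_width columns
 * (0 if s is already wider or cannot be measured). */
int pad_columns(const char *s, int cell_width)
{
    int w = display_width(s);
    return (w < 0 || w >= cell_width) ? 0 : cell_width - w;
}

/* Writes s left-aligned in a cell of cell_width columns. */
void print_cell(FILE *out, const char *s, int cell_width)
{
    fputs(s, out);
    for (int i = pad_columns(s, cell_width); i > 0; i--)
        fputc(' ', out);
}
```

Calling print_cell(stdout, name, 12) for every row should then keep the next column lined up even when name mixes narrow and wide characters, provided the terminal agrees with the C library about character widths.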
As utf8everywhere.org puts it,
The size of the string as it appears on the screen is unrelated to the number of code points in the string. One has to communicate with the rendering engine for this. Code points do not occupy one column even in monospace fonts and terminals. POSIX takes this into account.
Windows does not have any built-in wcwidth function for console output; if you want to support multi-column characters in the Windows console you need to either find a portable implementation of wcwidth or give up, because the Windows console doesn't support Unicode without crazy hacks.