Linux: number of character cells used by a string

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/5117393/

Time: 2020-08-05 03:01:24  Source: igfitidea

Number of character cells used by string

c linux string utf-8

Asked by codemuppet

I have a program that outputs a textual table using UTF-8 strings, and I need to measure the number of monospaced character cells used by a string so I can align it properly. If possible, I'd like to do this with standard functions.

Accepted answer by Maxim Egorushkin

From UTF-8 and Unicode FAQ for Unix/Linux:

The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications.
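As a minimal sketch of the portable technique the FAQ describes, the function below wraps mbstowcs(NULL, s, 0). The name count_chars is hypothetical and not part of any library. It counts characters, not screen columns, and the result depends on the locale selected with setlocale.

```c
#include <stdlib.h>

/* Hypothetical helper: number of characters in the multibyte string s,
 * decoded according to the current locale (LC_CTYPE).
 * Returns (size_t)-1 if s contains an invalid multibyte sequence. */
size_t count_chars(const char *s) {
    /* With a NULL destination, mbstowcs only counts and stores nothing. */
    return mbstowcs(NULL, s, 0);
}
```

For UTF-8 input, a program would typically call setlocale(LC_ALL, "") once at startup, assuming the user's environment selects a UTF-8 locale; without that call the default "C" locale is in effect.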

Answer by Nick

If you are able to use 3rd party libraries, have a look at the ICU library from IBM:

http://site.icu-project.org/

Answer by mpez0

You may or may not have a UTF-8 compatible strlen(3) function available. However, there are some simple C functions readily available that do the job quickly.

The efficient C solutions examine the start of the character to skip continuation bytes. The simple code (referenced from the link above) is

int my_strlen_utf8_c(const char *s) {
   int i = 0, j = 0;
   while (s[i]) {
     /* count every byte that is not a continuation byte (10xxxxxx) */
     if ((s[i] & 0xc0) != 0x80) j++;
     i++;
   }
   return j;
}

The faster version uses the same technique, but prefetches data and does multi-byte compares, resulting in a substantial speedup. The code is longer and more complex, however.
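The faster code itself is not reproduced in this answer. As a rough illustration of the multi-byte compare idea only, the sketch below examines eight bytes per iteration and counts continuation bytes with bit operations. It is not the referenced implementation: it does no prefetching, the name utf8_count_fast is hypothetical, and it assumes the caller already knows the byte length.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: count UTF-8 characters in the first len bytes of s
 * by subtracting the number of continuation bytes (10xxxxxx) from len. */
size_t utf8_count_fast(const char *s, size_t len) {
    const uint64_t ones = 0x0101010101010101ULL;
    size_t continuation = 0;
    size_t i = 0;

    /* Process 8 bytes at a time. */
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);  /* avoids unaligned-access problems */
        /* The lowest bit of each byte becomes 1 when that byte has
         * bit 7 set and bit 6 clear, i.e. it is a continuation byte. */
        uint64_t u = ((w >> 7) & (~w >> 6)) & ones;
        /* Multiplying by ones sums the eight flags into the top byte. */
        continuation += (size_t)((u * ones) >> 56);
    }

    /* Handle the remaining 0 to 7 bytes one at a time. */
    for (; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) == 0x80) continuation++;
    }

    return len - continuation;
}
```

Like the simple version, this counts characters and assumes well-formed input; it says nothing about how many columns the string occupies on screen.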

Answer by lmedinas

You can also use glib, which makes your life much easier when dealing with UTF-8. See the glib reference docs.

Answer by masakielastic

The following code takes ill-formed byte sequences into consideration. The example string data comes from "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in the Unicode Standard 6.3.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

#define is_trail(c) ((c) > 0x7F && (c) < 0xC0)
#define SUCCESS 1
#define FAILURE -1

int utf8_get_next_char(const unsigned char*, size_t, size_t*, int*, unsigned int*);
int utf8_length(unsigned char*, size_t);
void utf8_print_each_char(unsigned char*, size_t);

int main(void)
{
    unsigned char *str;
    str = (unsigned char *) "\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64";
    size_t str_size = strlen((const char*) str);

    puts(10 == utf8_length(str, str_size) ? "true" : "false");
    utf8_print_each_char(str, str_size);

    return EXIT_SUCCESS;
}

int utf8_length(unsigned char *str, size_t str_size)
{
    int length = 0;
    size_t pos = 0;
    size_t next_pos = 0;
    int is_valid = 0;
    unsigned int code_point = 0;

    while (
        utf8_get_next_char(str, str_size, &next_pos, &is_valid, &code_point) == SUCCESS
    ) {
        ++length;
    }

    return length;
}

void utf8_print_each_char(unsigned char *str, size_t str_size)
{
    int length = 0;
    size_t pos = 0;
    size_t next_pos = 0;
    int is_valid = 0;
    unsigned int code_point = 0;

    while (
        utf8_get_next_char(str, str_size, &next_pos, &is_valid, &code_point) == SUCCESS
    ) {
        if (is_valid == true) {
            printf("%.*s\n", (int) next_pos - (int) pos, str + pos);
        } else {
            puts("\xEF\xBF\xBD");
        }

        pos = next_pos;
    }
}

int utf8_get_next_char(const unsigned char *str, size_t str_size, size_t *cursor, int *is_valid, unsigned int *code_point)
{
    size_t pos = *cursor;
    size_t rest_size = str_size - pos;
    unsigned char c;
    unsigned char min;
    unsigned char max;

    *code_point = 0;
    *is_valid = SUCCESS;

    if (*cursor >= str_size) {
        return FAILURE;
    }

    c = str[pos];

    if (rest_size < 1) {
        *is_valid = false;
        pos += 1;
    } else if (c < 0x80) {
        *code_point = str[pos];
        *is_valid = true;
        pos += 1;
    } else if (c < 0xC2) {
        *is_valid = false;
        pos += 1;
    } else if (c < 0xE0) {

        if (rest_size < 2 || !is_trail(str[pos + 1])) {
            *is_valid = false;
            pos += 1;
        } else {
            *code_point = ((str[pos] & 0x1F) << 6) | (str[pos + 1] & 0x3F);
            *is_valid = true;
            pos += 2;
        }

    } else if (c < 0xF0) {

        min = (c == 0xE0) ? 0xA0 : 0x80;
        max = (c == 0xED) ? 0x9F : 0xBF;

        if (rest_size < 2 || str[pos + 1] < min || max < str[pos + 1]) {
            *is_valid = false;
            pos += 1;         
        } else if (rest_size < 3 || !is_trail(str[pos + 2])) {
            *is_valid = false;
            pos += 2;
        } else {
            *code_point = ((str[pos]     & 0x0F) << 12)
                       | ((str[pos + 1] & 0x3F) <<  6)
                       |  (str[pos + 2] & 0x3F);
            *is_valid = true;
            pos += 3;
        }

    } else if (c < 0xF5) {

        min = (c == 0xF0) ? 0x90 : 0x80;
        max = (c == 0xF4) ? 0x8F : 0xBF;

        if (rest_size < 2 || str[pos + 1] < min || max < str[pos + 1]) {
            *is_valid = false;
            pos += 1;
        } else if (rest_size < 3 || !is_trail(str[pos + 2])) {
            *is_valid = false;
            pos += 2;
        } else if (rest_size < 4 || !is_trail(str[pos + 3])) {
            *is_valid = false;
            pos += 3;
        } else {
            *code_point = ((str[pos]     &  0x7) << 18)
                       | ((str[pos + 1] & 0x3F) << 12)
                       | ((str[pos + 2] & 0x3F) << 6)
                       |  (str[pos + 3] & 0x3F);
            *is_valid = true;
            pos += 4;
        }

    } else {
        *is_valid = false;
        pos += 1;
    }

    *cursor = pos;

    return SUCCESS;
}

When I write code for UTF-8, I refer to "Table 3-7. Well-Formed UTF-8 Byte Sequences" in the Unicode Standard 6.3.

       Code Points    First Byte Second Byte Third Byte Fourth Byte
  U+0000 -   U+007F   00 - 7F
  U+0080 -   U+07FF   C2 - DF    80 - BF
  U+0800 -   U+0FFF   E0         A0 - BF     80 - BF
  U+1000 -   U+CFFF   E1 - EC    80 - BF     80 - BF
  U+D000 -   U+D7FF   ED         80 - 9F     80 - BF
  U+E000 -   U+FFFF   EE - EF    80 - BF     80 - BF
 U+10000 -  U+3FFFF   F0         90 - BF     80 - BF    80 - BF
 U+40000 -  U+FFFFF   F1 - F3    80 - BF     80 - BF    80 - BF
U+100000 - U+10FFFF   F4         80 - 8F     80 - BF    80 - BF

Answer by Functino

I'm shocked that no one mentioned this, so here it goes for the record:

If you want to align text in a terminal, you need to use the POSIX functions wcwidth and wcswidth. Here's a correct program to find the on-screen length of a string.

#define _XOPEN_SOURCE
#include <wchar.h>
#include <stdio.h>
#include <locale.h>
#include <stdlib.h>

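int measure(char *string) {
    // find out how many wide characters the string converts to
    size_t length = mbstowcs(NULL, string, 0);
    if (length == (size_t)-1) return -2;  // invalid multibyte sequence

    // allocate enough memory to hold the wide string and its terminator
    wchar_t *wcstring = malloc((length + 1) * sizeof *wcstring);
    if (!wcstring) return -1;

    // change encodings
    if (mbstowcs(wcstring, string, length + 1) == (size_t)-1) {
        free(wcstring);
        return -2;
    }

    // measure width (wcswidth returns -1 if a character is non-printable)
    int width = wcswidth(wcstring, length);

    free(wcstring);
    return width;
}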

int main(int argc, char **argv) {
    setlocale(LC_ALL, "");

    for (int i = 1; i < argc; i++) {
        printf("%s: %d\n", argv[i], measure(argv[i]));
    }
}

Here's an example of it running:

$ ./measure hello 莊子 cAb
hello: 5
莊子: 4
cAb: 4

Note how the two characters "莊子" and the three characters "cAb" (note the double-width A) are both 4 columns wide.
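To connect this back to the table alignment asked about in the question, here is a small sketch of how a measured width might be turned into padding. The names display_width and print_cell are hypothetical, and the code assumes setlocale(LC_ALL, "") has already been called so that UTF-8 input is decoded correctly.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: on-screen width of s in columns, or -1 if s is
 * invalid in the current locale or contains a non-printable character. */
int display_width(const char *s) {
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1) return -1;

    wchar_t *ws = malloc((n + 1) * sizeof *ws);
    if (!ws) return -1;

    mbstowcs(ws, s, n + 1);
    int width = wcswidth(ws, n);
    free(ws);
    return width;
}

/* Hypothetical helper: print s left-aligned in a field of cols columns.
 * Padding is computed from the display width, not from the byte count. */
void print_cell(const char *s, int cols) {
    int width = display_width(s);
    if (width < 0) width = 0;  /* unknown width: print without padding */
    int pad = cols > width ? cols - width : 0;
    printf("%s%*s", s, pad, "");
}
```

Calling print_cell for each column and then printing a separator keeps the columns lined up even when cells mix single-width and double-width characters, which printf("%-10s", ...) alone cannot do because it pads by bytes.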

As utf8everywhere.org puts it,

The size of the string as it appears on the screen is unrelated to the number of code points in the string. One has to communicate with the rendering engine for this. Code points do not occupy one column even in monospace fonts and terminals. POSIX takes this into account.

Windows does not have any built-in wcwidth function for console output; if you want to support multi-column characters in the Windows console, you need to find a portable implementation of wcwidth, or give up, because the Windows console doesn't support Unicode without crazy hacks.
