C: Printing UTF-8 strings with printf - wide vs. multibyte string literals

Notice: this page is a Chinese-English bilingual translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/15528359/

Time: 2020-09-02 05:47:41  Source: igfitidea

Printing UTF-8 strings with printf - wide vs. multibyte string literals

Tags: c, unicode, utf-8, printf, multibyte

Asked by teppic

In statements like these, where both are entered into the source code with the same encoding (UTF-8) and the locale is set up properly, is there any practical difference between them?

printf("ο Δικαιοπολις εν αγρω εστιν\n");
printf("%ls", L"ο Δικαιοπολις εν αγρω εστιν\n");

And consequently is there any reason to prefer one over the other when doing output? I imagine the second performs a fair bit worse, but does it have any advantage (or disadvantage) over a multibyte literal?

EDIT: There are no issues with these strings printing. But I'm not using the wide string functions, because I want to be able to use printf etc. as well. So the question is: are these ways of printing any different (given the situation outlined above), and if so, does the second one have any advantage?

EDIT2: Following the comments below, I now know that this program works -- which I thought wasn't possible:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");
    wprintf(L"ο Δικαιοπολις εν αγρω εστιν\n");  // wide output
    freopen(NULL, "w", stdout);                 // lets me switch
    printf("ο Δικαιοπολις εν αγρω εστιν\n");    // byte output
    return 0;
}
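
The reason the stream has to be reopened is that a C stream acquires an orientation (wide or byte) on its first I/O operation and keeps it. The following is a minimal sketch of how that can be observed with the standard `fwide` function; the helper names `stream_orientation` and `orientation_demo` are made up for illustration, and a temporary file is used instead of stdout so the result does not depend on earlier output.

```c
#include <stdio.h>
#include <wchar.h>

/* Query the orientation of a stream without changing it:
   > 0 means wide-oriented, < 0 byte-oriented, 0 not yet oriented. */
int stream_orientation(FILE *f)
{
    return fwide(f, 0);
}

/* Hypothetical helper: opens a fresh temporary stream, records its
   orientation, writes one wide string, and records the orientation again.
   Returns 0 on success, -1 if the temporary file could not be created. */
int orientation_demo(int *before, int *after)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    *before = stream_orientation(f);  /* fresh stream: no orientation */
    fputws(L"abc\n", f);              /* first I/O call is a wide one */
    *after = stream_orientation(f);   /* stream is now wide-oriented */
    fclose(f);
    return 0;
}
```

Once a stream is wide-oriented, byte functions such as printf should not be used on it, which is why the program above reopens stdout before switching.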


EDIT3: I've done some further research by looking at what's going on with the two types. Take a simpler string:

wchar_t *wides = L"£100 π";
char *mbs = "£100 π";

The compiler is generating different code. The wide string is:

.string "\243"
.string ""
.string ""
.string "1"
.string ""
.string ""
.string "0"
.string ""
.string ""
.string "0"
.string ""
.string ""
.string " "
.string ""
.string ""
.string "\300\003"
.string ""
.string ""
.string ""
.string ""
.string ""

While the second is:

.string "\302\243100 \317\200"

And looking at the Unicode encodings, the second is plain UTF-8. The wide character representation is UTF-32. I realise this is going to be implementation-dependent.
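
To make the relationship between the two representations concrete, here is a small sketch of a UTF-8 encoder for a single code point. The function name `utf8_encode` is an assumption for illustration, not something from the question. With it, U+00A3 (£) yields the bytes C2 A3 (octal \302\243) and U+03C0 (π) yields CF 80 (octal \317\200), which are the non-ASCII bytes in the multibyte literal.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is not a valid
   scalar value (a surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

In the wide literal, by contrast, each character simply occupies one fixed-size wchar_t holding the code point value, assuming a platform where wchar_t is 32 bits.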

So perhaps the wide character representation of literals is more portable? My system will not print UTF-16/UTF-32 encodings directly, so it is being automatically converted to UTF-8 for output.
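
The automatic conversion mentioned here can also be done by hand. The sketch below uses the standard `wcstombs` function, which converts a wide string to the multibyte encoding of the current LC_CTYPE locale; the wrapper name `wide_to_multibyte` is hypothetical. It is only an illustration of the kind of conversion involved, not a statement about how any particular printf implementation is written.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a wide string to a NUL-terminated multibyte string using the
   current LC_CTYPE locale. Returns the number of bytes written (excluding
   the terminator), or (size_t)-1 if the buffer is empty or a character
   cannot be represented in the locale's encoding. */
size_t wide_to_multibyte(const wchar_t *ws, char *buf, size_t bufsize)
{
    if (bufsize == 0)
        return (size_t)-1;
    size_t n = wcstombs(buf, ws, bufsize - 1);  /* leave room for the NUL */
    if (n == (size_t)-1)
        return n;                               /* unrepresentable character */
    buf[n] = '\0';                              /* wcstombs may not terminate */
    return n;
}
```

For non-ASCII input the result depends on the locale, so a program would normally call setlocale(LC_ALL, "") first. In a UTF-8 locale the conversion produces UTF-8 bytes, while in the default "C" locale non-ASCII characters may fail to convert.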

Answered by LihO

printf("ο Δικαιοπολις εν αγρω εστιν\n");

prints the string literal (const char*, special characters are represented as multibyte characters). Although you might see the correct output, there are other problems you might run into when working with non-ASCII characters like these. For example:

char str[] = "αγρω";
printf("%zu %zu\n", sizeof(str), strlen(str));

outputs 9 8, since each of these special characters is represented by 2 chars.
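
If the number of characters rather than bytes is needed for a UTF-8 string, one common approach is to count every byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx). The sketch below assumes the input is valid UTF-8; the function name `utf8_count` is made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated string, assuming valid UTF-8.
   Every code point starts with exactly one byte that is NOT of the form
   10xxxxxx, so counting those lead bytes counts the code points. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the string "αγρω" encoded as UTF-8, strlen reports 8 bytes while this function reports 4 code points.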

When you use the L prefix, the literal consists of wide characters (const wchar_t*), and the %ls format specifier causes these wide characters to be converted to multibyte characters (UTF-8). Note that in this case the locale should be set appropriately, otherwise this conversion might lead to the output being invalid:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");
    printf("%ls", L"ο Δικαιοπολις εν αγρω εστιν");
    return 0;
}

but while some things might get more complicated when working with wide characters, other things might get much simpler and more straightforward. For example:

wchar_t str[] = L"αγρω";
printf("%zu %zu", sizeof(str) / sizeof(wchar_t), wcslen(str));

will output 5 4 as one would naturally expect.

Once you decide to work with wide strings, wprintf can be used to print wide characters directly. It's also worth noting that in the case of the Windows console, the translation mode of stdout should be explicitly set to one of the Unicode modes by calling _setmode:

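#include <stdio.h>
#include <wchar.h>

#include <io.h>
#include <fcntl.h>
#ifndef _O_U16TEXT
  #define _O_U16TEXT 0x20000
#endif

int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    // Note: with Microsoft's wprintf, %s expects a wide string
    wprintf(L"%s\n", L"ο Δικαιοπολις εν αγρω εστιν");
    return 0;
}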