C: How to use UTF-8 in C code?

Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and cite the original: http://stackoverflow.com/questions/30388085/

Time: 2020-09-02 11:57:31  Source: igfitidea

How to use UTF-8 in C code?

Tags: c, utf-8

Asked by Igor Liferenko

My setup: gcc-4.9.2, UTF-8 environment.

The following C-program works in ASCII, but does not in UTF-8.

Create input file:

echo -n 'привет мир' > /tmp/вход

This is test.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 10

int main(void)
{
  char buf[SIZE+1];
  char *pat = "привет мир";
  char str[SIZE+2];

  FILE *f1;
  FILE *f2;

  f1 = fopen("/tmp/вход","r");
  f2 = fopen("/tmp/выход","w");

  if (fread(buf, 1, SIZE, f1) > 0) {
    buf[SIZE] = 0;

    if (strncmp(buf, pat, SIZE) == 0) {
      sprintf(str, "% 11s\n", buf);
      fwrite(str, 1, SIZE+2, f2);
    }
  }

  fclose(f1);
  fclose(f2);

  exit(0);
}

Check the result:

./test; grep -q ' привет мир' /tmp/выход && echo OK

What should be done to make UTF-8 code work as if it were ASCII code, without having to care how many bytes a symbol takes, etc.? In other words: what should be changed in the example to treat any UTF-8 symbol as a single unit (including argv, STDIN, STDOUT, STDERR, file input and output, and the program code)?

Answer by Siddhartha Ghosh

#define SIZE 10

The buffer size of 10 is insufficient to store the UTF-8 string привет мир, which occupies 19 bytes. Try changing it to a larger value. On my system (Ubuntu 12.04, gcc 4.8.1), changing it to 20 worked perfectly.

UTF-8 is a multibyte encoding which uses between 1 and 4 bytes per character. So, it is safer to use 40 (10 characters times 4 bytes) as the buffer size above. There is a long discussion at "How many bytes does one Unicode character take?", which might be interesting.
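
To make the byte-versus-character distinction concrete, here is a minimal sketch (not part of the original answer) that counts code points by skipping UTF-8 continuation bytes. The names utf8_count_codepoints and utf8_worst_case_bytes are hypothetical, and the code assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string.
 * Every byte that is NOT a continuation byte (10xxxxxx) starts a new
 * code point. Assumes the input is valid UTF-8; no validation is done. */
size_t utf8_count_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: worst-case buffer size in bytes for n code points,
 * at 4 bytes each plus the terminating NUL. */
size_t utf8_worst_case_bytes(size_t n)
{
    return 4 * n + 1;
}
```

For "привет мир", strlen reports 19 bytes while utf8_count_codepoints reports 10, which is why reading only SIZE = 10 bytes cuts the string off in the middle.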

Answer by Jonathan Leffler

Siddhartha Ghosh's answer gives you the basic problem. Fixing your code requires more work, though.

I used the following script (chk-utf8-test.sh):

echo -n 'привет мир' > вход
make utf8-test
./utf8-test
grep -q 'привет мир' выход && echo OK

I called your program utf8-test.c and amended the source like this, removing the references to /tmp, and being more careful with lengths:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 40

int main(void)
{
    char buf[SIZE + 1];
    char *pat = "привет мир";
    char str[SIZE + 2];

    FILE *f1 = fopen("вход", "r");
    FILE *f2 = fopen("выход", "w");

    if (f1 == 0 || f2 == 0)
    {
        fprintf(stderr, "Failed to open one or both files\n");
        return(1);
    }

    size_t nbytes;
    if ((nbytes = fread(buf, 1, SIZE, f1)) > 0)
    {
        buf[nbytes] = 0;

        if (strncmp(buf, pat, nbytes) == 0)
        {
            sprintf(str, "%.*s\n", (int)nbytes, buf);
            fwrite(str, 1, nbytes, f2);
        }
    }

    fclose(f1);
    fclose(f2);

    return(0);
}

And when I ran the script, I got:

$ bash -x chk-utf8-test.sh
+ '[' -f /etc/bashrc ']'
+ . /etc/bashrc
++ '[' -z '' ']'
++ return
+ alias 'r=fc -e -'
+ echo -n 'привет мир'
+ make utf8-test
gcc -O3 -g -std=c11 -Wall -Wextra -Werror utf8-test.c -o utf8-test
+ ./utf8-test
+ grep -q 'привет мир' $'в?3?5од'
+ echo OK
OK
$

For the record, I was using GCC 5.1.0 on Mac OS X 10.10.3.

Answer by tripleee

This is more of a corollary to the other answers, but I'll try to explain this from a slightly different angle.

Here is Jonathan Leffler's version of your code, with three slight changes: (1) I made explicit the actual individual bytes in the UTF-8 strings; and (2) I modified the sprintf formatting string width specifier to hopefully do what you are actually attempting to do. Also, tangentially, (3) I used perror to get a slightly more useful error message when something fails.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 40

int main(void)
{
  char buf[SIZE + 1];
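  char *pat = "\320\277\321\200\320\270\320\262\320\265\321\202"
    " \320\274\320\270\321\200";  /* "привет мир" */
  char str[SIZE + 3];  /* one pad space + up to SIZE bytes + '\n' + NUL */

  FILE *f1 = fopen("\320\262\321\205\320\276\320\264", "r");  /* "вход" */
  FILE *f2 = fopen("\320\262\321\213\321\205\320\276\320\264", "w");  /* "выход" */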

  if (f1 == 0 || f2 == 0)
    {
      perror("Failed to open one or both files");  /* use perror() */
      return(1);
    }

  size_t nbytes;
  if ((nbytes = fread(buf, 1, SIZE, f1)) > 0)
    {
      buf[nbytes] = 0;

      if (strncmp(buf, pat, nbytes) == 0)
        {
          sprintf(str, "%*s\n", 1+(int)nbytes, buf);  /* nbytes+1 length specifier */
          fwrite(str, 1, 1+nbytes, f2); /* +1 here too */
        }
    }

  fclose(f1);
  fclose(f2);

  return(0);
}

The behavior of sprintf with a positive numeric width specifier is to pad with spaces from the left, so the space you tried to use is superfluous. But you have to make sure the target field is wider than the string you are printing in order for any padding to actually take place. Note that the width is counted in bytes, not in characters.
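
As an illustration of that last point (my own sketch, not code from the answer), the padding can be computed in code points instead of bytes and passed to snprintf through "%*s". The helper names count_codepoints and utf8_pad_left are hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: count code points by skipping continuation bytes.
 * Assumes valid UTF-8. */
size_t count_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: right-align s in a field of `width` code points.
 * Returns what snprintf returns: the number of bytes the full result
 * needs, excluding the terminating NUL. */
int utf8_pad_left(char *out, size_t outsz, const char *s, size_t width)
{
    size_t chars = count_codepoints(s);
    size_t pad = width > chars ? width - chars : 0;

    /* "%*s" applied to an empty string emits exactly `pad` spaces. */
    return snprintf(out, outsz, "%*s%s", (int)pad, "", s);
}
```

With the 19-byte string "привет мир", a plain "%11s" adds no padding at all because 19 already exceeds 11, whereas utf8_pad_left(out, sizeof out, s, 11) produces one leading space followed by the 10 characters.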

Just to make this answer self-contained, I will repeat what others have already said. A traditional char is always exactly one byte, but one character in UTF-8 is usually not exactly one byte, except when all your characters are actually ASCII. One of the attractions of UTF-8 is that legacy C code doesn't need to know anything about UTF-8 in order to continue to work, but of course, the assumption that one char is one glyph cannot hold. (As you can see, for example, the glyph п in "привет мир" maps to the two bytes -- and hence, two chars -- "\320\277".)
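
To show how the two bytes "\320\277" become the single character п (U+043F), here is a small decoding sketch of my own. The names utf8_seq_len and utf8_decode are hypothetical; the decoder checks continuation bytes but deliberately skips stricter validation such as rejecting overlong forms or surrogates.

```c
#include <stddef.h>

/* Hypothetical helper: length in bytes of the sequence announced by a
 * lead byte, or 0 for a continuation byte or for 0xF8..0xFF.
 * It does not reject the invalid lead bytes 0xC0, 0xC1 and 0xF5..0xF7. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Hypothetical helper: decode one code point starting at s.
 * Returns the number of bytes consumed, or 0 on malformed input. */
size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    static const unsigned char lead_mask[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    size_t len = utf8_seq_len(s[0]);
    if (len == 0)
        return 0;

    unsigned long value = s[0] & lead_mask[len];
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* also stops at a premature NUL */
            return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }
    *cp = value;
    return len;
}
```

For п, the lead byte 0320 (0xD0) announces a two-byte sequence, and combining its low five bits with the low six bits of 0277 (0xBF) gives 0x43F.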

This is clearly less than ideal, but demonstrates that you can treat UTF-8 as "just bytes" if your code doesn't particularly care about glyph semantics. If yours does, you are better off switching to wchar_t as outlined e.g. here: http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html
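
A rough sketch of what that switch can look like with only the standard library (my own illustration, not taken from the linked manual): convert the bytes to wide characters with mbstowcs after selecting a UTF-8 locale. The helper names select_utf8_locale and utf8_to_wide are hypothetical, the locale names "C.UTF-8" and "en_US.UTF-8" are assumptions that may not exist on every system, and the probe used to detect a UTF-8 locale is only a heuristic.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: try the environment's locale first, then two
 * commonly available UTF-8 locale names (an assumption). Heuristic
 * probe: accept a locale if the bytes of "п" decode as ONE character.
 * Note that this changes LC_CTYPE for the whole process. */
int select_utf8_locale(void)
{
    static const char *const names[] = { "", "C.UTF-8", "en_US.UTF-8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        wchar_t wc;
        if (setlocale(LC_CTYPE, names[i]) == NULL)
            continue;
        mbtowc(NULL, NULL, 0);                /* reset conversion state */
        if (mbtowc(&wc, "\320\277", 2) == 2)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: convert UTF-8 bytes to a NUL-terminated wide
 * string. Returns the number of wide characters stored, or (size_t)-1
 * if no UTF-8 locale was found or the input is not valid in it. */
size_t utf8_to_wide(wchar_t *dst, size_t dstlen, const char *src)
{
    if (dstlen == 0 || !select_utf8_locale())
        return (size_t)-1;

    size_t n = mbstowcs(dst, src, dstlen - 1);
    if (n == (size_t)-1)
        return n;
    dst[n] = L'\0';   /* mbstowcs does not terminate when dst fills up */
    return n;
}
```

On a system where wchar_t can hold any code point (such as glibc), each element of the result is one character, so functions like wcslen and wcsncmp count characters rather than bytes.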

However, the standard wchar_t is less than ideal when the standard expectation is UTF-8. See e.g. the GNU libunistring documentation for a less intrusive alternative, and a bit of background. With that, you should be able to replace char with uint8_t and the various str* functions with u8_str* replacements and be done. The assumption that one glyph equals one byte will still need to be addressed, but that becomes a minor technicality in your example program. An adaptation is available at http://ideone.com/p0VfXq (though unfortunately the library is not available on http://ideone.com/ so it cannot be demonstrated there).

Answer by i486

Probably your test.c file is not stored in UTF-8 format, and for that reason the "привет мир" string is ASCII - and the comparison failed. Change the text encoding of the source file and try again.

Answer by Igor Liferenko

The following code works as required:

#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

#define SIZE 10

int main(void)
{
  setlocale(LC_ALL, "");
  wchar_t buf[SIZE+1];
  wchar_t *pat = L"привет мир";
  wchar_t str[SIZE+2];

  FILE *f1;
  FILE *f2;

  f1 = fopen("/tmp/вход","r");
  f2 = fopen("/tmp/выход","w");

  fgetws(buf, SIZE+1, f1);

  if (wcsncmp(buf, pat, SIZE) == 0) {
    swprintf(str, SIZE+2, L"% 11ls", buf);
    fputws(str, f2);
  }

  fclose(f1);
  fclose(f2);

  exit(0);
}