Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2697843/
Time: 2020-09-09 00:43:01  Source: igfitidea

Storing UTF-8 string in a UnicodeString

Tags: string, delphi, unicode, utf-8, utf-16

Asked by Mick

In Delphi 2007 you can store a UTF-8 string in a WideString and then pass it on to a Win32 function, e.g.

var
  UnicodeStr: WideString;
  UTF8Str: WideString;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

Delphi 2007 does not interfere with the contents of UTF8Str, i.e. it is left as a UTF-8 encoded string stored in a WideString.

But in Delphi 2010 I'm struggling to find a way to do the same thing, i.e. store a UTF-8 encoded string in a WideString without it being automatically converted from UTF-8. I cannot pass a pointer to a UTF-8 string (or RawByteString), e.g. the following will obviously not work:

var
  UnicodeStr: WideString;
  UTF8Str: UTF8String;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

Answered by Zoë Peterson

Your original Delphi 2007 code was converting the UTF-8 string to a widestring using the ANSI codepage. To do the same thing in Delphi 2010 you should use SetCodePage with the Convert parameter false.

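var
  UnicodeStr: UnicodeString;
  UTF8Str: RawByteString;
begin
  UTF8Str := UTF8Encode('some unicode text');
  // Retag the bytes as ANSI (code page 0) without converting them.
  SetCodePage(UTF8Str, 0, False);
  // The assignment now widens the bytes using the ANSI code page.
  UnicodeStr := UTF8Str;
  Windows.SomeFunction(PWideChar(UnicodeStr), ...)
end;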
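
To see what the retagging approach above does to the data, here is a minimal console sketch. The program name and variable names are made up for illustration, and the hex values it prints depend on the machine's ANSI code page, so no output is claimed here. It assumes a Unicode Delphi (2009 or later).

```pascal
program RetagDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  Source, Wide: UnicodeString;
  Raw: RawByteString;
  I: Integer;
begin
  // U+00FC is "ü"; written as a code point so the source file encoding does not matter.
  Source := #$00FC;
  Raw := UTF8Encode(Source);
  Writeln('Code page before: ', StringCodePage(Raw));
  // Convert = False changes only the code page tag; the bytes stay as they are.
  SetCodePage(Raw, 0, False);
  Writeln('Code page after:  ', StringCodePage(Raw));
  // Each UTF-8 byte is now widened through the ANSI code page.
  Wide := UnicodeString(Raw);
  for I := 1 to Length(Wide) do
    Write(IntToHex(Ord(Wide[I]), 4), ' ');
  Writeln;
end.
```

The loop should print one UTF-16 code unit per UTF-8 byte, which is the Delphi 2007 behaviour the question asks to reproduce.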

Answered by Runner

Hmm, why are you doing that? Why are you encoding a WideString to UTF-8 just to store it back in a WideString? You are obviously using a Unicode version of the Windows API, so there is no need for a UTF-8-encoded string. Or am I missing something?

Windows API functions are either Unicode (two bytes per character) or ANSI (one byte per character). UTF-8 would be the wrong choice here, because it mostly uses one byte per character, but for characters beyond the ASCII range it uses two or more bytes.
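
As a rough illustration of the A/W split, the sketch below calls the two variants of lstrlen from the Windows unit on the same ASCII text. The program and variable names are illustrative only.

```pascal
program AnsiVsWide;
{$APPTYPE CONSOLE}
uses
  Windows;
var
  Wide: UnicodeString;
  Ansi: AnsiString;
begin
  Wide := 'abc';
  Ansi := 'abc';
  // The W variant takes a PWideChar, the A variant a PAnsiChar.
  Writeln(lstrlenW(PWideChar(Wide)));        // 3 characters
  Writeln(lstrlenA(PAnsiChar(Ansi)));        // 3 characters
  // Same text, different storage size.
  Writeln(Length(Wide) * SizeOf(WideChar));  // 6 bytes
  Writeln(Length(Ansi) * SizeOf(AnsiChar));  // 3 bytes
end.
```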

Otherwise, the equivalent of your old code in Unicode Delphi would be:

var
  UnicodeStr: string;
  UTF8Str: string;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

WideString and string (UnicodeString) are similar, but the new UnicodeString is faster because it is reference-counted and WideString is not.

Your code was not correct because a UTF-8 string has a variable number of bytes per character. "A" is stored as one byte, which is just its ASCII code. "ü", on the other hand, is stored as two bytes. And because you are then casting to PWideChar, the function always expects two bytes per character.
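
The byte counts can be checked with a small console sketch like the one below. The program and variable names are illustrative only, and a Unicode Delphi (2009 or later) is assumed.

```pascal
program Utf8Lengths;
{$APPTYPE CONSOLE}
var
  Ascii, Umlaut: UnicodeString;
begin
  Ascii := 'A';
  // U+00FC is "ü"; written as a code point so the source file encoding does not matter.
  Umlaut := #$00FC;
  // Length of a UTF-8 encoded string counts bytes, not characters.
  Writeln(Length(UTF8Encode(Ascii)));   // 1
  Writeln(Length(UTF8Encode(Umlaut)));  // 2
  // A PWideChar consumer steps through memory in units of this size.
  Writeln(SizeOf(WideChar));            // 2
end.
```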

There is another difference. In older (ANSI) Delphi versions, UTF8String was just an AnsiString. In Unicode versions of Delphi, UTF8String is a string with the UTF-8 code page attached to it, so it behaves differently.
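
A minimal sketch of that difference, assuming a Unicode Delphi (2009 or later): because the UTF8String carries its code page, casting it to a UnicodeString decodes the UTF-8 instead of copying the bytes. The program and variable names are illustrative only.

```pascal
program Utf8RoundTrip;
{$APPTYPE CONSOLE}
var
  Source, Back: UnicodeString;
  Utf8: UTF8String;
begin
  // U+00FC is "ü"; written as a code point so the source file encoding does not matter.
  Source := #$00FC;
  Utf8 := UTF8String(Source);     // converts UTF-16 to UTF-8
  Back := UnicodeString(Utf8);    // converts UTF-8 back to UTF-16
  Writeln(StringCodePage(Utf8));  // 65001, the UTF-8 code page
  Writeln(Length(Utf8));          // 2 bytes
  Writeln(Back = Source);         // TRUE: the text survived, the raw bytes did not
end.
```

This automatic decoding is why the raw UTF-8 bytes do not end up in the wide string unless the conversion is bypassed.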

The old code would still work correctly:

var
  UnicodeStr: WideString;
  UTF8Str: WideString;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

It would act the same as it did in Delphi 2007. So maybe you have a problem elsewhere.

Mick, you are correct. The compiler does some extra work behind the scenes. To avoid this, you can do something like this:

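var
  UTF8Str: AnsiString;
  UnicodeStr: WideString;
  TempString: RawByteString;
  ResultString: WideString;
begin
  UnicodeStr := 'some unicode text';
  // Encode to UTF-8; the result carries the UTF-8 code page.
  TempString := UTF8Encode(UnicodeStr);
  // Copy the raw bytes into an AnsiString so that no code page conversion runs.
  SetLength(UTF8Str, Length(TempString));
  Move(TempString[1], UTF8Str[1], Length(UTF8Str));
  // The bytes are now widened as if they were ANSI text.
  ResultString := UTF8Str;
end;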

I checked, and it works just the same. Because I move the bytes directly in memory, no code page conversion is done in the background. I am sure it can be done with greater elegance, but the point is that this is how I would approach what you want to achieve.

Answered by The_Fox

Which Windows API call wants you to pass a UTF-8 string? It takes either an ANSI string or a WideString (A or W functions). WideStrings have two bytes per character, and UTF-8 strings have one (or more if you go beyond the first 128 ASCII characters).

UTF-8 in a WideString just doesn't make sense. If there really is a Windows function that wants a pointer to a UTF-8 string, you probably have to cast it to a PAnsiChar.
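
One Win32 function that does read UTF-8 through such a pointer is MultiByteToWideChar when called with CP_UTF8. The sketch below only asks it for the required output length; the program and variable names are illustrative, and a Unicode Delphi (2009 or later) is assumed.

```pascal
program Utf8ToApi;
{$APPTYPE CONSOLE}
uses
  Windows;
var
  Source: UnicodeString;
  Utf8: UTF8String;
  WideLen: Integer;
begin
  // "grün", with ü written as a code point so the source file encoding does not matter.
  Source := 'gr' + #$00FC + 'n';
  Utf8 := UTF8String(Source);
  // The UTF-8 bytes are handed over as a PAnsiChar plus a byte count.
  // Passing nil and 0 for the output buffer returns only the required length.
  WideLen := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(Utf8), Length(Utf8), nil, 0);
  Writeln(Length(Utf8));  // 5 bytes of UTF-8
  Writeln(WideLen);       // 4 UTF-16 code units
end.
```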
