Linux: Convert UTF-8 to UTF-16 using iconv

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/8923866/

Date: 2020-08-06 04:08:15 · Source: igfitidea

Convert UTF8 to UTF16 using iconv

Tags: linux, macos, unicode, command-line

Asked by PerfectGamesOnline.com

When I use iconv to convert from UTF-16 to UTF-8, all is fine, but the other direction does not work. I have these files:

a-16.strings:    Little-endian UTF-16 Unicode c program text
a-8.strings:     UTF-8 Unicode c program text, with very long lines

The text looks OK in an editor. When I run this:

iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings

Then I get this result:

b-16.strings:    data
a-16.strings:    Little-endian UTF-16 Unicode c program text
a-8.strings:     UTF-8 Unicode c program text, with very long lines

The file utility does not report the expected file format, and the text does not look right in an editor either. Could it be that iconv does not create a proper BOM? I am running it on the Mac command line.

Why is b-16 not in proper UTF-16LE format? Is there another way of converting UTF-8 to UTF-16?

More elaboration is below.

$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings 
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings

$ file *s
a-16.strings:                   Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings:                    UTF-8 Unicode c program text, with very long lines
b-16be.strings:                 Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings:    data


$ od -c a-16.strings | head
0000000  377 376   /  \0   *  \0   *  \0   *  \0      \0  \f 001   E  \0
...

$ od -c a-8.strings | head
0000000    /   *   *   *     304 214   E   S   K   Y       (   J   V   O
...

$ od -c b-16be.strings | head
0000000  376 377  \0   /  \0   *  \0   *  \0   *  \0     001  \f  \0   E
...

$ od -c b-16le-BAD-fromUTF16BE.strings | head
0000000    /  \0   *  \0   *  \0   *  \0      \0  \f 001   E  \0   S  \0
...

$ od -c b-16le-BAD-fromUTF8.strings | head
0000000    /  \0   *  \0   *  \0   *  \0      \0  \f 001   E  \0   S  \0
...

It is clear that the BOM is missing whenever I convert to UTF-16LE. Any help with this?

Accepted answer by Keith Thompson

UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.

UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.
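
The difference is easy to see by piping a short sample through iconv and inspecting the bytes with od (a quick sanity check; the byte order of the UTF-16 output depends on your machine and iconv implementation):

```shell
# UTF-16LE: raw little-endian code units, no BOM
printf 'A' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1   # 41 00

# UTF-16: the same text prefixed with a BOM
# (ff fe on a little-endian machine, fe ff on a big-endian one)
printf 'A' | iconv -f UTF-8 -t UTF-16 | od -An -tx1
```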

If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.

I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.
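
That round trip can be checked directly (a sketch using the file names from the question):

```shell
# Decode the BOM-less UTF-16LE file back to UTF-8 and compare it with
# the original: identical output means only the BOM was ever missing.
iconv -f UTF-16LE -t UTF-8 b-16.strings | diff - a-8.strings \
  && echo "round trip OK"
```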

Try running od -c on the files to see their actual contents.

UPDATE:

It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:

( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE

The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.

(Can anyone suggest a more elegant solution?)

Another workaround, if you know the endianness of the output produced by -t utf-16:

iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
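
dd's conv=swab operand (it is specified by POSIX) swaps each pair of input bytes, which is exactly the big-endian/little-endian flip for UTF-16 code units. A quick check:

```shell
# "AB" in UTF-16BE is 00 41 00 42; conv=swab swaps each byte pair,
# giving the UTF-16LE sequence 41 00 42 00.
printf 'AB' | iconv -f UTF-8 -t UTF-16BE | dd conv=swab 2>/dev/null \
  | od -An -tx1   # 41 00 42 00
```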

Answer by Adams

This may not be an elegant solution but I found a manual way to ensure correct conversion for my problem which I believe is similar to the subject of this thread.

The Problem: I got a text datafile from a user that I was going to process on Linux (specifically, Ubuntu) using a shell script (tokenization, splitting, etc.). Let's call the file myfile.txt. The first indication that something was amiss was that the tokenization was not working. So I was not surprised when I ran the file command on myfile.txt and got the following:

$ file myfile.txt
myfile.txt: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

If the file had been compliant, here is what that conversation should have looked like:

$ file myfile.txt
myfile.txt: ASCII text, with very long lines

The Solution: To make the datafile compliant, below are the 3 manual steps that I found to work, after some trial and error with other approaches.

  1. First convert to big-endian at the same encoding via vi (or vim): vi myfile.txt. In vi, do :set fileencoding=UTF-16BE, then write out the file. You may have to force it with :!wq.

  2. vi myfile.txt (which should now be in UTF-16BE). In vi, do :set fileencoding=ASCII, then write out the file. Again, you may have to force the write with !wq.

  3. Run the dos2unix converter: d2u myfile.txt. If you now run file myfile.txt, you should see an output that is more familiar and assuring, like:

    myfile.txt: ASCII text, with very long lines

That's it. That's what worked for me, and I was then able to run my bash shell script to process myfile.txt. I found that I cannot skip Step 2; that is, in this case I cannot skip directly to Step 3. Hopefully you find this info useful; hopefully someone can automate it, perhaps via sed or the like. Cheers.
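
One way the three vi steps might be automated, sticking to tools this thread already uses (a sketch; it assumes the input really is little-endian UTF-16 with CRLF line endings, as file reported above, and the file names are placeholders):

```shell
# Decode UTF-16LE to plain text, then strip the carriage returns that
# dos2unix would otherwise remove.
iconv -f UTF-16LE -t UTF-8 myfile.txt | tr -d '\r' > myfile.unix.txt
file myfile.unix.txt   # should now report ASCII or UTF-8 text
```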

Answer by Heath Borders

I first convert to UTF-16, which will prepend a byte-order mark if necessary, as Keith Thompson mentions. Then, since UTF-16 doesn't define endianness, we must use file to determine whether it's UTF-16BE or UTF-16LE. Finally, we can convert to UTF-16LE.

iconv -f utf-8 -t utf-16 UTF-8-FILE > UTF-16-UNKNOWN-ENDIANNESS-FILE
FILE_ENCODING="$( file --brief --mime-encoding UTF-16-UNKNOWN-ENDIANNESS-FILE )"
iconv -f "$FILE_ENCODING" -t UTF-16LE UTF-16-UNKNOWN-ENDIANNESS-FILE > UTF-16-FILE