Linux: How to convert \uXXXX unicode to UTF-8 using console tools in *nix

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/8795702/


How to convert \uXXXX unicode to UTF-8 using console tools in *nix

Tags: linux, json, unix, unicode, encoding

Asked by Krzysztof Wolny

I use curl to get some URL response; it's a JSON response and it contains unicode-escaped national characters like \u0144 (ń) and \u00f3 (ó).


How can I convert them to UTF-8 or any other encoding to save into a file?


Accepted answer by raphaelh

I don't know which distribution you are using, but uni2ascii should be included.


$ sudo apt-get install uni2ascii

It only depends on libc6, so it's a lightweight solution (uni2ascii i386 4.18-2 is 55,0 kB on Ubuntu)!


Then to use it:


$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó

Answered by Kevin

Might be a bit ugly, but echo -e should do it:


echo -en "$(curl $URL)"

-e interprets escapes, -n suppresses the newline echo would normally add.


Note: The \u escape works in the bash builtin echo, but not in /usr/bin/echo.


As pointed out in the comments, this requires bash 4.2+, and bash 4.2.x has a bug handling values in the 0x80-0xff range.


Answered by Keith Thompson

Assuming the \u is always followed by exactly 4 hex digits:


#!/usr/bin/perl

use strict;
use warnings;

binmode(STDOUT, ':utf8');

while (<>) {
    s/\\u([0-9a-fA-F]{4})/chr(hex($1))/eg;
    print;
}

The binmode puts standard output into UTF-8 mode. The s... command replaces each occurrence of \u followed by 4 hex digits with the corresponding character. The e suffix causes the replacement to be evaluated as an expression rather than treated as a string; the g says to replace all occurrences rather than just the first.

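For readers more comfortable with Python, the same single-substitution idea can be sketched like this (an equivalent illustration, not part of the original answer; the helper name is mine):

```python
import re

def unescape_bmp(s):
    # Replace each \uXXXX (exactly 4 hex digits) with the corresponding
    # character -- the same substitution the Perl script performs.
    return re.sub(r'\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1), 16)), s)

print(unescape_bmp(r'Character 1: \u0144, Character 2: \u00f3'))
# Character 1: ń, Character 2: ó
```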

You can save the above to a file somewhere in your $PATH (don't forget the chmod +x). It filters standard input (or one or more files named on the command line) to standard output.


Again, this assumes that the representation is always \u followed by exactly 4 hex digits. There are more Unicode characters than can be represented that way, but I'm assuming that \u12345 would denote the Unicode character 0x1234 (ETHIOPIC SYLLABLE SEE) followed by the digit 5.


In C syntax, a universal-character-name is either \u followed by exactly 4 hex digits, or \U followed by exactly 8 hexadecimal digits. I don't know whether your JSON responses use the same scheme. You should probably find out how (or whether) it encodes Unicode characters outside the Basic Multilingual Plane (the first 2^16 characters).

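If the stream did use the C scheme, a small extension of the same substitution would handle both forms (a sketch under that assumption; \U must be tried first so its 8 digits aren't half-consumed by the \u branch):

```python
import re

def expand_ucn(s):
    # C-style universal-character-names: \uXXXX (4 hex digits)
    # or \UXXXXXXXX (8 hex digits); the longer form is matched first.
    return re.sub(r'\\U([0-9a-fA-F]{8})|\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1) or m.group(2), 16)), s)

print(expand_ucn(r'\u0144 and \U0001F601'))
# ń and 😁
```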

Answered by Thanatos

Don't rely on regexes: JSON has some strange corner cases with \u escapes and non-BMP code points. (Specifically, JSON will encode one code point using two \u escapes.) If you assume one escape sequence translates to one code point, you're doomed on such text.


Using a full JSON parser from the language of your choice is considerably more robust:


$ echo '["foo bar \u0144\n"]' | python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'

That's really just feeding the data to this short python script:


import json
import sys

data = json.load(sys.stdin)
data = data[0] # change this to find your string in the JSON
sys.stdout.write(data.encode('utf-8'))

You can save this as foo.py and call it as curl ... | foo.py


An example that will break most of the other attempts in this question is "\ud83d\udca3":


% printf '"\ud83d\udca3"' | python2 -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'; echo

# echo will result in corrupt output:
% echo -e $(printf '"\ud83d\udca3"') 
"??????"
# native2ascii won't even try (this is correct for its intended use case, however, just not ours):
% printf '"\ud83d\udca3"' | native2ascii -encoding utf-8 -reverse
"\ud83d\udca3"

Answered by Krzysztof Wolny

I found native2ascii from the JDK to be the best way to do it:


native2ascii -encoding UTF-8 -reverse src.txt dest.txt

Detailed description is here: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/native2ascii.html


Update: No longer available since JDK 9: https://bugs.openjdk.java.net/browse/JDK-8074431

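With the tool gone from the JDK, a rough Python stand-in for native2ascii -encoding UTF-8 -reverse might look like the following (an approximation I'm sketching here, not an official replacement; the sample string is hypothetical, and note that unicode_escape does not recombine surrogate pairs, so only BMP escapes are covered):

```python
def reverse_escapes(text):
    # Decode \uXXXX (and \n, \t, ...) escapes in ASCII input, similar to
    # what `native2ascii -reverse` did. Surrogate pairs are NOT recombined.
    return text.encode('ascii').decode('unicode_escape')

print(reverse_escapes(r'Zelen\u00fd strom'))
# Zelený strom
```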

Answered by andrej

Use /usr/bin/printf "\u0160ini\u010di Ho\u0161i - A\u017e sa skon\u010d\u00ed zima" to get proper unicode-to-UTF-8 conversion.


Answered by Tanguy

iconv -f Unicode fullOrders.csv > fullOrders-utf8.csv

Answered by Smit Johnth

Now I have the best answer! Use jq.


Windows:


type in.json | jq . > out.json

Linux:


cat in.json | jq . > out.json

It's surely faster than any answer using perl/python. It pretty-prints the JSON and converts \uXXXX to UTF-8, and it can be used to do JSON queries too. Very nice tool!

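What jq does here can also be reproduced with Python's json module (an equivalent sketch; ensure_ascii=False is what turns \uXXXX escapes back into real UTF-8 characters, and the sample input is hypothetical):

```python
import json

raw = '["foo bar \\u0144", "\\ud83d\\udca3"]'  # sample escaped JSON input
data = json.loads(raw)                          # \uXXXX (and surrogate pairs) decoded here
out = json.dumps(data, ensure_ascii=False, indent=2)
print(out)  # contains ń and the real emoji instead of escapes
```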

Answered by Robin A. Meade

Use the b conversion specifier mandated by POSIX:


An additional conversion specifier character, b, shall be supported as follows. The argument shall be taken to be a string that can contain backslash-escape sequences.
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html


expand_escape_sequences() {
  printf '%b' "$1"
}

Test:


s='\u0160ini\u010di Ho\u0161i - A\u017e sa skon\u010d\u00ed zima A percent sign % OK?'
expand_escape_sequences "$s"

# output: Šiniči Hoši - Až sa skončí zima A percent sign % OK?

NOTE: If you remove the %b format specifier, the percent sign will cause an error like:


-bash: printf: `O': invalid format character

Tested successfully with both bash's builtin printf and /usr/bin/printf on my Linux distribution (Fedora 29).




UPDATE 2019-04-17: My solution assumed unicode escapes like \uxxxx and \Uxxxxxxxx; the latter is required for unicode characters beyond the BMP. However, the OP's question involved a JSON stream. JSON's unicode escape sequences use UTF-16, which requires surrogate pairs beyond the BMP.


Consider the unicode character 😁 ('GRINNING FACE WITH SMILING EYES', U+1F601). The \U escape sequence for this character is \U0001F601. You can print it using the POSIX-mandated %b specifier like so:


printf %b '\U0001F601'
# Prints 😁 as expected

However, in JSON the escape sequence for this character involves a UTF16 surrogate pair: \uD83D\uDE01


For manipulating JSON streams at the shell level, the jq tool is superb:


echo '["\uD83D\uDE01"]' | jq .
# Prints ["😁"] as expected

Thus I now withdraw my answer from consideration and endorse Smit Johnth's answer of using jq as the best answer.


Answered by Kay Marquardt

Preface: None of the promoted answers to this question solved a longstanding issue in telegram-bot-bash. Only the python solution from Thanatos worked!

This is because JSON will encode one code-point using two \u escapes

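That one code point becomes two \u escapes is easy to confirm with Python's json module (an illustration added here, not part of the original answer):

```python
import json

# A non-BMP character (U+1F601) serialises as a surrogate pair...
print(json.dumps('\U0001F601'))        # "\ud83d\ude01"

# ...and parsing recombines the pair into one code point again.
print(json.loads('"\\ud83d\\ude01"'))  # 😁
```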



Here you'll find two drop-in replacements for echo -e and printf '%s'.


PURE bash variant as a function. Paste it at the top of your script and use it to decode your JSON strings in bash:


#!/bin/bash
#
# pure bash implementation, done by KayM (@gnadelwartz)
# see https://stackoverflow.com/a/55666449/9381171
  JsonDecode() {
     local out="$1"
     local remain=""
     local regexp='(.*)\\u[dD]([0-9a-fA-F]{3})\\u[dD]([0-9a-fA-F]{3})(.*)'
     while [[ "${out}" =~ $regexp ]] ; do
           # match 2 \udxxx hex values, calculate new U, then split and replace
           local W1="$(( ( 0xd${BASH_REMATCH[2]} & 0x3ff) <<10 ))"
           local W2="$(( 0xd${BASH_REMATCH[3]} & 0x3ff ))"
           U="$(( ( W1 | W2 ) + 0x10000 ))"
           remain="$(printf '\U%8.8x' "${U}")${BASH_REMATCH[4]}${remain}"
           out="${BASH_REMATCH[1]}"
     done
     echo -e "${out}${remain}"
  }

# Some tests ===============
$ JsonDecode 'xxx \ud83d\udc25 xxxx' -> xxx 🐥 xxxx
$ JsonDecode '\ud83d\udc25' -> 🐥
$ JsonDecode '\u00e4 \u00e0 \u00f6 \u00f4 \u00fc \u00fb \ud83d\ude03 \ud83d\ude1a \ud83d\ude01 \ud83d\ude02 \ud83d\udc7c \ud83d\ude49 \ud83d\udc4e \ud83d\ude45 \ud83d\udc5d \ud83d\udc28 \ud83d\udc25 \ud83d\udc33 \ud83c\udf0f \ud83c\udf89 \ud83d\udcfb \ud83d\udd0a \ud83d\udcec \u2615 \ud83c\udf51'
ä à ö ô ü û 😃 😚 😁 😂 👼 🙉 👎 🙅 👝 🐨 🐥 🐳 🌏 🎉 📻 🔊 📬 ☕ 🍑

# decode 100x string with 25 JSON UTF-16 values
$ time for x in $(seq 1 100); do JsonDecode '\u00e4 \u00e0 \u00f6 \u00f4 \u00fc \u00fb \ud83d\ude03 \ud83d\ude1a \ud83d\ude01 \ud83d\ude02 \ud83d\udc7c \ud83d\ude49 \ud83d\udc4e \ud83d\ude45 \ud83d\udc5d \ud83d\udc28 \ud83d\udc25 \ud83d\udc33 \ud83c\udf0f \ud83c\udf89 \ud83d\udcfb \ud83d\udd0a \ud83d\udcec \u2615 \ud83c\udf51' >/dev/null ; done

real    0m2,195s
user    0m1,635s
sys     0m0,647s

MIXED solution with the Python variant from Thanatos:


# usage: JsonDecode "your bash string containing \uXXXX extracted from JSON"
 JsonDecode() {
     # wrap string in "", escape embedded " as \"
     printf '"%s"\n' "${1//\"/\\\"}" |\
     python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin).encode("utf-8"))'
 }




A test case for those who claim the other promoted solutions will work:


# test='😁 😘 ❤️ 😊 👍' from JSON
$ export test='\uD83D\uDE01 \uD83D\uDE18 \u2764\uFE0F \uD83D\uDE0A \uD83D\uDC4D'

$ printf '"%s\n"' "${test}" | python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin).encode("utf-8"))' >phyton.txt
$ echo -e "$test" >echo.txt

$ cat -v phyton.txt
M-pM-^_M-^XM-^A M-pM-^_M-^XM-^X M-bM-^]M-$M-oM-8M-^O M-pM-^_M-^XM-^J M-pM-^_M-^QM-^M

$ cat -v echo.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M

As you can easily see, the output is different. The other promoted solutions produce the same wrong output for JSON strings as echo -e:


$ ascii2uni -a U -q >uni2ascii.txt <<EOF
$test
EOF

$ cat -v uni2ascii.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M

$ printf "$test\n" >printf.txt
$ cat -v printf.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M

$ echo "$test" | iconv -f Unicode >iconf.txt                                                                                     

$ cat -v iconf.txt
M-gM-^UM-^\M-cM-!M-^DM-dM-^PM-3M-gM-^UM-^\M-dM-^UM-^DM-cM-^DM-0M-eM-0M- M-dM-^QM-5M-cM-^LM-8M-eM-1M-^DM-dM-^QM-5M-cM-^EM-^EM-bM-^@M-8M-gM-^UM-^\M-cM-^\M-2M-cM-^PM-6M-gM-^UM-^\M-dM-^UM-^FM-dM-^XM-0M-eM-0M- M-dM-^QM-5M-cM-^LM-8M-eM-1M-^DM-dM-^QM-5M-cM-^AM-^EM-bM-^AM-^AM-gM-^UM-^\M-cM-!M-^DM-dM-^PM-3M-gM-^UM-^\M-dM-^MM-^DM-dM-^PM-4r
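The M-mM- ... byte runs in the cat -v output above are the two surrogate halves each encoded as three UTF-8 bytes (which is invalid UTF-8), instead of the single four-byte sequence for the recombined code point. A quick Python check (added for illustration) shows the difference:

```python
# Encoding each surrogate half separately -- what echo -e effectively produces:
bad = '\ud83d\ude01'.encode('utf-8', 'surrogatepass')
print(bad)   # b'\xed\xa0\xbd\xed\xb8\x81'  -> cat -v shows M-mM- M-=M-mM-8M-^A

# The correct UTF-8 encoding of the recombined code point U+1F601:
good = '\U0001F601'.encode('utf-8')
print(good)  # b'\xf0\x9f\x98\x81'          -> cat -v shows M-pM-^_M-^XM-^A
```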