Linux 如何使用 *nix 中的控制台工具将 \uXXXX unicode 转换为 UTF-8
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/8795702/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
How to convert \uXXXX unicode to UTF-8 using console tools in *nix
提问by Krzysztof Wolny
I use curl to get some URL response; it's a JSON response and it contains unicode-escaped national characters like \u0144 (ń) and \u00f3 (ó).
我用 curl 获取某个 URL 的响应,它是 JSON 响应,其中包含 unicode 转义的国家字符,比如 \u0144 (ń) 和 \u00f3 (ó)。
How can I convert them to UTF-8 or any other encoding to save into a file?
如何将它们转换为 UTF-8 或任何其他编码,以便保存到文件中?
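For illustration, the situation looks roughly like this (the URL and JSON field below are made up, not from the original question):
$ curl -s 'https://example.com/api'
{"city": "\u0141\u00f3d\u017a"}
# desired result, saved as UTF-8:
{"city": "Łódź"}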
采纳答案by raphaelh
I don't know which distribution you are using, but uni2ascii should be included.
我不知道你使用的是哪个发行版,但 uni2ascii 应该包含在其中。
$ sudo apt-get install uni2ascii
It only depends on libc6, so it's a lightweight solution (uni2ascii i386 4.18-2 is 55,0 kB on Ubuntu)!
它只依赖于 libc6,所以它是一个轻量级的解决方案(uni2ascii i386 4.18-2 在 Ubuntu 上只有 55,0 kB)!
Then to use it:
然后使用它:
$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó
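To tie this back to the original curl use case (the URL variable here is just a placeholder), you can pipe the response straight through ascii2uni and redirect it to a file:
$ curl -s "$URL" | ascii2uni -a U -q > response-utf8.json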
回答by Kevin
Might be a bit ugly, but echo -e should do it:
可能有点难看,但 echo -e 应该可以做到:
echo -en "$(curl $URL)"
-e interprets escapes, -n suppresses the newline echo would normally add.
-e 解释转义,-n 抑制 echo 通常会添加的换行符。
Note: The \u escape works in the bash builtin echo, but not in /usr/bin/echo.
注意:\u 转义在 bash 内置的 echo 中有效,但在 /usr/bin/echo 中无效。
As pointed out in the comments, this is bash 4.2+, and 4.2.x have a bug handling 0x00ff/17 values (0x80-0xff).
正如评论中指出的,这是 bash 4.2+,而 4.2.x 有一个错误处理 0x00ff/17 值 (0x80-0xff)。
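A quick way to see the builtin-vs-external difference from the note above (output shown assumes a bash 4.2+ shell and a coreutils /usr/bin/echo, which does not know \u):
$ echo -e '\u00f3'          # bash builtin: decodes the escape
ó
$ /usr/bin/echo -e '\u00f3' # external echo: leaves it alone
\u00f3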
回答by Keith Thompson
Assuming the \u is always followed by exactly 4 hex digits:
假设 \u 后面总是恰好跟着 4 个十六进制数字:
#!/usr/bin/perl
use strict;
use warnings;
binmode(STDOUT, ':utf8');
while (<>) {
    s/\\u([0-9a-fA-F]{4})/chr(hex($1))/eg;
    print;
}
The binmode puts standard output into UTF-8 mode. The s... command replaces each occurrence of \u followed by 4 hex digits with the corresponding character. The e suffix causes the replacement to be evaluated as an expression rather than treated as a string; the g says to replace all occurrences rather than just the first.
binmode 把标准输出设置为 UTF-8 模式。s... 命令把每一处 \u 后跟 4 个十六进制数字的内容替换为对应的字符。e 后缀使替换部分作为表达式求值,而不是当作字符串处理;g 表示替换所有匹配项,而不仅仅是第一个。
You can save the above to a file somewhere in your $PATH (don't forget the chmod +x). It filters standard input (or one or more files named on the command line) to standard output.
你可以把上面的脚本保存到 $PATH 中的某个位置(不要忘记 chmod +x)。它把标准输入(或命令行上指定的一个或多个文件)过滤后写到标准输出。
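For example (the script name u2u8 below is just an illustrative choice, not from the answer):
$ chmod +x ~/bin/u2u8
$ curl -s "$URL" | u2u8 > response-utf8.json
$ u2u8 escaped.txt > decoded.txt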
Again, this assumes that the representation is always \u followed by exactly 4 hex digits. There are more Unicode characters than can be represented that way, but I'm assuming that \u12345 would denote the Unicode character 0x1234 (ETHIOPIC SYLLABLE SEE) followed by the digit 5.
同样,这假设表示形式总是 \u 后跟恰好 4 个十六进制数字。Unicode 字符比这种方式能表示的要多,但我假设 \u12345 表示的是 Unicode 字符 0x1234(ETHIOPIC SYLLABLE SEE)后跟数字 5。
In C syntax, a universal-character-name is either \u followed by exactly 4 hex digits, or \U followed by exactly 8 hexadecimal digits. I don't know whether your JSON responses use the same scheme. You should probably find out how (or whether) it encodes Unicode characters outside the Basic Multilingual Plane (the first 2^16 characters).
在 C 语法中,通用字符名(universal-character-name)要么是 \u 后跟恰好 4 个十六进制数字,要么是 \U 后跟恰好 8 个十六进制数字。我不知道你的 JSON 响应是否使用相同的方案。你可能应该弄清楚它如何(或是否)对基本多语言平面(前 2^16 个字符)之外的 Unicode 字符进行编码。
回答by Thanatos
Don't rely on regexes: JSON has some strange corner cases with \u escapes and non-BMP code points (specifically, JSON will encode one code point using two \u escapes). If you assume 1 escape sequence translates to 1 code point, you're doomed on such text.
不要依赖正则表达式:JSON 在 \u 转义和非 BMP 代码点上有一些奇怪的极端情况(具体来说,JSON 会用两个 \u 转义来编码一个代码点)。如果你假设 1 个转义序列对应 1 个代码点,那么处理这种文本时注定会出错。
Using a full JSON parser from the language of your choice is considerably more robust:
使用您选择的语言的完整 JSON 解析器要健壮得多:
$ echo '["foo bar \u0144\n"]' | python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'
That's really just feeding the data to this short python script:
这实际上只是将数据提供给这个简短的 Python 脚本:
import json
import sys
data = json.load(sys.stdin)
data = data[0] # change this to find your string in the JSON
sys.stdout.write(data.encode('utf-8'))
You can save this as foo.py and call it as curl ... | foo.py
你可以把它保存为 foo.py,然后这样调用:curl ... | foo.py
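Assuming you add a #!/usr/bin/env python shebang and chmod +x the script (details the answer doesn't spell out), the usage would look roughly like:
$ curl -s "$URL" | ./foo.py > response-utf8.txt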
An example that will break most of the other attempts in this question is "\ud83d\udca3":
一个会让这个问题里大多数其他尝试失效的例子是 "\ud83d\udca3":
% printf '"\ud83d\udca3"' | python2 -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'; echo
# echo will result in corrupt output:
% echo -e $(printf '"\ud83d\udca3"')
"??????"
# native2ascii won't even try (this is correct for its intended use case, however, just not ours):
% printf '"\ud83d\udca3"' | native2ascii -encoding utf-8 -reverse
"\ud83d\udca3"
回答by Krzysztof Wolny
I found native2ascii from the JDK to be the best way to do it:
我发现 JDK 中的 native2ascii 是最好的方法:
native2ascii -encoding UTF-8 -reverse src.txt dest.txt
Detailed description is here: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/native2ascii.html
详细说明在这里:http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/native2ascii.html
Update: No longer available since JDK9: https://bugs.openjdk.java.net/browse/JDK-8074431
更新:自 JDK9 起不再可用:https://bugs.openjdk.java.net/browse/JDK-8074431
回答by andrej
Use /usr/bin/printf "\u0160ini\u010di Ho\u0161i - A\u017e sa skon\u010d\u00ed zima" to get proper unicode-to-UTF-8 conversion.
使用 /usr/bin/printf "\u0160ini\u010di Ho\u0161i - A\u017e sa skon\u010d\u00ed zima" 可以得到正确的 unicode 到 UTF-8 的转换。
回答by Tanguy
iconv -f Unicode fullOrders.csv > fullOrders-utf8.csv
回答by Smit Johnth
Now I have the best answer! Use jq.
现在我有了最好的答案!使用 jq。
Windows:
视窗:
type in.json | jq > out.json
Lunix:
卢尼克斯:
cat in.json | jq > out.json
It's surely faster than any answer using perl/python. Without parameters it formats the JSON and converts \uXXXX to UTF-8. It can be used to do JSON queries too. Very nice tool!
它肯定比任何使用 perl/python 的答案都快。不带参数时,它会格式化 JSON 并把 \uXXXX 转换成 UTF-8。它也可以用来做 JSON 查询。非常好用的工具!
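A sketch of how this fits the original curl use case (the .name field below is a made-up example): without a filter jq just re-emits the document with the escapes decoded, and with -r it prints a raw string instead of a quoted JSON value.
$ curl -s "$URL" | jq . > out.json     # pretty-printed, \uXXXX decoded to UTF-8
$ curl -s "$URL" | jq -r '.name'       # extract a single (hypothetical) field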
回答by Robin A. Meade
Use the b conversion specifier mandated by POSIX:
使用 POSIX 规定的 b 转换说明符:
An additional conversion specifier character, b, shall be supported as follows. The argument shall be taken to be a string that can contain backslash-escape sequences.
— http://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html
应支持一个额外的转换说明符字符 b,如下所示。参数应被视为可以包含反斜杠转义序列的字符串。
— http://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html
expand_escape_sequences() {
    printf %b "$1"
}
Test:
测试:
s='\u0160ini\u010di Ho\u0161i - A\u017e sa skon\u010d\u00ed zima A percent sign % OK?'
expand_escape_sequences "$s"
# output: Šiniči Hoši - Až sa skončí zima A percent sign % OK?
NOTE: If you remove the %b format specifier, the percent sign will cause an error like:
注意:如果删除 %b 格式说明符,百分号将导致如下错误:
-bash: printf: `O': invalid format character
Tested successfully with both bash's builtin printf and /usr/bin/printf on my Linux distribution (Fedora 29).
在我的 Linux 发行版(Fedora 29)上,bash 内置的 printf 和 /usr/bin/printf 都测试成功。
UPDATE 2019-04-17: My solution assumed unicode escapes like \uxxxx and \Uxxxxxxxx, the latter being required for unicode characters beyond the BMP. However, the OP's question involved a JSON stream. JSON's unicode escape sequences use UTF-16, which needs surrogate pairs for characters beyond the BMP.
2019 年 4 月 17 日更新:我的解决方案假定 unicode 转义形如 \uxxxx 和 \Uxxxxxxxx,后者用于 BMP 之外的 unicode 字符。但是,OP 的问题涉及的是 JSON 流。JSON 的 unicode 转义序列使用 UTF-16,对 BMP 之外的字符需要代理对。
Consider unicode character 😁 ('GRINNING FACE WITH SMILING EYES' (U+1F601)). The \U escape sequence for this character is \U0001F601. You can print it using the POSIX-mandated %b specifier like so:
考虑 unicode 字符 😁('GRINNING FACE WITH SMILING EYES' (U+1F601))。此字符的 \U 转义序列是 \U0001F601。你可以像下面这样用 POSIX 规定的 %b 说明符打印它:
printf %b '\U0001F601'
# Prints 😁 as expected
However, in JSON the escape sequence for this character involves a UTF16 surrogate pair: \uD83D\uDE01
但是,在 JSON 中,此字符的转义序列涉及 UTF16 代理对: \uD83D\uDE01
For manipulating JSON streams at the shell level, the jq tool is superb:
对于在 shell 层面操作 JSON 流,jq 工具非常棒:
echo '["\uD83D\uDE01"]' | jq .
# Prints [""] as expected
Thus I now withdraw my answer from consideration and endorse Smit Johnth's answer of using jq as the best answer.
因此,我现在撤回我的答案,并认可 Smit Johnth 使用 jq 的答案作为最佳答案。
回答by Kay Marquardt
Preface: None of the promoted answers to this question solved a longstanding issue in telegram-bot-bash. Only the python solution from Thanatos worked! This is because JSON will encode one code point using two \u escapes.
前言:这个问题下的各个高赞答案都没有解决 telegram-bot-bash 中一个长期存在的问题。只有来自 Thanatos 的 python 解决方案有效!这是因为 JSON 会用两个 \u 转义来编码一个代码点。
Here you'll find two drop-in replacements for echo -e and printf '%s'.
在这里,你会找到 echo -e 和 printf '%s' 的两个直接替代品。
PURE bash variant as a function. Paste it at the top of your script and use it to decode your JSON strings in bash:
纯 bash 变体,以函数形式给出。把它粘贴到你的脚本顶部,用它在 bash 中解码 JSON 字符串:
#!/bin/bash
#
# pure bash implementation, done by KayM (@gnadelwartz)
# see https://stackoverflow.com/a/55666449/9381171
JsonDecode() {
    local out="$1"
    local remain=""
    local regexp='(.*)\\u[dD]([0-9a-fA-F]{3})\\u[dD]([0-9a-fA-F]{3})(.*)'
    while [[ "${out}" =~ $regexp ]] ; do
        # match 2 \udxxx hex values, calculate new U, then split and replace
        local W1="$(( ( 0xd${BASH_REMATCH[2]} & 0x3ff) <<10 ))"
        local W2="$(( 0xd${BASH_REMATCH[3]} & 0x3ff ))"
        U="$(( ( W1 | W2 ) + 0x10000 ))"
        # build a literal \UXXXXXXXX escape; the final echo -e expands it to UTF-8
        remain="$(printf '\U%8.8x' "${U}")${BASH_REMATCH[4]}${remain}"
        out="${BASH_REMATCH[1]}"
    done
    echo -e "${out}${remain}"
}
# Some tests ===============
$ JsonDecode 'xxx \ud83d\udc25 xxxx' -> xxx 🐥 xxxx
$ JsonDecode '\ud83d\udc25' -> 🐥
$ JsonDecode '\u00e4 \u00e0 \u00f6 \u00f4 \u00fc \u00fb \ud83d\ude03 \ud83d\ude1a \ud83d\ude01 \ud83d\ude02 \ud83d\udc7c \ud83d\ude49 \ud83d\udc4e \ud83d\ude45 \ud83d\udc5d \ud83d\udc28 \ud83d\udc25 \ud83d\udc33 \ud83c\udf0f \ud83c\udf89 \ud83d\udcfb \ud83d\udd0a \ud83d\udcec \u2615 \ud83c\udf51'
ä à ö ô ü û 😃 😚 😁 😂 👼 🙉 👎 🙅 👝 🐨 🐥 🐳 🌏 🎉 📻 🔊 📬 ☕ 🍑
# decode 100x string with 25 JSON UTF-16 values
$ time for x in $(seq 1 100); do JsonDecode '\u00e4 \u00e0 \u00f6 \u00f4 \u00fc \u00fb \ud83d\ude03 \ud83d\ude1a \ud83d\ude01 \ud83d\ude02 \ud83d\udc7c \ud83d\ude49 \ud83d\udc4e \ud83d\ude45 \ud83d\udc5d \ud83d\udc28 \ud83d\udc25 \ud83d\udc33 \ud83c\udf0f \ud83c\udf89 \ud83d\udcfb \ud83d\udd0a \ud83d\udcec \u2615 \ud83c\udf51' >/dev/null ; done
real 0m2,195s
user 0m1,635s
sys 0m0,647s
MIXED solution with the Python variant from Thanatos:
来自 Thanatos 的 Python 变体的混合解决方案:
# usage: JsonDecode "your bash string containing \uXXXX extracted from JSON"
JsonDecode() {
    # wrap string in "", replace " by \"
    printf '"%s"\n' "${1//\"/\\\"}" |\
    python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin).encode("utf-8"))'
}
A test case for those who claim the other promoted solutions will work:
给那些声称其他高赞方案可行的人的测试用例:
# test='😁 😘 ❤️ 😊 👍' from JSON
$ export test='\uD83D\uDE01 \uD83D\uDE18 \u2764\uFE0F \uD83D\uDE0A \uD83D\uDC4D'
$ printf '"%s\n"' "${test}" | python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin).encode("utf-8"))' >phyton.txt
$ echo -e "$test" >echo.txt
$ cat -v phyton.txt
M-pM-^_M-^XM-^A M-pM-^_M-^XM-^X M-bM-^]M-$M-oM-8M-^O M-pM-^_M-^XM-^J M-pM-^_M-^QM-^M
$ cat -v echo.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M
As you can easily see, the output is different. The other promoted solutions provide the same wrong output for JSON strings as echo -e:
你可以很容易地看到,输出是不同的。其他高赞方案对 JSON 字符串给出的错误输出与 echo -e 相同:
$ ascii2uni -a U -q >uni2ascii.txt <<EOF
$test
EOF
$ cat -v uni2ascii.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M
$ printf "$test\n" >printf.txt
$ cat -v printf.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M
$ echo "$test" | iconv -f Unicode >iconf.txt
$ cat -v iconf.txt
M-gM-^UM-^\M-cM-!M-^DM-dM-^PM-3M-gM-^UM-^\M-dM-^UM-^DM-cM-^DM-0M-eM-0M- M-dM-^QM-5M-cM-^LM-8M-eM-1M-^DM-dM-^QM-5M-cM-^EM-^EM-bM-^@M-8M-gM-^UM-^\M-cM-^\M-2M-cM-^PM-6M-gM-^UM-^\M-dM-^UM-^FM-dM-^XM-0M-eM-0M- M-dM-^QM-5M-cM-^LM-8M-eM-1M-^DM-dM-^QM-5M-cM-^AM-^EM-bM-^AM-^AM-gM-^UM-^\M-cM-!M-^DM-dM-^PM-3M-gM-^UM-^\M-dM-^MM-^DM-dM-^PM-4r