JavaScript: How do I decode a string with escaped Unicode?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/7885096/

Time: 2020-08-24 03:57:07  Source: igfitidea

How do I decode a string with escaped unicode?

javascript decode urldecode

Asked by styfle

I'm not sure what this is called, so I'm having trouble searching for it. How can I decode a string with unicode from http\u00253A\u00252F\u00252Fexample.com to http://example.com with JavaScript? I tried unescape, decodeURI, and decodeURIComponent, so I guess the only thing left is string replace.

EDIT: The string is not typed, but rather a substring from another piece of code. So to solve the problem you have to start with something like this:

var s = 'http\\u00253A\\u00252F\\u00252Fexample.com';

I hope that shows why unescape() doesn't work.
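The difference can be seen in a small sketch, assuming the string arrives with literal backslashes as described in the edit. The variable names `typed` and `raw` are made up for illustration.

```javascript
// Typed in source code: the JavaScript parser resolves \u0025 to "%" itself.
const typed = 'http\u00253A\u00252F\u00252Fexample.com';
// Taken from other text: the backslashes are real characters in the string.
const raw = 'http\\u00253A\\u00252F\\u00252Fexample.com';

console.log(typed);                           // http%3A%2F%2Fexample.com
console.log(decodeURIComponent(typed));       // http://example.com
// The raw string contains no "%" at all, so URL decoding leaves it unchanged.
console.log(decodeURIComponent(raw) === raw); // true
```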

Answer by Ioannis Karadimas

UPDATE: Please note that this solution is meant for older browsers or non-browser platforms, and is kept for instructional purposes. Please refer to @radicand's answer below for a more up-to-date answer.

This is a Unicode-escaped string. The string was first percent-escaped, and then each percent sign was written as the Unicode escape \u0025. To convert it back to normal:

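// The backslashes are doubled so the string really contains the six characters \u0025.
var x = "http\\u00253A\\u00252F\\u00252Fexample.com";
// Match a literal \u followed by four hex digits; the digits are captured as group 1.
var r = /\\u([\da-f]{4})/gi;
x = x.replace(r, function (match, grp) {
    // Convert the hex digits to a number, then to the character with that code.
    return String.fromCharCode(parseInt(grp, 16));
});
console.log(x);  // http%3A%2F%2Fexample.com
x = unescape(x);
console.log(x);  // http://example.com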

To explain: I use a regular expression to look for \u0025. However, since I need only a part of this string for my replace operation, I use parentheses to isolate the part I'm going to reuse, 0025. This isolated part is called a group.

The gi part at the end of the expression denotes that it should match all instances in the string, not just the first one, and that the matching should be case insensitive. This might look unnecessary given the example, but it adds versatility.

Now, to convert from one string to the next, I need to execute some steps on each group of each match, and I can't do that by simply transforming the string. Helpfully, the String.replace operation can accept a function, which will be executed for each match. The return value of that function will replace the match itself in the string.

I use the second parameter this function accepts, which is the group I need, and convert it to the character with that code. Then I use the built-in unescape function to decode the string to its proper form.
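The same steps can be wrapped in a reusable helper. This is only a sketch: the function name `decodeUnicodeEscapes` is hypothetical, and `decodeURIComponent` is swapped in for `unescape` in the final step, which assumes the percent-escapes encode valid UTF-8.

```javascript
// Hypothetical helper: replaces every literal \uXXXX with the character it names.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-f]{4})/gi, function (match, hex) {
    // hex is the captured group, for example "0025".
    return String.fromCharCode(parseInt(hex, 16));
  });
}

const percentEncoded = decodeUnicodeEscapes('http\\u00253A\\u00252F\\u00252Fexample.com');
console.log(percentEncoded);                     // http%3A%2F%2Fexample.com
console.log(decodeURIComponent(percentEncoded)); // http://example.com
```

Keeping the two stages separate makes it easy to inspect the intermediate percent-encoded string when the input is not what you expect.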

Answer by radicand

Edit (2017-10-12):

@MechaLynx and @Kevin-Weber note that unescape() is deprecated in non-browser environments and does not exist in TypeScript. decodeURIComponent is a drop-in replacement. For broader compatibility, use the below instead:

decodeURIComponent(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'

Original answer:

unescape(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'

You can offload all the work to JSON.parse.

Answer by Kevin Weber

Note that the use of unescape() is deprecated and doesn't work with the TypeScript compiler, for example.

Based on radicand's answer and the comments section below, here's an updated solution:

var string = "http\\u00253A\\u00252F\\u00252Fexample.com";
decodeURIComponent(JSON.parse('"' + string.replace(/\"/g, '\\"') + '"'));

http://example.com

Answer by aamarks

I don't have enough rep to put this under comments to the existing answers:

unescape is only deprecated for working with URIs (or any encoded UTF-8), which is probably what most people need. encodeURIComponent converts a JS string to escaped UTF-8, and decodeURIComponent only works on escaped UTF-8 bytes. It throws an error for something like decodeURIComponent('%a9'); // error, because extended ASCII isn't valid UTF-8 (even though that's still a Unicode value), whereas unescape('%a9'); // © works. So you need to know your data when using decodeURIComponent.

decodeURIComponent won't work on "%C2" or any lone byte over 0x7f, because in UTF-8 that indicates part of a multi-byte sequence. However, decodeURIComponent("%C2%A9") // gives you ©. unescape wouldn't work properly on that (// Â©), and it wouldn't throw an error, so unescape can lead to buggy code if you don't know your data.
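The behaviours described above can be checked with a short sketch. The helper name `tryDecode` is made up for illustration; it only wraps decodeURIComponent so that a thrown error can be printed instead of stopping the script.

```javascript
// Hypothetical helper: returns the decoded string, or the error name if decoding throws.
function tryDecode(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (error) {
    return error.name;
  }
}

console.log(tryDecode('%C2%A9')); // © (a complete two-byte UTF-8 sequence)
console.log(tryDecode('%a9'));    // URIError (a lone continuation byte)
console.log(tryDecode('%C2'));    // URIError (a lead byte with nothing after it)

// unescape maps each %XX straight to the code unit XX and never throws here.
console.log(unescape('%a9') === '\u00A9');          // true
console.log(unescape('%C2%A9') === '\u00C2\u00A9'); // true: two characters, not one
```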

Answer by Ian

Using JSON.parse for this comes with significant drawbacks that you must be aware of:

  • You must wrap the string in double quotes
  • Many characters are not supported and must be escaped themselves. For example, passing any of the following to JSON.parse (after wrapping them in double quotes) will error even though these are all valid: \\n, \n, \\0, a"a
  • It does not support hexadecimal escapes: \\x45
  • It does not support Unicode code point sequences: \\u{045}

There are other caveats as well. Essentially, using JSON.parse for this purpose is a hack and doesn't always work the way you might expect. You should stick with using the JSON library to handle JSON, not for string operations.
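A few of these failure modes can be reproduced with a short sketch. The helper `parseOrError` is hypothetical; it wraps its argument in double quotes, as the JSON approach requires, and reports the error name instead of throwing.

```javascript
// Hypothetical helper: wraps the text in double quotes and hands it to JSON.parse.
function parseOrError(body) {
  try {
    return JSON.parse('"' + body + '"');
  } catch (error) {
    return error.name;
  }
}

console.log(parseOrError('\\u0045'));  // E (four-digit escapes are valid JSON)
console.log(parseOrError('\\x45'));    // SyntaxError (no hexadecimal escapes in JSON)
console.log(parseOrError('\\u{45}'));  // SyntaxError (no code point escapes in JSON)
console.log(parseOrError('\\0'));      // SyntaxError (no \0 escape in JSON)
console.log(parseOrError('a"a'));      // SyntaxError (unescaped double quote)
```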

I recently ran into this issue myself and wanted a robust decoder, so I ended up writing one myself. It's complete and thoroughly tested and is available here: https://github.com/iansan5653/unraw. It mimics the JavaScript standard as closely as possible.

Explanation:

The source is about 250 lines so I won't include it all here, but essentially it uses the following regex to find all escape sequences, parses them using parseInt(string, 16) to decode the base-16 numbers, and then uses String.fromCodePoint(number) to get the corresponding character:

/\\(?:(\\)|x([\s\S]{0,2})|u(\{[^}]*\}?)|u([\s\S]{4})\\u([^{][\s\S]{0,3})|u([\s\S]{0,4})|([0-3]?[0-7]{1,2})|([\s\S])|$)/g

Commented (NOTE: This regex matches all escape sequences, including invalid ones. If the string would throw an error in JS, it throws an error in my library [i.e., '\x!!' will error]):

/
\\ # All escape sequences start with a backslash
(?: # Starts a group of 'or' statements
(\\) # If a second backslash is encountered, stop there (it's an escaped slash)
| # or
x([\s\S]{0,2}) # Match valid hexadecimal sequences
| # or
u(\{[^}]*\}?) # Match valid code point sequences
| # or
u([\s\S]{4})\\u([^{][\s\S]{0,3}) # Match surrogate code points which get parsed together
| # or
u([\s\S]{0,4}) # Match non-surrogate Unicode sequences
| # or
([0-3]?[0-7]{1,2}) # Match deprecated octal sequences
| # or
([\s\S]) # Match anything else ('.' doesn't match newlines)
| # or
$ # Match the end of the string
) # End the group of 'or' statements
/g # Match as many instances as there are
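To illustrate just the parseInt and String.fromCodePoint step, here is a much smaller sketch. It is not the library's implementation: the function name `decodeBasicEscapes` and its simplified regex are assumptions made for this example, and it handles only well-formed \xHH, \uHHHH and \u{H...} sequences, ignoring escaped backslashes, octal escapes and error reporting.

```javascript
// Hypothetical, simplified decoder: only well-formed hex-based escapes are handled.
function decodeBasicEscapes(raw) {
  const pattern = /\\(?:u\{([0-9a-f]{1,6})\}|u([0-9a-f]{4})|x([0-9a-f]{2}))/gi;
  return raw.replace(pattern, function (match, braced, fourDigit, twoDigit) {
    // Exactly one of the three groups is defined for any given match.
    const hex = braced || fourDigit || twoDigit;
    // fromCodePoint throws a RangeError for values above 0x10FFFF.
    return String.fromCodePoint(parseInt(hex, 16));
  });
}

console.log(decodeBasicEscapes('\\x45'));    // E
console.log(decodeBasicEscapes('\\u0025'));  // %
console.log(decodeBasicEscapes('\\u{1F600}') === '\u{1F600}');     // true
// Two \uHHHH surrogate halves decode to adjacent code units, forming one character.
console.log(decodeBasicEscapes('\\ud83d\\ude00') === '\u{1F600}'); // true
```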

Example

Using that library:

import unraw from "unraw";

let step1 = unraw('http\\u00253A\\u00252F\\u00252Fexample.com');
// yields "http%3A%2F%2Fexample.com"
// Then you can use decodeURIComponent to further decode it:
let step2 = decodeURIComponent(step1);
// yields http://example.com