How do I convert special UTF-8 chars to their iso-8859-1 equivalent using javascript?

Source: Stack Overflow, http://stackoverflow.com/questions/5396560/. The content is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors.

Tags: javascript, jquery, character-encoding

Asked by Hobhouse

I'm making a javascript app which retrieves .json files with jquery and injects data into the webpage it is embedded in.

The .json files are encoded with UTF-8 and contain accented chars like é, ö and å.

The problem is that I don't control the charset on the pages that are going to use the app.

Some will be using UTF-8, but others will be using the iso-8859-1 charset. This will of course garble the special chars from the .json files.

How do I convert special UTF-8 chars to their iso-8859-1 equivalent using javascript?

Answered by nitro2k01

Actually, everything is typically stored as Unicode of some kind internally, but let's not go into that. I'm assuming you're getting the iconic "Ã¥Ã¤Ã¶" type strings because you're using ISO-8859-1 as your character encoding. There's a trick you can do to convert those characters. The escape and unescape functions used for encoding and decoding query strings are defined for ISO characters, whereas the newer encodeURIComponent and decodeURIComponent, which do the same thing, are defined for UTF8 characters.

escape encodes extended ISO-8859-1 characters (Unicode code points U+0080-U+00FF) as %xx (two-digit hex), whereas it encodes code points U+0100 and above as %uxxxx (%u followed by four-digit hex). For example, escape("å") == "%E5" and escape("あ") == "%u3042".

encodeURIComponent percent-encodes extended characters as a UTF8 byte sequence. For example, encodeURIComponent("å") == "%C3%A5" and encodeURIComponent("あ") == "%E3%81%82".
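To make the difference concrete, here is a small check of the two encoders side by side. It is only a sketch; the variable names are made up for illustration, and it assumes an environment where the legacy escape global is still available (current browsers and Node.js).

```javascript
// escape() is a legacy global: Latin-1 range -> %xx, everything above -> %uxxxx.
const legacy = [escape("å"), escape("あ")];

// encodeURIComponent() percent-encodes the UTF-8 bytes of each character.
const utf8 = [encodeURIComponent("å"), encodeURIComponent("あ")];

console.log(legacy); // [ '%E5', '%u3042' ]
console.log(utf8);   // [ '%C3%A5', '%E3%81%82' ]
```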

So you can do:

fixedstring = decodeURIComponent(escape(utfstring));

For example, an incorrectly encoded character "å" becomes "Ã¥". The command does escape("Ã¥") == "%C3%A5", which is the two incorrect ISO characters encoded as single bytes. Then decodeURIComponent("%C3%A5") == "å", where the two percent-encoded bytes are interpreted as a UTF8 sequence.

If you'd need to do the reverse for some reason, that works too:

utfstring = unescape(encodeURIComponent(originalstring));
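Since escape and unescape are legacy functions, the same two conversions can also be sketched with TextEncoder and TextDecoder. This is an alternative, not part of the answer above: the function names are hypothetical, and it assumes every code unit of the garbled input is at most 0xFF (larger values would be truncated when written into the byte array).

```javascript
// Treat each code unit of the garbled string as one byte, then decode as UTF-8.
// Rough equivalent of decodeURIComponent(escape(garbled)).
function latin1ToUtf8String(garbled) {
  const bytes = Uint8Array.from(garbled, (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}

// Encode as UTF-8, then expose each byte as one code unit.
// Rough equivalent of unescape(encodeURIComponent(text)).
function utf8ToByteString(text) {
  const bytes = new TextEncoder().encode(text);
  let out = "";
  for (const b of bytes) out += String.fromCharCode(b);
  return out;
}

console.log(latin1ToUtf8String("Ã¥")); // "å"
console.log(utf8ToByteString("å"));    // "Ã¥"
```

One difference worth knowing: a TextDecoder created without the fatal option replaces malformed sequences with U+FFFD instead of throwing, so it cannot be used for the error-based detection trick as written.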

Is there a way to differentiate between bad UTF8 strings and ISO strings? It turns out there is. The decodeURIComponent function used above will throw an error if given a malformed encoded sequence. We can use this to detect, with high probability, whether our string is UTF8 or ISO.

var fixedstring;

try{
    // If the string is UTF-8, this will work and not throw an error.
    fixedstring=decodeURIComponent(escape(badstring));
}catch(e){
    // If it isn't, an error will be thrown, and we can assume that we have an ISO string.
    fixedstring=badstring;
}
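The same try/catch can be wrapped in a helper and exercised with a few inputs. This is a sketch; repairIfUtf8 is a hypothetical name, not something from the answer.

```javascript
// Returns the repaired string when the input looks like UTF-8 bytes read as
// Latin-1, and the input unchanged otherwise.
function repairIfUtf8(input) {
  try {
    return decodeURIComponent(escape(input));
  } catch (e) {
    // URIError: the escaped form is not a valid UTF-8 sequence.
    return input;
  }
}

console.log(repairIfUtf8("Ã¥"));  // "å"   (valid UTF-8 sequence C3 A5)
console.log(repairIfUtf8("å"));   // "å"   ("%E5" alone is malformed, so unchanged)
console.log(repairIfUtf8("abc")); // "abc" (plain ASCII passes through)
```

The detection is only probabilistic: a genuine ISO string that happens to contain the two characters "Ã¥" would also be converted.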

Answered by Diodeus - James MacFarlane

The problem is that once the page is served up, the content is going to be in the encoding described in the content-type meta tag. The content in "wrong" encoding is already garbled.

You're best to do this on the server before serving up the page. Or as I have been known to say: UTF-8 end-to-end or die.

Answered by Eldelshell

Since the question on how to convert from ISO-8859-1 to UTF-8 is closed because of this one, I'm going to post my solution here.

The problem is that when you try to GET anything by using XMLHttpRequest, if the XMLHttpRequest.responseType is "text" or empty, the XMLHttpRequest.response is transformed to a DOMString, and that's where things break. After that, it's almost impossible to reliably work with that string.

Now, if the content from the server is ISO-8859-1, you'll have to force the response to be of type "Blob" and later convert this to a DOMString. For example:

var ajax = new XMLHttpRequest();
ajax.open('GET', url, true);
ajax.responseType = 'blob';
ajax.onreadystatechange = function(){
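    // ... (other handling elided in the original)
    // Only read the response once the request has completed
    if(ajax.readyState === 4 && ajax.responseType === 'blob'){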
        // Convert the blob to a string
        var reader = new window.FileReader();
        reader.addEventListener('loadend', function() {
           // For ISO-8859-1 there's no further conversion required
           Promise.resolve(reader.result);
        });
        reader.readAsBinaryString(ajax.response);
    }
};
ajax.send(); // the request is not sent until send() is called

Seems like the magic is happening in readAsBinaryString, so maybe someone can shed some light on why this works.
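A likely explanation: readAsBinaryString yields a string in which every byte becomes one character with a code from 0 to 255, and ISO-8859-1 byte values coincide with the first 256 Unicode code points, so the result is the ISO-8859-1 text without any further decoding. The sketch below mimics that mapping on a plain byte array; bytesToBinaryString is a hypothetical helper, not a browser API.

```javascript
// Map each byte to the character with the same code point (0-255).
function bytesToBinaryString(bytes) {
  let out = "";
  for (const b of bytes) {
    out += String.fromCharCode(b);
  }
  return out;
}

// "café" encoded as ISO-8859-1: é is the single byte 0xE9.
const latin1Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);
console.log(bytesToBinaryString(latin1Bytes)); // "café"
```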

Answered by Martijn

Internally, Javascript strings are all Unicode (actually UCS-2, a subset of UTF-16).

If you're retrieving the JSON files separately via AJAX, then you only need to make sure that the JSON files are served with the correct Content-Type and charset (Content-Type: application/json; charset="utf-8"). If you do that, jQuery should already have interpreted them properly by the time you access the deserialized objects.
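While debugging, it can help to check which charset the server actually declares. The helper below is a sketch under assumptions: charsetOf is a hypothetical name, and the header string would come from something like the request's getResponseHeader('Content-Type') call.

```javascript
// Extract the charset parameter from a Content-Type header value, or null.
function charsetOf(contentType) {
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || "");
  return match ? match[1].toLowerCase() : null;
}

console.log(charsetOf('application/json; charset="utf-8"')); // "utf-8"
console.log(charsetOf("text/html; charset=ISO-8859-1"));     // "iso-8859-1"
console.log(charsetOf("application/json"));                  // null
```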

Could you post an example of the code you're using to retrieve the JSON objects?

Answered by Jose Solorzano

There are libraries that do charset conversion in Javascript. But if you want something simple, this function does approximately what you want:

function stringToBytes(text) {
  const length = text.length;
  const result = new Uint8Array(length);
  for (let i = 0; i < length; i++) {
    const code = text.charCodeAt(i);
    const byte = code > 255 ? 32 : code;
    result[i] = byte;
  }
  return result;
}

If you want to convert the resulting byte array into a Blob, you would do something like this:

const originalString = 'åäö'; // any text within the ISO-8859-1 range
const bytes = stringToBytes(originalString);
const blob = new Blob([bytes.buffer], { type: 'text/plain; charset=ISO-8859-1' });
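Because the conversion function replaces anything above U+00FF with a space without reporting it, it may be worth checking beforehand which characters would be lost. A minimal sketch, with findNonLatin1 as a hypothetical helper name:

```javascript
// List the characters that have no ISO-8859-1 equivalent (code point > 0xFF).
function findNonLatin1(text) {
  const missing = [];
  for (const ch of text) { // iterates by code point, not by UTF-16 code unit
    if (ch.codePointAt(0) > 0xff) {
      missing.push(ch);
    }
  }
  return missing;
}

console.log(findNonLatin1("café あ€")); // [ 'あ', '€' ]
console.log(findNonLatin1("åäö"));     // []
```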

Now, keep in mind that some apps do accept UTF-8 encoding, but they can't guess the encoding unless you prepend a BOM character, as explained here.
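For the UTF-8 route, prepending the BOM amounts to putting the three bytes EF BB BF in front of the encoded text. The following is a sketch assuming TextEncoder is available; utf8WithBom is a hypothetical name.

```javascript
// Encode text as UTF-8 and prepend the byte order mark (EF BB BF).
function utf8WithBom(text) {
  const body = new TextEncoder().encode(text); // UTF-8 bytes of the text
  const out = new Uint8Array(body.length + 3);
  out.set([0xef, 0xbb, 0xbf], 0); // BOM first
  out.set(body, 3);               // then the encoded text
  return out;
}

console.log(Array.from(utf8WithBom("é"))); // [ 239, 187, 191, 195, 169 ]
```

The resulting array can be passed to the Blob constructor in the same way as the ISO-8859-1 bytes, with the charset in the type set to UTF-8.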

Answered by user3309074

You should add this line at the top of your page:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />